A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain
IEEE Transactions on Knowledge and Data Engineering, 2008

Ozer Ozdikis
Huseyin Candan

Extraction of user profiles using
◦ Web usage data
◦ Web site hierarchy
◦ External data, etc.
Evolution of user profiles over time
◦ Introducing new profiles, killing invalid ones
◦ Validating the profile evolution

1. Introduction
   ◦ Web Usage Mining
   ◦ Handling Profile Evolution
   ◦ Integrating Semantics
2. Profile Discovery based on Usage
   ◦ Preprocessing
   ◦ Clustering sessions
   ◦ Similarity measure in clustering
   ◦ Post processing and Enrichment
3. Profile Evolution
   ◦ Tracking the Evolving User Profiles
   ◦ Validating the Profile Evolution
4. Experimental Results

Key features of the paper
◦ Dynamic content (a portal for companies)
◦ Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
◦ Session similarity is calculated using the navigated URLs and the website hierarchy (from the URL and the site taxonomy)
◦ Generate mass user profiles
◦ Repeat this generation periodically
◦ Track the changes between the previous profiles and the new profiles, and evaluate their evolution

Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
Web Usage Mining has been used for personalization, predicting navigation patterns, building data cubes to apply OLAP, etc.

Previous studies related to evolution
◦ Machine learning based (another dimension for learning evolving concepts)
◦ Time-based forgetting approaches
◦ Separate user profiles for short-term and long-term interests

Some concepts related to profile evolution
◦ Evolutionary / Revolutionary / Hybrid learning, regarding the adaptation to change
◦ No-memory / Partial-memory / Full-memory
◦ Supervised / Unsupervised
◦ Single user / Mass user

For user modeling, web usage data can be supported with
◦ Keywords representing web page content
◦ The website’s hierarchical structure (different pages that are semantically related, e.g. under the same group)
◦ Semantic enrichment of navigated URLs (semantically enhanced web logs -> C-Logs)
◦ A taxonomy, which can be “defined explicitly” or “inferred implicitly” via URL tokenization (http://a/b/c.htm)

◦ Preprocess the web log to identify sessions and produce their vector representations
◦ Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering) -> a GA approach
◦ Enrich profiles with additional facets (external knowledge)
◦ Track profile evolution, and measure the validity of the discovered profiles
Session Identification
◦ Sessions are extracted using only the web log files (no login data, no cookies)
◦ Access time, IP address, URL viewed, and REFERRER are used for session identification
Session Representation
◦ Each valid URL in the web site is given a unique number j ∈ {1, 2, …, Nu}
◦ Each session is represented as a binary vector of size Nu. Navigation order is not considered.
◦ Example (number of valid URLs = 4): 1001 -> the user accessed URLs 1 and 4

Unsupervised Niche Clustering (UNC)
The goal is to find
◦ profiles pi (sets of URLs representing session clusters) and
◦ scales σi (variance/dispersion of the sessions in a cluster around the cluster's representative profile)
wij : robust weight of a profile pi on a session sj. If this value is large (i.e. a profile is “close” to a session), pi is a strong representative of sj, which has a positive effect on the fitness value of pi.

Randomly select Np sessions as the initial pi’s
Initialize the variable to some small value
Repeat:
◦ Calculate the distance (!) dij between every profile pi and every session sj
◦ Calculate the robust weight wij for every profile pi and every session sj
◦ Calculate the scale σi for every profile pi
◦ Calculate the fitness fi for every profile
◦ Repeat (GA loop):
   ◦ Randomly select parent profiles
   ◦ Generate child profiles (through crossover and mutation)
   ◦ Calculate the fitness values of the child profiles
   ◦ Apply deterministic crowding as the replacement policy

Hierarchical Unsupervised Niche Clustering (H-UNC)
A divisive hierarchical version of UNC
Repeatedly divide clusters into smaller clusters, hierarchically, considering
◦ the required hierarchy level (Lmax)
◦ the maximum allowed cluster cardinality (Nsplit)
◦ the maximum allowed scale (σsplit)
As a result, we have profile vectors and their scales. Sessions are assigned to the closest profiles.

Similarity measures
◦ Cosine similarity between session vectors
◦ Web session similarity (uses the web site structure), where Su(l,k) is the URL-to-URL similarity
The distance used in UNC is then d = (1 - Sweb)^2

URL structure (tokenized URL paths P): http://a/b/c.html
For dynamic content, relations with an externally defined taxonomy (“is-a” relation): http://products.php?id=1&category=x

For dynamic content (dynamic URLs), preprocess the data and map the dynamic URLs to strings separated by “/” using an ontology. Given such a table (taxonomy data), a hierarchical structure can be defined even for dynamic URLs.
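The session representation and the similarity measures above can be illustrated with a minimal Python sketch. The slides do not give the formulas, so everything here is an assumption made for illustration: `cosine_similarity` is the standard cosine for 0/1 vectors, and `url_similarity` is a hypothetical shared-prefix measure over tokenized URL paths, not necessarily the Su(l,k) used in the paper. All function names are invented for this example.

```python
import math
from urllib.parse import urlparse


def session_vector(visited, n_urls):
    """Binary session vector of size n_urls; `visited` holds 1-based URL numbers."""
    return [1 if j in visited else 0 for j in range(1, n_urls + 1)]


def cosine_similarity(a, b):
    """Cosine similarity of two binary vectors (sum(v) equals the squared norm for 0/1 entries)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))
    return dot / norm if norm else 0.0


def tokenize_url(url):
    """Split the path of a URL into tokens, e.g. http://a/b/c.html -> ['b', 'c.html']."""
    return [token for token in urlparse(url).path.split("/") if token]


def url_similarity(u, v):
    """Hypothetical URL-to-URL similarity: shared path prefix length / longest path length."""
    pu, pv = tokenize_url(u), tokenize_url(v)
    common = 0
    for a, b in zip(pu, pv):
        if a != b:
            break
        common += 1
    longest = max(len(pu), len(pv))
    return common / longest if longest else 1.0


def unc_distance(similarity):
    """Distance used in UNC as stated on the slide: d = (1 - S)^2."""
    return (1.0 - similarity) ** 2


s1 = session_vector({1, 4}, 4)   # [1, 0, 0, 1], i.e. the slide's example 1001
s2 = session_vector({1, 2}, 4)   # [1, 1, 0, 0]
sim = cosine_similarity(s1, s2)
print(sim)                       # 0.5
print(unc_distance(sim))         # 0.25
print(url_similarity("http://a/b/c.html", "http://a/b/d.html"))  # 0.5
```

In this sketch two pages under the same directory score 0.5 even though they are different URLs, which is the effect the website hierarchy is meant to add on top of plain cosine similarity.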
After H-UNC, we have clusters of sessions.
Summarize the sessions in each cluster as a profile vector pi, where pik is the frequency with which URL k was accessed in the sessions belonging to cluster i.
Example:
◦ For cluster 1, let s11 = 1001, s12 = 1100, s13 = 1001
◦ Then p1 = (1, 0.33, 0, 0.67)
Convert the pi’s to binary vectors so that only URLs with some minimum weight remain.
Example:
◦ Let the minimum URL weight be 0.5
◦ Then p1 = 1001

Extend for Robust Profiles
◦ Calculate the weights wij for all sessions in a cluster (between profile i and session j), as in UNC
◦ Assign sessions with high weights (robust weights) to the cluster’s “core”.
◦ So a cluster’s “core” is the group of sessions that are very similar to the representative profile. Thus, noisy sessions are eliminated.

Enrich the profiles with facets (additional profile descriptors) such as:
◦ Search queries
◦ Inquiring companies
◦ Inquired companies
using IP addresses, whois.com, a registration database, etc. for the sessions in the cluster

Profile boundaries: the pi vectors and the σi (scale/variance/dispersion) values are used to determine the boundaries
Profile compatibility: how much the boundaries of two profiles overlap
Algorithm to “TrackProfiles”. The idea is:
◦ Divide time into periods, and generate profiles for each time period.
◦ Compare the profile vectors of consecutive time periods Ti and Ti+1 using Sweb
◦ If the distance (i.e. 1 - similarity) is < σprofile1, then the two profiles found in Ti and Ti+1 are related.

Birth: a new profile is incompatible with the old profiles
Persistence: a new profile is compatible with an old profile
◦ One-to-one
◦ Bifurcation (splitting)
◦ Merging
Death: no new profile is found for an old profile
Atavism (reappearance): an old profile disappears, then reappears
Volatile: dead profiles that have never been persistent

[Figure: example of a profile merge]

How close are the profiles to the original input data?
◦ Precision: a profile with high precision should include “only” the true items
◦ Coverage (Recall): a profile with high coverage should include all data items
Example: let the session be 1001; then the profiles
◦ 1000, 0001: high precision, low coverage
◦ 1101, 1011: low precision, high coverage
◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session would have to be a profile)

So we need to balance precision and coverage with some small number of profiles to get a high quality Qij for session j and profile i. Define
◦ Precision: Precij = |sj ∩ pi| / |pi|
◦ Coverage: Covij = |sj ∩ pi| / |sj|
A combined quality measure is defined as Qij = F1,ij = 2 * Precij * Covij / (Precij + Covij)

So, we have defined the quality measure between a profile and a session. Now, how do we capture concept drift?
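Before answering that, the quality measure just defined can be checked on the example session 1001. The following is a minimal Python sketch of the precision, coverage, and F1 definitions above; the function name `quality`, the bit-string representation, and the zero-division guards are assumptions added for illustration.

```python
def quality(session, profile):
    """Return (precision, coverage, F1) for two bit strings such as '1001'."""
    s = {i for i, bit in enumerate(session) if bit == "1"}
    p = {i for i, bit in enumerate(profile) if bit == "1"}
    overlap = len(s & p)
    # Guards for empty sets are an addition; the slides do not discuss them.
    precision = overlap / len(p) if p else 0.0
    coverage = overlap / len(s) if s else 0.0
    total = precision + coverage
    f1 = 2 * precision * coverage / total if total else 0.0
    return precision, coverage, f1


session = "1001"
for profile in ("1000", "1101", "1001"):
    prec, cov, f1 = quality(session, profile)
    print(profile, round(prec, 2), round(cov, 2), round(f1, 2))
# 1000 1.0 0.5 0.67   (high precision, low coverage)
# 1101 0.67 1.0 0.8   (low precision, high coverage)
# 1001 1.0 1.0 1.0    (ideal case)
```

F1 penalizes whichever of the two components is weaker, which is why neither the too-narrow profile 1000 nor the too-broad profile 1101 reaches the quality of the exact match.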
The idea is:
◦ Decide on a minimum quality threshold Qmin to be satisfied
◦ Discover the profiles at time period T2
◦ Take the sessions at the next time period T1 and, for each session sj, find the maximum quality Qij over the profiles from the previous time frame
◦ If the quality is higher than Qmin, add session sj to the set of quality sessions, denoted by s*(T1, T2)

As a result, quality can be measured as the rate of sessions at T1 that belong to s*(T1, T2). As long as most of the sessions at T1 are successfully represented by the profiles found at T2, this rate will be high. If the minimum quality threshold Qmin is low, the rate will be high; the best case is 1. As Qmin increases, the number of sessions satisfying this quality decreases.

Example graph showing
◦ the sessions satisfying the minimum quality (y-axis)
◦ the minimal F1 quality (x-axis)
◦ The dark line (lower) is the cross-period validation.
◦ The light line (upper) is the validation of the profiles against the sessions in the same time frame.

The authors generate profiles with facets (such as search queries, inquired companies, inquiring companies, etc.)

Profiles are generated from the first half of September
◦ Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
◦ Dark lines compare the same profiles with the sessions in the following time frames (cross validations).
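The cross-period validation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: sessions and profiles are modeled as sets of URL numbers, the rate is taken to be the fraction of sessions whose best F1 quality reaches Qmin, and `>=` is used instead of the slide's "higher than" so that a perfect match still counts at Qmin = 1. The names `f1_quality` and `validation_rate` and the sample data are hypothetical.

```python
def f1_quality(session, profile):
    """F1 of precision |s ∩ p| / |p| and coverage |s ∩ p| / |s| for two sets of URL numbers."""
    overlap = len(session & profile)
    if overlap == 0:
        return 0.0
    precision = overlap / len(profile)
    coverage = overlap / len(session)
    return 2 * precision * coverage / (precision + coverage)


def validation_rate(sessions, profiles, q_min):
    """Fraction of sessions whose best-matching profile has quality >= q_min."""
    if not sessions:
        return 0.0
    good = [
        s for s in sessions
        if max((f1_quality(s, p) for p in profiles), default=0.0) >= q_min
    ]
    return len(good) / len(sessions)


# Hypothetical profiles mined in one period and sessions from the following period.
profiles = [{1, 4}, {2, 3}]
sessions = [{1, 4}, {1}, {2, 3, 4}, {5}]

for q_min in (0.5, 0.7, 0.9):
    print(q_min, validation_rate(sessions, profiles, q_min))
# 0.5 0.75
# 0.7 0.5
# 0.9 0.25
```

The rate falls as Qmin rises, which is the shape of the curves in the graphs: each point is the share of sessions that the profiles still represent at that minimum quality.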
In this experiment, the profiles from period T are validated against the sessions in the immediately following time period T+1
For Figure 1:
◦ Profiles are generated using the sessions in the first half of September
◦ The light line shows the validation using the sessions in the same time period
◦ The dark line shows the validation using the sessions in the following time period, i.e. the second half of September

There is one more experiment; the only difference from the previous one is that it uses a shorter time period (1 week) for the observations. The idea is the same.

The work presented in this paper is an unsupervised learning approach that tries to learn mass anonymous user profiles
The profiles are mined in a no-memory, revolutionary scheme.
The evolving profiles are validated in a full-memory mode.

In the paper, facets are used to support the profiles with additional information, but it is not mentioned how they are used (e.g. the most searched companies, etc.).