Web Usage Mining
Structuring semantically enriched clickstream data
by
Peter I. Hofgesang
Stud.nr. 1421247
A thesis submitted to the
Department of Computer Science
in partial fulfilment of the requirements
for the degree of Master of Computer Science at the
Vrije Universiteit
Amsterdam, The Netherlands
August 2004
supervisor
Dr. Wojtek Kowalczyk
Faculty of Sciences, Vrije Universiteit Amsterdam
Department of Computer Science
second reader
Dr. Elena Marchiori
Faculty of Sciences, Vrije Universiteit Amsterdam
Department of Computer Science
Abstract
Web servers worldwide generate a vast amount of information on web users’ browsing
activities. Several researchers have studied these so-called clickstream or web access log data to
better understand and characterize web users.
Clickstream data can be enriched with information about the content of visited pages and the
origin (e.g., geographic, organizational) of the requests. The goal of this project is to analyse
user behaviour by mining enriched web access log data. We discuss techniques and processes
required for preparing, structuring and enriching web access logs. Furthermore we present
several web usage mining methods for extracting useful features. Finally we employ all these
techniques to cluster the users of the domain www.cs.vu.nl and to study their behaviours
comprehensively.
The contributions of this thesis are a content- and origin-based data enrichment and a tree-like visualization of frequent navigational sequences. This visualization offers an easily
interpretable view of patterns in which relevant information is highlighted.
The results of this project can be applied to diverse purposes, including marketing, web content
advising, (re-)structuring of web sites and several other E-business processes, such as
recommendation and advertiser systems.
Content
1 Introduction
2 Related research
3 Data preparation
3.1 Data description
3.2 Cleaning access log data
3.3 Data integration
3.4 Storing the log entries
3.5 An overall picture
4 Data structuring
4.1 User identification
4.2 User groups
4.3 Session identification
4.4 An overall picture
5 Profile mining models
5.1 Mining frequent itemsets
5.2 The mixture model
5.3 The global tree model
6 Analysing log files of the www.cs.vu.nl web server
6.1 Input data
6.2 Distribution of content-types within the VU-pages and access log entries
6.3 Experiments on data structuring
6.4 Mining frequent itemsets
6.5 The mixture model
6.6 The global tree model
7 Conclusion and future work
Acknowledgements
Bibliography
APPENDIX
APPENDIX A. The uniform resource locator (URL)
APPENDIX B. Input file structures
APPENDIX C. Experimental details
APPENDIX D. Implementation details
APPENDIX E. Content of the CD-ROM
Structure
This Master Thesis is organized as follows:
Chapter 1, “Introduction”
This chapter provides a high-level overview of the related research and main goals of this
project.
Chapter 2, “Related research”
Chapter 2 gives a comprehensive overview of the related research known so far.
Chapter 3, “Data preparation”
This chapter follows through all steps of the data preparation process. It starts by describing the
main characteristics of the input data, followed by a description of the data cleaning process. The
section on data integration explains how the different data sources are merged for
data enrichment, while the next section concerns data loading. Finally an overall scheme and an
experiments section are laid out.
Chapter 4, “Data structuring”
In chapter 4 we explain how the semantically enriched data is combined to form user sessions. It
also discusses the process of user identification and gives a description of groups of users, both
of which are preliminary requirements of the identification of sessions. The chapter ends with
an overall scheme of data structuring followed by a section of experiments.
Chapter 5, “Profile mining models”
This chapter provides an overview of the theoretical background of applied data mining models.
First it explains the widely used mining algorithm of frequent itemsets. The following section
describes the recently researched mixture model architecture. And finally a tree model is
proposed for exploiting the hierarchical structure of session data.
Chapter 6, “Analysing log files of the www.cs.vu.nl web server”
Chapter 6 discusses experimental results of mining models applied on the semantically enriched
data. All the input data are related to a specific web domain: www.cs.vu.nl.
Chapter 7, “Conclusion and future work”
Finally in chapter 7 we present the conclusions of our research and explore avenues of future
work.
1 Introduction
The rapid growth of the information reachable via the Internet makes that information increasingly
difficult to manage. Numerous companies face the problem of publishing their product range or
information online in an efficient, easily manageable way. Exploring the habits and
behaviours of web users plays a key role in dissecting and understanding this problem.
Web mining is an application of data mining techniques to web data sets. Three major web
mining methods are web content mining, web structure mining and web usage mining. Content
mining applies methods to web documents. Structure mining reveals hidden relations in web
site and web document structures. In this thesis we employ web usage mining which presents
methods to discover useful usage patterns from web data.
Web servers are responsible for providing the available web content on user requests. They
collect all the information on request activities into so-called log files. Log data are a rich source
for web usage mining.
Much scientific research targets the field of web usage mining, and user behaviour
exploration in particular. Besides, there is a great demand in the business sector for personalized, custom-designed systems that conform closely to the requirements of users.
There is also a substantial body of prior scientific work on modelling web user
characteristics. Some studies present a complete framework for the whole web usage mining
task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER).
Many of them present page access frequency based models and modified association rules
mining algorithms, such as [1, 31, 23]. Xing and Shen (2003) [30] proposed two algorithms
(UAM and PNT) for predicting user navigational preferences both based on page visits
frequency and page viewing time. UAM is a URL-URL matrix providing page-page transition
probabilities concerning all users’ statistics. And PNT is a tree based algorithm for mining
preferred navigation paths. Nanopoulos and Manolopoulos (2001) [21] present a graph based
model for finding traversal patterns on web page access sequences. They introduce one level-wise and two non-level-wise algorithms for large paths exploiting graph structure.
While most of the models work on global “session levels”, a growing body of research
shows that the exploration of user groups or clusters is essential for better characterisation: Hay
et al. (2003) [14] suggest Sequence Alignment Method (SAM) for measuring distance of
sessions incorporated within structural information. The proposed distance is reflected by the
number of operations required to transform sessions into one another. SAM distance based
clusters form the basis of further examinations. Chevalier et al. (2003) [8] suggest rich
navigation patterns consisting of frequent page set groups and web user groups based on
demographical patterns. They show the correlation between the two types of data.
Other research goes far beyond frequency based models: Cadez et al. (2003) [4] propose a
finite mixture of Markov models on sequences of URL categories traversed by users. This
complex probability based structure models the data generation process itself.
In this thesis we discuss techniques and processes required for further analysis. Furthermore we
present several web usage mining methods for extracting useful features. An overall process
workflow can be seen in figure 1.
[Figure 1: workflow diagram. The input data (web server’s access log data, content type mapping table, geographical and organizational information) pass through data preparation (data filtering, data integration, database, user selection) and session identification; the identified sessions are written in AR, MM and GTM formats for profile mining (association rules, mixture model, tree model).]
Figure 1: The overall process workflow
This thesis considers three separate data sets as input data. Access log data are generated by the
web server of the specified domain and contain user access entries. The content-type mapping
table contains relations between documents and their category in the form of URL / content type
pairs. Mapping tables can either be generated by classifier algorithms or by content providers. In
the case of this latter type, contents of pages are given explicitly in the form of content
categories (e.g., news, sport, weather, etc.). Geographical and organizational information make
it possible to determine different categories of users.
All data mining tasks start with data preparation, which prepares the input data for further
examination. It consists of four main steps, as can be seen in figure 1. Data filtering strips out
irrelevant entries, data integration enriches log data with content labels and the enriched data are
stored in a database. The user selection process sorts out appropriate user entries of a specified
group for session identification.
The following step in the whole process is session identification. Related log entries are
identified as unique user navigational sequences. Finally these sequences are written to output
files in different formats depending on the application.
The profile mining step applies several web usage mining methods to discover relevant patterns.
It uses an association rules mining algorithm [1] for mining frequent page sets and for
generating interesting rules. It also applies the mixture model proposed by Cadez et al. (2001)
[5] to build a predictive model of navigational behaviours of users. Finally it presents a tree
model for representing and visualizing visiting patterns in a natural, easily readable way.
In the experimental part of this thesis we employ all these techniques to address the problem of
defining clusters on the users of the www.cs.vu.nl web domain and we study their behaviours
comprehensively.
The contributions of this thesis are content based data enrichment and visualization of frequent
navigational sequences. Data enrichment amplifies users’ transactional data with the content
types of visited pages and documents and makes distinctions among users based on
geographical and organizational information. The visualization presents a tree-like view of
patterns that highlights relevant information and can be interpreted easily.
2 Related research
There are numerous commercial software packages usable to obtain statistical patterns from
web logs, such as [11, 22, 37]. They focus mostly on highlighting log data statistics and
frequent navigation patterns but in most cases do not explore relationships among relevant
features.
Some studies propose data structures to facilitate web log mining processes. Punin
et al. (2001) [24] defined the XGMML and LOGML XML languages. XGMML is for graph
description while the latter is for web log description. Other papers focus only (or mostly) on
data preparation [6, 13, 15]. Furthermore there are complete frameworks presented for the
whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER).
Many studies, such as [1, 23, 31], present page access frequency based models and modified
apriori [1] (frequent itemset mining) algorithms. Some papers (e.g., [32], [10], [9]) present online
recommender systems to assist the users’ browsing or purchasing activity. Yao et al. (2000) [32]
use standard data mining and machine learning techniques (e.g., frequent itemset mining, C4.5
classifier, etc.) combined with agent technologies to provide an agent based recommendation
system for web pages. Cho et al. (2002) [10] suggest a product recommendation method
based on data mining techniques and product taxonomy. This method employs decision tree
induction to select the users likely to buy the recommended products.
Hay et al. (2003) [14] apply sequence alignment method (SAM) for clustering user navigational
paths. SAM is a distance-based measuring technique that considers the order of sequences. The
SAM distance of two sequences reflects the number of transformations (i.e., delete, insert,
reorder) required to equalize them. A distance matrix is required for clustering which holds
SAM distance scores for all session pairs. The analysis of the resulting clusters showed that the
SAM based method outperforms the conventional association distance based measuring.
In their paper Runkler and Bezdek (2003) [27] use relational alternating cluster estimation
(RACE) algorithm for clustering web page sequences. RACE finds the centers for a specified
number of clusters based on a page sequence distance matrix. The algorithm alternately
computes the distance matrix and one of the cluster centers in each iteration. They propose
Levenshtein (a.k.a. edit) distance for measuring the distance between members (i.e., textual
representation of visited page number sequences within sessions). Levenshtein distance counts
the number of delete, insert or change steps necessary to transform one word into the other.
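The distance described above can be illustrated with a minimal sketch. It assumes that sessions are given as sequences of page identifiers and that every delete, insert or change step costs 1; the function name `levenshtein` is chosen for illustration and is not taken from [27].

```python
def levenshtein(a, b):
    """Return the number of delete, insert or change steps turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # change item_a into item_b (or keep it)
            ))
        previous = current
    return previous[-1]
```

The function works on strings as well as on lists of page numbers, so a session can be passed either as its textual representation or as a list of identifiers.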
Pei et al. (2000) [23] propose a data structure called web access pattern tree (WAP-tree) for
efficient mining of access patterns from web logs. WAP-trees store all the frequent candidate
sequences that have a support higher than a preset threshold. The only information stored in a
WAP-tree is the label and frequency count of each node. To mine useful patterns in WAP-trees they present the WAP-mine algorithm, which applies conditional search for finding frequent
events. Together, the WAP-tree structure and the WAP-mine algorithm offer an alternative to apriori-like algorithms.
Smith and Ng (2003) [28] present a self-organizing map framework (LOGSOM) to mine web
log data and present a visualization tool for user assistance.
Jenamani et al. (2003) [16] use a semi-Markov process model for understanding e-customer
behaviour. The keys of the method are a transition probability matrix (P) and a mean holding
time matrix (M). P is a stochastic matrix and its elements store the probabilities of transition
states. M stores the average lengths of time for processes to remain in state i before moving to
state j. In this way this probabilistic model is able to model the time elapsed between transitions.
Some papers present methods based on content assumptions. Baglioni et al. (2003) [2] use
URL syntax to determine page categories and to explore the relation between users’ sex and
navigational behaviour. Cadez et al. (2003) [4] experiment on categorized data from
Msnbc.com.
Visualization of frequent navigational patterns makes human perception easier. Cadez et al.
(2003) [4] present a WebCanvas tool for visualizing Markov chain clusters. This tool represents
all user navigational paths for each cluster, colour coded by page categories. Youssefi et al.
(2003) [33] present a 3D visualization of web log patterns superimposed on extracted web structure
graphs.
3 Data preparation
Preparing the input data is the first step of all data and web usage mining tasks. The data in this
case are, as mentioned above, the access log files of the web server of the examined domain and
the content types mapping table of the HTML pages within this domain.
Data preparation consists of three main steps: data cleaning/filtering, data integration and
data storing. Data cleaning is the task of removing all irrelevant entries from the access log data
set. Data integration establishes the relation between log entries and content mappings. The
last step is to store the enriched data in a convenient database. Cooley et al. (1999) [13] have made
a comprehensive study of all these preprocessing tasks.
This chapter starts with a description of the input data and how they are generated, followed by
the details of access log cleaning and of the integration of log entries with the mapping
data. Finally it presents the database scheme for data storing and an overall picture and
description of the data preparation process.
3.1 Data description
This section describes the details of the access log and content type mapping data.
3.1.1 Access log files
Visitors to a web site click on links and their browser in turn requests pages from the web
server. Each request is recorded by the server in so-called access log files (see footnote 1). Access logs contain
requests for a given period of time. The time interval used is normally an attribute of the web
server. There is a log file present for each period and the old ones are archived or erased
depending on the usage and importance.
Most log files of web servers are stored in the common log file format (CLFF) [34] or in the
extended log file format (ELFF) [35]. An extended log file contains a sequence of lines
containing ASCII characters terminated by either the sequence LF or CRLF. Entries consist of a
sequence of fields relating to a single HTTP transaction. Fields are separated by white space. If
a field is unused in a particular entry, a dash ("-") marks the omitted field.
Web servers can be configured to write different fields into the log file in different formats. The
most common fields used by web servers are the following: remotehost, rfc931, authuser, date,
request, status, bytes, referer, user_agent.
Footnote 1: There are other types of log files generated by the web server as well, but this project does not consider them.
The meanings of all these fields are explained in the table below with given examples:
The most commonly used fields of access log file entries by web servers
Field name      Description of the field (with example)
remotehost      Remote hostname (or IP number if DNS hostname is not available)
                example: 82.168.4.229
rfc931          The remote login name of the user.
                example:
authuser        The username with which the user has authenticated himself.
                example:
[date]          Date and time of the request with the web server’s time zone.
                example: [20/Jan/2004:23:17:37 +0100]
"request"       The request line exactly as it came from the client. It consists of three
                subfields: the request method, the resource to be transferred, and the used
                protocol.
                example: "GET / HTTP/1.1"
status          The HTTP status code returned to the client.
                example: 200
bytes           The content-length of the document transferred.
                example: 12079
"referer"       The url the client was on before requesting the url.
                example: "-"
"user_agent"    The software the client claims to be using.
                example: "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
Table 1
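To make the field layout concrete, the following minimal sketch splits one log entry into the nine fields of Table 1 with a regular expression. It assumes that the fields appear in exactly the order listed above and that quoted fields contain no embedded double quotes; the names `LOG_PATTERN` and `parse_entry` are illustrative and not part of the thesis implementation.

```python
import re

# One named group per field of Table 1, in the order assumed above.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of field name -> value, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

# Entry assembled from the example values of Table 1; the dashes for
# rfc931 and authuser are an assumption (omitted fields).
line = ('82.168.4.229 - - [20/Jan/2004:23:17:37 +0100] "GET / HTTP/1.1" '
        '200 12079 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"')
entry = parse_entry(line)
```

All values are returned as strings; a line that does not follow the assumed layout yields `None` and can be skipped by the caller.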
3.1.2 Content types mapping table
A content types mapping table is a table containing URL/content type pair entries. URLs are file
locator paths referring to documents, and content types are labels giving the types of documents
(for more details about URL refer to APPENDIX A). Content types can either be generated by
an algorithm or by content providers where the contents of pages are given explicitly (e.g., sport
pages refer to “sport” content, etc.). Generator algorithms can also be distinguished depending
on whether they produce the content types automatically or are driven by human interaction.
We use an external algorithm [3], which attaches labels to all HTML documents in a collection
of HTML pages based on their contents. The algorithm is based on the naive Bayes classifier
supplemented by a smart example selector algorithm. It uses only the textual content of the
HTML pages stripping out the control tags. Some parts of the text enclosed within special tags
(e.g., title or header tags) are biased. The algorithm chooses the first 100 pages randomly to be
categorized by humans. This initialization step is followed by an “active learning” method. This
method chooses the examples by considering the ones already selected.
This thesis deals with other documents besides HTML as well (e.g., pdf, ps, doc, rtf, etc.).
However, it would be difficult to attach labels to each of them based on their content,
because the structure of these files is format-specific and usually very complex, and
their size is usually very large. For these reasons a very simple technique is used to identify
such documents. The label “documents” is attached to all pdf and ps files, which refer to scientific
papers, e-books, documentation, etc., while the label “other documents” is attached to all other
document types (e.g., doc, rtf, ppt, etc.). Other documents are, e.g., administrative papers,
forms, etc. Accordingly, the mapping table is extended with entries carrying these
two labels.
The following table presents an example of a content types mapping table:
An example of a content-type mapping table
URL                              content type identifier
bi/courses-en.html               4
ci/DataMine/DIANA/index.html     6
…                                …
Table 2
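A mapping table of this kind can be held in memory as a dictionary keyed by URL. The sketch below is only an illustration: it assumes one entry per line with the URL and the content type identifier separated by a tab, which is not necessarily the file structure described in APPENDIX B, and the name `load_mapping` is hypothetical.

```python
def load_mapping(lines):
    """Build a URL -> content type identifier dictionary from text lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Assumed layout: "<URL><TAB><content type identifier>".
        url, content_type = line.rsplit("\t", 1)
        mapping[url] = int(content_type)
    return mapping

mapping = load_mapping([
    "bi/courses-en.html\t4",
    "ci/DataMine/DIANA/index.html\t6",
])
```

A lookup such as `mapping.get(url)` then returns the identifier, or `None` for a URL that has no entry in the table.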
3.2 Cleaning access log data
As described above, raw access log files contain a vast number of diverse request entries. Each
log entry can be informative for some application but this project excludes most of them.
Processing certain types of requests (e.g., requests generated
by spider engines) would lead to wrong conclusions. Besides, stripping the data has a positive effect on processing time and the
required storage space.
Since this project focuses only on documents themselves (like html, pdf, ps, doc files) all the
request entries on different file types should be stripped out. Furthermore as the main goal is the
characterization of users, robot transactions, which generate web traffic automatically by robot
programs, must also be filtered out. There are several other criteria for filtering. Detailed
descriptions of the filtering criteria and methods follow further on.
3.2.1 Filtering unsupported extensions
A typical web page is made up of many individual files. Beyond the HTML page it consists of
graphical elements, code styles, mappings etc., all in separate files. Each user request for an
HTML file evokes hidden requests for all the files required for displaying that specific page. In
this manner access log files contain all the hidden requests’ traces as well.
Extension filtering strips out all the request entries for file types other than the predefined ones (for the
structure of the extension list file refer to APPENDIX B4 Extension filter list file). The extensions of
requested files can be extracted from the “request” field of log entries.
An example of such request field:
"GET /ai/kr/imgs/ibrow.jpg HTTP/1.0"
3.2.2 Filtering spider transactions
A significant portion of log file entries is generated by robot programs. These robots, also
known as spider or crawler engines, automatically search through a specific range of the web.
They index web content for search engines, prepare content for offline browsing or for several
other purposes.
The common point in all crawlers’ activity is that, although they are mostly supervised by
humans, they generate systematic, algorithmic requests. So without eliminating spider entries
from log files, real users’ characteristics would be distorted by features of machines.
Spiders can be identified by searching for specific spider patterns in the “user_agent” field of
log entries. Most well-behaved spiders put their name or some kind of pattern that
identifies them into this field. Once a pattern has been identified, the filter method ignores the
examined log entry.
Spider patterns can be found by searching the web. Several pages
discuss spider activities and patterns, and there are many professional forums on the
subject (mostly discussing how to avoid them) [29].
Spider patterns are collected in a separate spider list file (refer to APPENDIX B5).
An example of such user_agent field:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
3.2.3 Filtering dynamic pages
Web pages generated dynamically on user requests are called “dynamic pages”. These pages
cannot be located on the web server as individual files, since they are built by a specific
engine using several data sources. For this reason dynamic pages cannot be analyzed in a simple
way. However, with the application of several tricks it is still possible to obtain useful
information. Jacobs et al. (2001) [15] use an inductive logic programming (ILP) framework
to reveal usage patterns based on dynamic page link parameters that are passed to the server.
Since it is not an objective of this thesis to apply sophisticated methods for information recovery
on dynamic pages, the filtering process simply eliminates all such references.
There is no standard for the structure of URL requests for dynamic pages, except that parameters,
which consist of name/value pairs, appear after the “?” (question mark) in the URL. Therefore,
dynamic pages can basically be filtered out by searching for the question mark in “request”
fields of log entries. Note that requests for a dynamic page without any parameters, and thus
without the delimiting question mark, are stripped out during extension filtering (e.g., *.jsp,
*.php, *.asp pages).
An example of such a dynamic page’s request field:
"GET /obp/overview.php?lang=en HTTP/1.0"
3.2.4 Filtering HTTP request methods
HTTP/1.0 [25, 26] allows several methods to be used to indicate the purpose of a request. The
most often used methods are GET, HEAD and POST. Since using the GET method is the only
way of requesting a document that could be useful for this project, the request method filter
ignores any other requests. The filter examines the “request” field of the log entry for the
“GET” method identifier.
An example of such a request field:
"POST /modules/coppermine/themes/default/theme.php HTTP/1.0"
3.2.5 Filtering and replacing escape characters
URL escape characters are special character sequences made up of a leading “%” character and
two hexadecimal characters. They substitute special characters in URL requests that could be
problematic while transferring requests to different types of servers. Special characters are
simply replaced by sequences of standard characters.
In most cases the task is only to replace these escape sequences with their representatives, but in
certain instances URLs contain corrupted sequences that cannot be interpreted. In these cases
the entries should be ignored. Corrupt sequences can be caused by typing errors of the users,
automatically generated robot requests, etc.
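Replacing escape sequences and rejecting corrupted ones can be sketched with the Python standard library as follows. The sketch assumes that a corrupted sequence is a “%” that is not followed by two hexadecimal characters, and that escaped bytes are UTF-8 encoded; both are assumptions for illustration, and the name `decode_url` is hypothetical.

```python
import re
from urllib.parse import unquote

# A "%" that is NOT followed by two hexadecimal characters is treated as corrupt.
_CORRUPT_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")

def decode_url(url):
    """Return the URL with escape sequences replaced, or None if it is corrupt."""
    if _CORRUPT_ESCAPE.search(url):
        return None
    try:
        # Assumption: escaped bytes form valid UTF-8 text.
        return unquote(url, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

An entry for which the function returns `None` would be ignored by the filter.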
3.2.6 Filtering unsuccessful requests
If a user requests a page that does not exist, the web server replies with the well-known “404,
page not found” error message. In this case the user has to use the “back” button to navigate
back to the previous page or type a different URL manually. Either way the user cannot navigate
onward from the requested page, since the error page doesn’t provide any link to follow.
For this reason log entries of erroneous requests should also be ignored. These entries can be
filtered by examining the “status” field. The status of failed requests is usually “404”.
In special cases the status field can take other values as well, such as “503”.
An example of such a log entry:
200.177.162.127 - - [16/May/2004:08:07:42 +0200] "POST /modules/coppermine/include/init.inc.php HTTP/1.0" 404 302 "-" "Mozilla 4.0 (Linux)"
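The status check can be sketched as below. The text names “404” and “503” as examples; treating every status code of 400 or above as unsuccessful is an assumption made for this illustration, as is the function name `is_successful`.

```python
def is_successful(status):
    """Return True unless the status code signals a client or server error."""
    # Assumption: 4xx (e.g., 404) and 5xx (e.g., 503) codes mark failed requests.
    return int(status) < 400
```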
3.2.7 Filtering request URLs for a domain name
A URL of a page request consists of a domain name and the path of the requested document
relative to the public directory of the domain. Since the domain name is not ambiguous to the
responsible web server, it stores only the relative path of the request in the access log files,
without the domain name. In a few cases however, log file entries tend to contain the whole
absolute path. This leads to mapping errors during data integration, since the mapping table
contains only relative paths and comparison is based on paths similarity. For these reasons a
URL in the “request” field has to be transformed to the relative format.
An example of such request field:
"GET /www.cs.vu.nl/fb/generated/wrk_units/120.html HTTP/1.1"
3.2.8 Path completion
When a user requests a public directory instead of a specific file, the web server tries to find the
default page in that directory. The default page is “index.html” in most cases, but it varies
between web servers. Thus the task is to complete the URL with the name of the default page
whenever a log entry contains a directory request. It is possible that the requested directory does
not contain a default page on the server. In this case the log entry in question is filtered out when
it is looked up in the content type mapping table (refer to section 3.1.2 Content types mapping
table).
An example of such a request field:
original request field: "GET /pub/minix/ HTTP/1.1"
completed request field: "GET /pub/minix/index.html HTTP/1.1"
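Path completion can be sketched as follows. The sketch assumes that directory requests can be recognised by a trailing slash and that the default page is index.html; both are simplifications, and the function name is hypothetical.

```python
def complete_path(path, default_page="index.html"):
    """Append the default page name to a directory request (trailing slash)."""
    if path.endswith("/"):
        return path + default_page
    return path


print(complete_path("/pub/minix/"))
```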
3.2.9 Filtering anchors
Anchors are special qualifiers for HTML link references. They act as reference points within a
single web page. If a named anchor is placed somewhere in the body of an HTML page, a link
that consists of the URL of the page followed by a hash mark and the name of the anchor (i.e.,
URL + “#” + anchor name) makes the browser scroll directly to the place where the anchor is
put.
Anchors should be stripped from URLs, otherwise the HTML document cannot be found in
the mapping table.
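Stripping an anchor amounts to cutting the URL at the first hash mark. A minimal sketch, with a hypothetical function name:

```python
def strip_anchor(path):
    """Remove the '#anchor' part of a URL, if present."""
    return path.split("#", 1)[0]


print(strip_anchor("/vakgroepen/ai/education/courses/micd/opgave_1.html#1c"))
```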
An example of such a request field:
"GET /vakgroepen/ai/education/courses/micd/opgave_1.html#1c HTTP/1.1"
We don’t filter frame pages. Frames are supported by the HTML specification and make it
possible to split an HTML document into several “sub documents” (e.g., a frame for the
navigation menu, a frame for the content, etc.). Each frame refers to a specific HTML
document, resulting in a separate page request. The main frame page contains mostly special
tags for controlling all the subframes. This page is either labelled miscellaneous or labelled the
same as its subframes by the text mining algorithm [3]. Either way there is no need to pay
special attention to such pages while preparing the data.
3.3 Data integration
A novel approach in this project is to use content types of the visited pages rather than URL
references. Content types, as described earlier, are given in a special mapping table where each
entry consists of an URL/content type pair (refer to section 3.1.2 Content types mapping table).
Data integration in this context means that there should be a content type label attached to every
single stored log entry.
The simplest and most convenient method is to attach content labels to transactions during data
cleaning². This saves time, since both processes share the same pass over the data.
After cleaning and filtering a log entry, the data integration step looks up the entry’s request
URL in the mapping table. If the URL is present, the corresponding type label is attached to the
entry. Otherwise the extension of the URL is checked for a valid document type other than
HTML (refer to section 3.2.1 Filtering unsupported extensions), and this document type is
looked up in the table. If the URL refers to an HTML page that is missing from the table, the
entry is deleted³.
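The lookup logic of the integration step might be sketched as below. The dictionary layout, the numeric labels and the function name are assumptions for illustration; the actual mapping table format is described in section 3.1.2.

```python
import posixpath


def attach_content_type(path, mapping_table, extension_types):
    """Return the content-type label for a request path, or None to drop the entry.

    mapping_table:   dict from relative URL to label (hypothetical layout)
    extension_types: dict from file extension to label for non-HTML documents
    """
    if path in mapping_table:
        return mapping_table[path]
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    if extension in ("html", "htm"):
        return None  # unknown HTML page: the entry is deleted
    return extension_types.get(extension)  # None if the extension is unsupported


mapping_table = {"/pub/minix/index.html": 3}   # hypothetical labels
extension_types = {"pdf": 7, "ps": 7}
print(attach_content_type("/pub/minix/index.html", mapping_table, extension_types))
print(attach_content_type("/papers/report.pdf", mapping_table, extension_types))
print(attach_content_type("/unknown/page.html", mapping_table, extension_types))
```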
3.4 Storing the log entries
The final step of the data preparation is to store the data in a convenient database. MySQL was
chosen as the database server in spite of the fact that the current version does not support stored
procedures.
In most cases it would be easier and faster to manipulate the data inside the database with such
internal methods, but their absence caused no insurmountable difficulties during the project.
The advantages of MySQL are that it is fast, easy to maintain, free to use for research purposes
and widely accepted. The database schema for storing cleaned log entries can be seen in
table 3.
2
Depending on the application. For continuous streaming data, a better solution would be to attach labels
online to entries, and probably it would use the content identification model also to identify unknown
contents besides a preset mapping table.
3
This step could be improved by using the original classifier model in case of a missing URL.
Database scheme of the cslog table
column name      type name
id               bigint
remotehost       varchar
rfc931           varchar
authuser         varchar
transdate        datetime
request          text
content_type     tinyint
status           smallint
bytes            int
referer          text
user_agent       text
Table 3
The column names correspond to the log field names mentioned in section 3.1.1 Access log files,
except for the content_type field, which refers to the attached content type described in the
previous section, and id, which is the unique identifier of the entries.
3.5 An overall picture
The following figure gives an overall picture of our data preparation scheme.
[Diagram “Loading/filtering/mapping access log data”. Components shown: RAW LOG (cslog.txt),
LogParser, Transaction, TransactionFilter, MappingTable (mapping_table.mtd), Transaction
(filtered, mapped), Log2Database and DATABASE; configuration files shown: extension.flt,
spider.flt and datahandling.prop.]
Figure 2: An overall picture of the data preparation
The first step in the data preparation process is to load the raw log files into memory line by
line with the LogParser object. This object transforms all entries into suitable Transaction
objects, which contain all the fields of the log file. Once a Transaction has been parsed, it goes
through the TransactionFilter, which filters out useless entries (by simply ignoring them). After
this step a content-type label is attached to all transactions by the MappingTable object. Finally
Log2Database loads the filtered transactions into the specified database.
4 Data structuring
Sessions, a.k.a. transactions⁴, constitute the basis of most web log mining processes. They are
related to users and composed of pages visited during a separate browsing activity.
This chapter starts with the description of user identification, which is essential for session
identification. This is followed by details on grouping of users, which is also a relevant topic as
characterization of them is the main goal of this project. The next paragraph deals with session
identification methods and types, while discussing moreover how the selection method is
restricted to groups of users. The final section presents a comprehensive overview of the data
structuring process.
4.1 User identification
Identification of users is essential for efficient data mining. It makes it possible to distinguish
user specific data within the whole data set.
It is straightforward to identify users in Intranet applications since they are required to identify
themselves by following the login process. It is much more complicated in the case of public
domains. The reason is that Internet protocols (e.g., HTTP, TCP/IP) do not require user
authentication from client applications (e.g., web browsers). The only identifying information
exchanged is the IP address of the client machine. Identification based on this information is
unreliable, because multiple users may use the same machine (and thus the same IP address) to
connect to the Internet, while a single user may use several machines to access the same
service. Besides, proxy servers and firewalls hide the true IP address of the client.
There are several ways to address this problem. Content providers can force users to register
for their services. In this way users have to follow a login process each time they want to
browse their contents. To avoid explicit user authentication, servers can use so-called cookies.
Cookies are user specific files stored on client machines. Each time a user visits the same
service, the server can obtain user information from the stored cookies.
The most accurate identification based solely on access log files is to use both the IP address
and the browser agent type as a unique user identification pair [13]. However, some papers use
IP/cookie pairs [2].
The identification procedure proposed in this thesis takes place inside the database as a select
query, which fills up the users table from the cslog table. Table 4 shows the data scheme of the
users table.
4
Market basket analysis terminology uses “transaction” for the items purchased at once. Meanwhile
the information technology (IT) sector uses “transaction” for a single client-server request-response
information exchange. Furthermore IT terminology also uses the term session (which is analogous to
a market basket) to denote consecutive user page visits, a.k.a. navigation sequences. To resolve the
conflict, this thesis uses both terms for navigation sequences, except in chapter 3 Data
preparation, where “transaction” means a single page access.
Data scheme of the users table
column name      type name
id               bigint
remotehost       varchar
host_name        varchar
TLD              varchar
user_agent       text
Table 4
Remotehost and user_agent fields are equal to the above mentioned pair while host_name and
TLD will be discussed in the next section (4.2 User groups).
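The thesis performs this identification as a select query inside the database. Purely as an illustration of the same idea, the sketch below assigns one user id to each distinct (remotehost, user_agent) pair in Python; the function name and the in-memory entry format are assumptions.

```python
def identify_users(entries):
    """Map each distinct (remotehost, user_agent) pair to a numeric user id."""
    user_ids = {}
    for entry in entries:
        key = (entry["remotehost"], entry["user_agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1  # ids are assigned in order of first appearance
    return user_ids


entries = [
    {"remotehost": "192.0.2.10", "user_agent": "Mozilla 4.0 (Linux)"},
    {"remotehost": "192.0.2.10", "user_agent": "Mozilla 4.0 (Windows)"},
    {"remotehost": "192.0.2.10", "user_agent": "Mozilla 4.0 (Linux)"},
]
print(identify_users(entries))
```

Two requests from the same IP address but with different browser agents are treated as two users here, which is exactly the distinction the pair is meant to capture.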
4.2 User groups
Arranging users into specific user groups is essential for further examinations. All the statistics
and models described later are based on sessions belonging to user groups.
The advantage of user authenticated systems is the availability of personal information on
registered users. This would help to form the most exact and diverse groups for them.
Possibilities are restricted to the information which can be mined from access log files in case of
public domains. In public domains, groups can be formed based on user IP addresses (network
ranges), geographical data, visiting frequency, etc.
Access log file entries contain either the IP address or the domain name in the remotehost field.
Therefore, whichever of the two is missing should be looked up and stored in the users table.
After this process the remotehost field refers to the IP address while the host_name field refers
to the corresponding domain name in the users table.
Organizational groups A natural grouping of users is present in most internal networks in the
form of subnetwork address ranges. Subnetwork address ranges determine subnetwork domains
within the whole network. There can be separate network ranges for user groups like staff,
management, students, administration, etc. Using these ranges and the IP addresses of users, a
variety of groups can be formed.
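Selecting users by subnetwork range can be sketched with the ipaddress module of the Python standard library. The group names and address ranges below are hypothetical (they use documentation-only address blocks), as is the function name.

```python
import ipaddress


def organizational_group(ip, group_ranges):
    """Return the first group whose network ranges contain the IP address."""
    address = ipaddress.ip_address(ip)
    for group, networks in group_ranges.items():
        if any(address in ipaddress.ip_network(network) for network in networks):
            return group
    return None


# Hypothetical ranges, for illustration only.
group_ranges = {
    "staff": ["192.0.2.0/25"],
    "students": ["192.0.2.128/25"],
}
print(organizational_group("192.0.2.10", group_ranges))
print(organizational_group("192.0.2.200", group_ranges))
```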
Geographical groups Most network (IP) addresses or network ranges have a domain
name registered to them. A domain name consists of level and sublevel names separated by
dots. The rightmost name of the whole string is the top level domain (TLD).
A TLD can be a country code like nl, hu, uk, etc. or a generic name reserved for a type of
organization, such as com, org, gov, etc. The rest of the domain name can be built of
organization names followed by department names etc., all in a hierarchical structure (e.g.,
www.cs.vu.nl).
Geographical distinction among users can be set up using TLD names. A group can be formed
for example based on the “nl” TLD. Users can be selected for this group by searching for the
“nl” TLD in their corresponding domain name. No geographical information can be obtained
from generic TLDs, such as the network infrastructure (net) and commercial (com) top level
domains. This is because these domains can be registered worldwide and thus they
have no clear relationship to countries.
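Extracting the TLD and forming country groups can be sketched as follows. The set GENERIC_TLDS is a small, incomplete list chosen for this example, and both function names are hypothetical.

```python
# Incomplete, illustrative list of generic (non-country) top level domains.
GENERIC_TLDS = {"com", "org", "net", "gov", "edu", "int", "mil"}


def top_level_domain(host_name):
    """Return the rightmost label of a domain name, or None for IP addresses."""
    host = host_name.strip().rstrip(".").lower()
    if "." not in host:
        return None
    tld = host.rsplit(".", 1)[1]
    return tld if tld.isalpha() else None


def country_group(host_name):
    """Return the country-code TLD, or None if no country can be inferred."""
    tld = top_level_domain(host_name)
    if tld is None or tld in GENERIC_TLDS:
        return None
    return tld


print(country_group("www.cs.vu.nl"))
print(country_group("www.example.com"))
```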
4.3 Session identification
Sessions constitute the basis of most web log mining processes. They are related to users and
composed of pages visited during a separate browsing activity. Visited pages belong to a
specific domain and form a sequence in visiting order.
It is worth mentioning that not all requests are present in log files. Most browsers use a
cache that allows previously visited pages to be reused instead of being downloaded
again. Besides, proxy servers also use page caching. They collect frequently visited pages
within a company and store them to reduce bandwidth load.
As a result, some pages of a visiting sequence are viewed in “offline” mode, which means that
no entry in the log files refers to these accesses. This problem can be solved by setting the
expiration timestamp of pages to a minimum, which forces clients to download expired pages
again. However, this solution assumes that we can change the structure of documents. Several
methods were proposed (e.g., [13]) to offer algorithmic solutions for this problem. We believe
that the main characteristics can be observed without such data preparation techniques.
There are several session identification methods described in the literature [6,
13, 20]. The most widely accepted methods are the so-called time frame (or time window)
identification [13] and the maximal forward reference (MFR) identification [7].
Both methods work on pre-selected page accesses, so they work on data grouped by users and
ordered by access time. The data consists of the user identification number (id field), the date
and time of page access (transdate field) and the content type of the visited page (content_type
field). In addition, MFR requires the request URL (request field).
The time frame identifier method divides the page accesses of a user using a time window. The
suggested length of this window or time interval is approximately 30 minutes [13, 14, 30], and
most commercial products also use a 30 minute timeout for splitting. The identifier iterates
through the entries, and whenever an entry’s access time (transdate) falls outside the current
time interval it starts a new session and starts measuring the time interval again from that entry.
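The time frame identifier can be sketched as below. The sketch follows the description above literally, i.e. the window is measured from the first entry of the current session (an inactivity-based timeout would be a different variant). The function name and the input format, a time-ordered list of (transdate, content_type) pairs for one user, are assumptions.

```python
from datetime import datetime, timedelta


def time_frame_sessions(accesses, window=timedelta(minutes=30)):
    """Split one user's time-ordered page accesses into sessions."""
    sessions = []
    window_start = None
    for transdate, content_type in accesses:
        # Start a new session when the access falls outside the current window.
        if window_start is None or transdate - window_start > window:
            sessions.append([])
            window_start = transdate
        sessions[-1].append(content_type)
    return sessions


base = datetime(2004, 5, 16, 8, 0)
accesses = [(base + timedelta(minutes=m), ct)
            for m, ct in [(0, 1), (10, 2), (29, 3), (31, 4), (70, 5)]]
print(time_frame_sessions(accesses))  # [[1, 2, 3], [4], [5]]
```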
The maximal forward reference identifier adds page access entries to a session list up to the
page before a backward reference is made. A backward reference is defined to be a page that is
already contained in the set of pages of the current transaction. In that case the identifier starts
a new session list and goes on with the iteration. For example, an access sequence of A B C B D
E E E F G would be broken into four transactions, i.e. A B C, B D E, E, and E F G. The
drawback of this method is that it does not consider that some of the “backward” references may
provide useful information. Besides, it may place entries in the same session even if a week
elapsed between them.
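The splitting rule described above can be sketched in a few lines. The function name is hypothetical, and the sketch implements only the rule as stated in this section (a new session starts at every page already seen in the current session).

```python
def mfr_sessions(pages):
    """Split a user's ordered page sequence at every backward reference."""
    sessions = []
    current = []
    seen = set()
    for page in pages:
        if page in seen:
            # Backward reference: close the current session and start a new one.
            sessions.append(current)
            current = []
            seen = set()
        current.append(page)
        seen.add(page)
    if current:
        sessions.append(current)
    return sessions


print(mfr_sessions("A B C B D E E E F G".split()))
# [['A', 'B', 'C'], ['B', 'D', 'E'], ['E'], ['E', 'F', 'G']]
```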
4.4 An overall picture
The figure below represents the functional model of session identification:
[Diagram “GetSessions”. Components shown: DATABASE (cslog, users tables),
TransactionMemoryIterator, TransactionSimple, UserGroupSelector (UserIPGroupSelector,
UserCountryGroupSelector), Identifier (TimeFrameIdentifier, MFRIdentifier),
SessionFormatPrinter and the configuration file webmining.prop. Data flows shown: page access
entries, selected entries, identified sessions; output: user sessions in the appropriate data
format.]
Figure 3: Functional model of the session identification process
At the beginning the TransactionMemoryIterator object retrieves all the log entries from the
cslog table ordered by id and sub-ordered by transdate.
Note that although the number of log entries can be large, the memory requirement of the whole
dataset is still manageable because all the information needed for an entry is its id, content_type
and transdate (and URL for MFR identification).
After fetching the data, TransactionMemoryIterator iterates through the user ids and for each id
it asks UserGroupSelector to decide whether the given user belongs to the selected group or not.
More specifically, UserGroupSelector can be a subnetwork range selector (UserIPGroupSelector)
or a geographical group selector (UserCountryGroupSelector), depending on the settings
in the webmining.prop properties file (for more information on group selections refer to section
4.2 User groups).
When a user is selected by the group selector, the user’s access entries are passed on to the
Identifier, which divides them into user sessions.
Note again that, as described earlier in the session identification section (4.3 Session
identification), an Identifier can more specifically be a time frame identifier
(TimeFrameIdentifier) or a maximal forward reference identifier (MFRIdentifier).
Finally, the identified sessions of a user are appended to the output file by the
SessionFormatPrinter in the appropriate format (e.g., association rule format, mixture model
format, global tree model format, etc.).
5 Profile mining models
So far we discussed all techniques and steps required for data preparation and data enrichment.
This chapter deals with the discussion of data mining models used in this project for pattern
discovery on enriched data.
It starts with an explanation of the widely used association rules mining technique and follows
with the discussion of a recent model called the mixture model. Finally it presents the global
tree model, which represents session data in a natural way and makes it easy to mine
session-specific statistics on stored data. This model is also able to represent its structure in an
easily interpretable graphical way.
Consider the following formal notion⁵ as dataset representation for all the models described
below:
Notion 5.1
Let $D = \{D_1, D_2, \ldots, D_N\}$ be a transaction or session data set generated by $N$
individuals, where $D_i$ is the observed data on the $i$-th user, $1 \le i \le N$. Each individual
data set $D_i$ consists of a set of one or more transactions for that user, i.e.,
$D_i = \{y_{i1}, \ldots, y_{ij}, \ldots, y_{in_i}\}$, where $n_i$ is the total number of transactions
observed for user $i$, and $y_{ij}$ is the $j$-th transaction for user $i$, $1 \le j \le n_i$.
An individual session $y_{ij}$ consists of content-type references of visited pages within a
user session: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijk_{ij}}\}$, where $k_{ij}$ is the
length of the $i$-th user’s $j$-th session, $k_{ij} \ge 1$.
$n_{ijk}$ is a content-type reference, which can take values from the content type reference
range: $1 \le n_{ijk} \le K$. Each reference of the range $1 \ldots K$ refers to a content group
(refer to section 3.1.2 Content types mapping table).
5.1 Mining frequent itemsets
One of the most well known and popular data mining techniques is the association rules (AR) or
frequent itemsets mining algorithm. The algorithm was originally proposed by Agrawal et al.
[1] for market basket analysis. Because of its significant applicability, many revised algorithms
have been introduced since then, and AR mining is still a widely researched area.
5
Note that the notion is almost the same as it was proposed in [9], with the difference that transactions are
not considered as sets of items but rather as an ordered list of content types of visited pages within a
session.
The aim of association rule mining is to explore relations and important rules in large datasets,
expressed as implications of the form “if premise then conclusion” ($X \rightarrow Y$,
$X \cap Y = \emptyset$).
A dataset is considered as a sequence of entries consisting of attribute values also known as
items. A set of such items is called an itemset (entries themselves are itemsets). Formally,
Let $I = \{i_1, i_2, \ldots, i_n\}$ be the collection of all items, where $i_j$,
$j \in \{1, \ldots, n\}$, is an item. An itemset is a collection of items, where each item can
occur at most once. A transaction or session is an itemset.
Using the notions (Notion 5.1) introduced at the beginning of this chapter, items refer to the
$n_{ijk}$ content-type references and an itemset is a $y_{ij}$ user session with the restriction
that each item can occur at most once.
A problem with association rules is that for a given number $i$ of items there are $2^i$ itemsets
and for each $k$-itemset there are $2^k$ rules. This could result in an unacceptable number of
rules. The solution is to consider only rules with a support and confidence higher than the
thresholds $s$ and $c$.
Let $X \rightarrow Y$, $X \cap Y = \emptyset$, be an association rule.
It has support $s$ (in $D$) if $s\%$ of the transactions in $D$ contain $X \cup Y$.
It has confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.
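The two measures can be computed directly from these definitions. The sketch below is only an illustration: the function names are hypothetical and the sessions are made-up lists of content-type references.

```python
def support(transactions, itemset):
    """Percentage of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return 100.0 * hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """Percentage of transactions containing the premise that also contain the conclusion."""
    premise = set(premise)
    both = premise | set(conclusion)
    n_premise = sum(1 for t in transactions if premise <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return 100.0 * n_both / n_premise if n_premise else 0.0


sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]   # made-up example data
print(support(sessions, {1, 2}))        # 75.0
print(confidence(sessions, {2}, {1}))   # 75.0
print(confidence(sessions, {1}, {2}))   # 100.0
```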
The problem of mining association rules can be decomposed in two major steps:
1. Find all frequent itemsets that have support greater than the threshold s and
2. for each frequent itemset, generate all the rules that have confidence greater than the
threshold c.
“Apriori” was the first association rules mining algorithm. Lots of improved algorithms (most
of them are “apriori”-based) have been introduced since it was published. In the following we
give the pseudo code of the “apriori” algorithm [1].
Initial conditions
  L_k : set of large k-itemsets (itemsets that have minimal support)
  C_k : set of candidate k-itemsets
  D : set of transactions (as described above), t ∈ D
  s : support threshold
Algorithm
  L_1 = {frequent 1-itemsets};
  for (k = 2; L_(k−1) ≠ ∅; k++) {
    C_k = set of new candidates;
    for all transactions t ∈ D
      for all k-subsets sub of t
        if (sub ∈ C_k) sub.count++;
    L_k = {c ∈ C_k | c.count ≥ s};
  }
  Set of all frequent itemsets = ∪_k L_k
C_k = set of new candidates
  step 1: C_k ⇐ ∅
  step 2: C_k ⇐ {p ∪ q | p, q ∈ L_(k−1) ∧ |p ∪ q| = k}
  step 3: C_k ⇐ C_k − {p | p ∈ C_k ∧ p has a (k−1)-subset that is ∉ L_(k−1)}
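The pseudo code can be turned into a compact Python sketch. It is an illustrative, unoptimised version: the support threshold is taken as an absolute count (min_count), transactions are treated as sets, and the function name is hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a dict mapping every frequent itemset (frozenset) to its count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}   # L_1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Step 2: join pairs of (k-1)-itemsets whose union has exactly k items.
        candidates = {p | q for p in prev for q in prev if len(p | q) == k}
        # Step 3: prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}   # L_k
        frequent.update(level)
        k += 1
    return frequent


sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]   # made-up example data
result = apriori(sessions, min_count=2)
print(sorted((sorted(itemset), n) for itemset, n in result.items()))
```

In this example the candidate {1, 2, 3} is pruned in step 3 because its subset {1, 3} is not frequent.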
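Rules can be generated incrementally, starting from 1-itemset conclusions, because of the
following property of confidence:
Let $L$ be a frequent itemset and $A \subset L$ a subset. Then the following
statement is true:
If the confidence of $(L - A) \Rightarrow A$ is $c$, then for any $B \subset A$ the confidence
of $(L - B) \Rightarrow B$ is at least $c$.
This holds because both rules share the same numerator, the support of $L$, while the premise
$L - B$ contains $L - A$ and therefore cannot have a larger support.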
5.2 The mixture model
In their paper Cadez et al. (2001) [5] proposed a generative mixture model for predicting user
profiles and behaviours based on historical transaction data. A mixture model is a way of
representing a more complex probability distribution in terms of simpler models. It uses a
Bayesian framework for parameter estimation. In addition, the mixture model addresses
the heterogeneity of page visits: even if a user has not visited a page before, the model can
predict it with a low probability.
Cadez et al. (2001) presented both a global and an individual model; this thesis applies only the
global mixture model. In this thesis transaction data consistently mean web page visits or
sessions, instead of the slightly different market basket data mentioned in [5]. While sessions are
ordered sequences of visited pages, market baskets are sets of purchased items. However,
session data can be simply transformed into the market basket data structure for applying the
mixture model:
Notion 5.2 (alteration of Notion 5.1)
For the mixture model approach the transaction notion should be altered in the following
way: an individual session $y_{ij}$ consists of counts of content type references of visited
pages within a user navigational sequence: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijK}\}$,
where $n_{ijk}$ indicates how many pages of content type $k$ are in the $i$-th user’s $j$-th
session, $1 \le k \le K$, $n_{ijk} \ge 0$.
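The transformation from an ordered session (Notion 5.1) to a count vector (Notion 5.2) is a simple tally. A minimal sketch, assuming content types are numbered 1..K; the function name is hypothetical.

```python
def session_to_counts(session, num_types):
    """Convert an ordered list of content-type references (1..K) to a count vector."""
    counts = [0] * num_types
    for content_type in session:
        counts[content_type - 1] += 1
    return counts


print(session_to_counts([2, 1, 2, 4], num_types=4))  # [1, 2, 0, 1]
```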
The global mixture model consists of K components. Each of the components describes a
prototype transaction forming a basis function. A component models the prototype of a specific
kind of session, which consists of visited page types with counts relatively higher than for other
items. A K-component mixture model for modeling a user’s site visit $y_{ij}$ is given below:
Notion 5.3 – K-component mixture model
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k P_k(y_{ij}) \qquad (1)$$
where $\alpha_k > 0$ is the component weight for the $k$-th component, $\sum_k \alpha_k = 1$,
and $P_k$, $1 \le k \le K$, is the $k$-th mixture component.
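As for modeling components, [5] proposed a simple memoryless multinomial model. For every
component there is a multinomial distribution $\Theta_k = (\Theta_{k1}, \ldots, \Theta_{kC})$,
conditioned on $n_{ij}$, the total number of pages visited in the $i$-th user’s $j$-th session.
Here $C$ denotes the number of categories, i.e. content types, and $n_{ijc}$ is the number of
pages of content type $c$ in the session (cf. Notion 5.2). The mixture model (Notion 5.3 – (1))
completed with multinomials can be written as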
Notion 5.4 – Mixture model with multinomials
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \Theta_{kc}^{n_{ijc}} \qquad (2)$$
The full data likelihood is presented below, assuming that the individuals behave independently
of each other:
Notion 5.5 – Full data likelihood
$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta) \qquad (3)$$
$\Theta$ represents the unknown parameters: both the parameters of the $K$ component
multinomials, $\{\Theta_1, \ldots, \Theta_K\}$, and the $\alpha$ vector of profile weights,
$\{\alpha_1, \ldots, \alpha_K\}$.
The unknown parameters $\{\Theta_1, \ldots, \Theta_K\}$ and $\{\alpha_1, \ldots, \alpha_K\}$ are
estimated by an expectation maximization (EM) algorithm.
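To make equation (2) and the EM estimation more concrete, the sketch below evaluates the mixture probability of a count vector and performs one plain maximum likelihood EM iteration for a multinomial mixture. It is a textbook-style illustration under simplifying assumptions, not the estimation procedure of [5] or of this thesis: every session is treated as an independent draw from the mixture, no priors are used, and all function names are hypothetical.

```python
import math


def session_probability(counts, alphas, thetas):
    """Equation (2): sum over components of alpha_k * prod_c theta_kc ** n_ijc."""
    return sum(
        alpha * math.prod(t ** n for t, n in zip(theta, counts))
        for alpha, theta in zip(alphas, thetas)
    )


def log_likelihood(sessions, alphas, thetas):
    return sum(math.log(session_probability(s, alphas, thetas)) for s in sessions)


def em_step(sessions, alphas, thetas):
    """One EM iteration; assumes every session has non-zero probability."""
    num_components, num_types = len(alphas), len(thetas[0])
    # E-step: responsibility of each component for each session.
    resp = []
    for counts in sessions:
        joint = [a * math.prod(t ** n for t, n in zip(th, counts))
                 for a, th in zip(alphas, thetas)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate the weights and the multinomial parameters.
    new_alphas = [sum(r[k] for r in resp) / len(sessions) for k in range(num_components)]
    new_thetas = []
    for k in range(num_components):
        weighted = [sum(r[k] * counts[c] for r, counts in zip(resp, sessions))
                    for c in range(num_types)]
        norm = sum(weighted)
        new_thetas.append([w / norm for w in weighted])
    return new_alphas, new_thetas


sessions = [[3, 0], [2, 1], [0, 3], [1, 2]]          # made-up count vectors
alphas, thetas = [0.5, 0.5], [[0.6, 0.4], [0.4, 0.6]]
before = log_likelihood(sessions, alphas, thetas)
alphas, thetas = em_step(sessions, alphas, thetas)
after = log_likelihood(sessions, alphas, thetas)
print(before, after)
```

Repeating em_step until the log-likelihood stops improving gives a local maximum likelihood estimate; a Bayesian variant would additionally include priors in the M-step.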
5.3 The global tree model
Pei et al. (2000) [23] propose a WAP-tree architecture for efficiently mining frequent itemsets.
Besides the tree structure, the tree based model contains a link-queue for each type of label. The
queues connect all nodes with the same label, forming chains. Xing and Shen (2003) [30] present
the so-called preferred navigation tree (PNT) for mining preferred navigation paths. A PNT
stores the URL, the frequency of visits and the visiting time in its nodes.
In our approach we use a global tree model (GTM). The GTM provides a special representation
of session data for groups of users. The structure of the model is similar to that of the PNT
presented in [30]. The model preserves the information obtained from the structure of sessions
and it stores individual pages in visiting order. In this model sessions with the same prefix share
the same branch of the tree. This results in less storage required for the model. Also, the model
was built to be able to visualize frequent navigational paths in a tree structure. Visualization
helps to understand the patterns by highlighting relevant information.
Each node in a tree model registers four pieces of information: content-type label, frequency
number, reference to its parent node and references to its children nodes. The root of the tree
model is a special virtual node with an optional title label and frequency 0. Every other node is
labelled by one of the content-type labels and is associated with a frequency which stores the
number of occurrences of the corresponding prefix ending with that content-type in the original
session database. A model consists of K⁶ branches (session trees) connected to the virtual root
node. Each branch contains a root node labelled with a unique content-type identifier. A branch
stores only those user sessions which start with a page labelled with the same content-type as its
root. Figure 4 presents the visualization of a sample tree model.
An A → B path of a tree from any A node to any B node (where the level number of A in the
tree is not greater than that of B) represents one or more subsessions where the frequency
6
K is the number of content-types, refer to Notion 5.1.
number of the B node represents the total number of sessions containing this ordered
subsequence pattern.
A special case of the A → B path is when A is the root node (of a session tree). In this case the
path represents one or more sessions or subsessions, depending on the frequency of the B node
and the summed frequency of its children nodes:
Let $f_B$ be the frequency number of the B node and let $sum = \sum_{C} f_C$, taken over all
children nodes $C$ of $B$, be the summed frequency of its children nodes.
Let $Root \rightarrow B$ be the path from the root node to the B node. Then $Root \rightarrow B$
represents at least one real session if $f_B > sum$, in which case the difference $f_B - sum$
gives the number of real $Root \rightarrow B$ sessions.
Building the tree model
Model building starts with the initialization of the K session trees. Each tree is initialized for a
unique content type k. Then every session of the data set is added to its corresponding session
tree: each session is examined for its first page type and the tree is selected according to the
result.
Adding a session to its tree can be implemented recursively. The recursive function takes a
parent node and a subsession as parameters and updates or creates the child node of this parent
with the content-type given by the first element of the subsession. The recursive step passes the
child node as the parent parameter together with the new subsession, which is obtained by
removing the first entry of the original subsession. The recursion stops when the length of the
subsession is at most one.
Algorithm to build the global tree model
Initial conditions
  s ∈ D : session
  s_i : the i-th element of session s
  sessionTrees : array [1..K] of SessionTree
  SessionTree : tree object for content type k, consists of a root node and children nodes
  node : a SessionTree node containing
    ct : content-type of the node
    freq : frequency of the node
    parent : node reference to the parent node
    children : array [1..L] of node references
Algorithm
  scheme of the algorithm:
    init sessionTrees;
    for all s ∈ D {
      sessionTrees[s_1].add(s);
    }
  initialization of sessionTrees:
    init sessionTrees {
      for i = 1..K {
        sessionTrees[i] = SessionTree(i, 0);
        root.ct ⇐ i;
        root.freq ⇐ 0;
        root.parent ⇐ null;
        root.children ⇐ null;
      }
    }
  adding a session to the corresponding SessionTree:
    sessionTrees[content_type].add(s : session) {
      if s_1 <> content_type
        return;
      addSession(root, s);
    }
    addSession(node : parentNode, s : session) {
      node.freq++;
      if s.length > 1 {
        s.removeFirstElement();
        if child for s.firstElement() exists {
          addSession(child for s.firstElement(), s);
        } else {
          addSession(create child for s.firstElement(), s);
        }
      }
    }
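The building algorithm translates almost line by line into Python. The sketch below is one possible rendering, not the thesis code: the class and function names are hypothetical, children are kept in a dictionary keyed by content type, and an index is passed along instead of removing the first element of the session.

```python
class Node:
    def __init__(self, ct, parent=None):
        self.ct = ct            # content-type label
        self.freq = 0           # number of sessions sharing this prefix
        self.parent = parent
        self.children = {}      # content type -> child Node

    def real_sessions(self):
        """Sessions that end exactly at this node: f_B minus the children's frequencies."""
        return self.freq - sum(child.freq for child in self.children.values())


def add_session(node, session, pos=0):
    """Recursive step: count the node, then descend to the child of the next page."""
    node.freq += 1
    if pos + 1 < len(session):
        ct = session[pos + 1]
        child = node.children.get(ct)
        if child is None:
            child = node.children[ct] = Node(ct, parent=node)
        add_session(child, session, pos + 1)


def build_global_tree(sessions, num_types):
    """One session tree per content type 1..K; a session goes to the tree of its first page."""
    trees = {k: Node(k) for k in range(1, num_types + 1)}
    for session in sessions:
        if session:
            add_session(trees[session[0]], session)
    return trees


trees = build_global_tree([[1, 2, 3], [1, 2], [1, 3], [2, 1]], num_types=3)
print(trees[1].freq, trees[1].children[2].freq, trees[1].children[2].real_sessions())
```

Sessions with the same prefix share nodes, so the three sessions starting with content type 1 are stored in a single branch. For very long sessions an iterative version would avoid Python's recursion limit.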
Mining preferred paths from GTM
Preferred navigation paths can be mined directly from the tree model. (A, B) paths or sessions
that have a higher support than a preset threshold value are the preferred navigation paths.
The algorithm given below scans each level of all session trees for possible candidates ignoring
branches that have low support.
Mining preferred paths
initial conditions
  candidates : the list of candidate nodes
  candidateChildren : the list of candidate children nodes
  supported : the list of supported sessions and their support values
  s : support threshold
algorithm
  candidates ⇐ root node
  while candidates.size() <> 0 do {
    candidateChildren ⇐ empty
    for i = 1..candidates.size() do {
      if frequencyOf(root, candidates[i]) ≥ s {
        supported ⇐ add((root, candidates[i]), support)
      }
      // gather possible child candidates
      for j = 1..candidates[i].NumberOfChildren do {
        if child_j.freq ≥ s {
          candidateChildren ⇐ add(child_j);
        }
      }
    }
    candidates ⇐ candidateChildren;
  }
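The level-wise scan can be sketched as follows. To keep the example self-contained, the tree is represented by nested dictionaries built with a small helper; this representation, the helper and the function names are assumptions of the sketch rather than the data structures of the thesis.

```python
def build_tree(sessions):
    """Prefix tree as nested dicts; the virtual root has frequency 0."""
    root = {"freq": 0, "children": {}}
    for session in sessions:
        node = root
        for ct in session:
            node = node["children"].setdefault(ct, {"freq": 0, "children": {}})
            node["freq"] += 1
    return root


def preferred_paths(root, min_support):
    """Return {path from the root: frequency} for all paths with enough support."""
    supported = {}
    candidates = [((ct,), child) for ct, child in root["children"].items()]
    while candidates:
        candidate_children = []
        for path, node in candidates:
            if node["freq"] >= min_support:
                supported[path] = node["freq"]
                # A child is never more frequent than its parent, so children
                # only need to be examined below supported nodes.
                for ct, child in node["children"].items():
                    if child["freq"] >= min_support:
                        candidate_children.append((path + (ct,), child))
        candidates = candidate_children
    return supported


tree = build_tree([[1, 2, 3], [1, 2], [1, 3], [2, 1]])
print(preferred_paths(tree, min_support=2))  # {(1,): 3, (1, 2): 2}
```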
Trees’ similarity
In the following a tree similarity measure is proposed for determining the distance between
different tree models. By means of the similarity measure we can determine the likeness of
trees. We expect that the similarity of two trees built from two distinct session data sets will
be high if the data sets were generated by users with similar behaviour. The proposed distance
measure considers both the structure of the tree and the frequency of tree nodes. The distance
satisfies the following criteria:
Assumptions
1. A similarity distance measures not only the structure of trees but also (or rather) the
frequencies of their nodes. Higher frequencies should be taken into account with higher
weights.
2. The extra information that originates from sessions should be exploited.
3. Considering trees $T_1$, $T_2$: the distance of $T_1$ from $T_2$ should be equal to the
distance of $T_2$ from $T_1$. Formally $T_1.dist(T_2) = T_2.dist(T_1)$.
The distance measure proposed is a simple approach based on forming the intersection of the
two trees’ session dataset.
Trees’ similarity
initial conditions
  T_1, T_2 : the two tree models
  candidates : the list of candidate nodes
  candidateChildren : the list of candidate children nodes
  sum : registers the number of all common sessions
algorithm
  sum ⇐ 0;
  candidates ⇐ put all the root nodes from session trees
               that occur in both trees;
  while candidates.size <> 0 do {
    candidateChildren ⇐ empty;
    for i = 1..candidates.size do {
      if T_1 and T_2 both have (root, candidates_i) session(s) {
        sum ⇐ add the number of common sessions;
      }
      candidateChildren ⇐ put all the children nodes of candidates_i
                          that are present in both trees;
    }
    candidates ⇐ candidateChildren;
  }
The similarity proportion can then be calculated by dividing the sum value by the number of
different sessions in the two trees, i.e. the total number of sessions in the two trees minus the
sum value: $similarity = sum / (|T_1| + |T_2| - sum)$, where $|T|$ denotes the number of
sessions stored in tree $T$. Multiplying the resulting value by 100 gives the
similarity percentage.
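A possible implementation of the similarity percentage is sketched below. It reuses a nested-dictionary tree representation for brevity and makes one interpretive assumption that the text does not spell out: the number of common sessions at a path present in both trees is taken as the minimum of the two “real” session counts. All names are hypothetical.

```python
def build_tree(sessions):
    """Prefix tree as nested dicts; the virtual root has frequency 0."""
    root = {"freq": 0, "children": {}}
    for session in sessions:
        node = root
        for ct in session:
            node = node["children"].setdefault(ct, {"freq": 0, "children": {}})
            node["freq"] += 1
    return root


def real_sessions(node):
    """Sessions ending exactly at this node (f_B minus the children's frequencies)."""
    return node["freq"] - sum(c["freq"] for c in node["children"].values())


def similarity(t1, t2):
    """Similarity percentage of two trees based on their common sessions."""
    total = (sum(c["freq"] for c in t1["children"].values())
             + sum(c["freq"] for c in t2["children"].values()))
    common = 0
    candidates = [(t1["children"][ct], t2["children"][ct])
                  for ct in t1["children"] if ct in t2["children"]]
    while candidates:
        candidate_children = []
        for a, b in candidates:
            # Assumption: common sessions = minimum of the real session counts.
            common += min(real_sessions(a), real_sessions(b))
            for ct, child_a in a["children"].items():
                if ct in b["children"]:
                    candidate_children.append((child_a, b["children"][ct]))
        candidates = candidate_children
    different = total - common
    return 100.0 * common / different if different else 0.0


t1 = build_tree([[1, 2, 3], [1, 2], [1, 3], [2, 1]])
t2 = build_tree([[1, 2], [1, 2], [3, 1]])
print(similarity(t1, t2), similarity(t1, t1))
```

With this choice the measure is symmetric, as required by the third assumption, and two identical trees reach 100%.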
Visualization of tree models
Frequent navigational paths are conventionally represented by text or tables, which are not easy
to understand. Visualization of a tree model, however, makes it easy to interpret the patterns. A
picture of a tree model consists of nodes with content-type labels and their colour codes. Nodes
are connected with lines (edges) of different thickness marking the frequencies of the given
paths. Besides thickness, edges carry proportional numbers for each child of a node, measuring
the distribution of frequencies among the children nodes. In addition, the number of “real”
sessions for that path of the tree is given in parentheses. The tree visualization contains only the
supported sessions based on a support threshold set for the model. Figure 4 presents a sample
tree.
Figure 4: Visualization of a sample tree model
The sample tree above (figure 4) contains nine different content-type nodes. Its most frequent
“starting” node is english/department. 62% of the visitors (9 visitors in this case) start on
the department pages and then go on to the faculty pages. Faculty pages have a visiting rate of
100% within this branch, which means that all users who went on from the department pages also
visited faculty pages, and so on.
6 Analysing log files of the www.cs.vu.nl web server
For the purpose of this thesis the discussion is restricted to the analysis of user behaviour
on a single web domain, www.cs.vu.nl. All the data used in the following experiments therefore
relate to the web server of the Computer Science Department of the Vrije Universiteit.
This chapter presents experimental results using all the techniques described earlier. The first
section describes the details of the input access log files and mapping table. This is followed by
experimental results of data preparation and data structuring techniques. Finally, the last
sections present results of the three profile mining models AR, MM and GTM.
Results of association rule and frequent itemset mining show which sets of pages users tend to
visit within a session and which rules can be defined on frequent itemsets. A mixture model
can tell what distribution the data come from and how many components (based on different
user behaviours) are likely to have generated the data. Both AR mining and the mixture model
ignore the information that can be mined from the order of pages within sessions. The
global mixture model, in contrast, is based on the structure of sessions. It can answer the
question of which session sequences (or subsequences) are highly preferred by users. It also
provides a visualization of frequent navigational paths in a tree structure.
Most of the algorithms were implemented in the Java programming language. For further details
on their implementation, each section refers to the relevant APPENDIX table.
Only the most frequent and most important patterns are presented in this chapter; the CD-ROM
that accompanies this master thesis contains all results and outputs of the experiments
(refer to APPENDIX E).
6.1 Input data
The input data in this case are the access log files of the www.cs.vu.nl web server for a certain
period of time, the content types mapping table of the HTML pages of the www.cs.vu.nl domain
and the organizational and geographical information for user group identification.
6.1.1 Access log files
Four consecutive access log files were collected and merged together from the www.cs.vu.nl
server. In total they sum up to one month of access log entries. The details are summarized in
the table below:
Details on the merged access log entries
File name | Size (MB) | Period | Number of entries
cs_access_log_20040530-20040704 | 1 533,344 | 30 May 2004 – 4 July 2004 | 7 126 732
Table 5
The Apache web server of the www.cs.vu.nl domain writes the following fields, in the given
order, into the log files: remotehost, rfc931, authuser, date, request, status, bytes, referrer,
user_agent. For the accepted access log file structure refer to APPENDIX B2.
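As an illustration, a log entry with these nine fields could be split with a regular expression
such as the one below. This is a sketch that assumes the usual layout of Apache's combined log
format (date in square brackets; request, referrer and user agent in double quotes); the exact
structure accepted by the thesis software is the one in APPENDIX B2. The sample line and the
name parse_entry are made up for the example.

```python
import re

# Field names follow the order given in the text; the pattern itself is an assumption.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Invented sample entry (documentation IP address, not real log data).
sample = ('192.0.2.7 - - [30/May/2004:10:15:32 +0200] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://www.cs.vu.nl/" "Mozilla/4.0 (compatible)"')
entry = parse_entry(sample)
print(entry["remotehost"], entry["status"], entry["request"])
```

Entries whose quoted fields contain escaped double quotes would not match this simple pattern
and would need extra handling.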
6.1.2 The mapping table
Data enrichment is partly based on the content information of visited web pages. This
information is given by a table with URL/content type entries. The table was generated by a text
mining algorithm that was developed in a different project [3]. The text mining algorithm
attaches labels to all HTML pages of a document set based on their contents.
The HTML pages (VU-pages) were downloaded by wget [36] invoked with the following
parameters:
wget -l5 -r -t5 -A.htm,.html http://www.cs.vu.nl
These parameters make wget recursively download all *.htm and *.html files from the
www.cs.vu.nl domain to a depth of five levels. If a page cannot be accessed, wget makes at most
five attempts in total, that is, it retries the download up to four more times.
This resulted in a collection of 13 011 HTML pages (with a total size of about 90MB) that were
subsequently assigned to 19 categories:
Description of the content-types (content-categories)
type id | type name | description
1 | photo | This type refers to pages containing a negligible quantity of textual information with one or more images. It most likely refers to personal photo albums, lecture slides or informational pages with messages like “under construction” or “this page has been moved to …”.
2 | miscellaneous | “Miscellaneous” type refers to pages with absent or insufficient content. It most likely refers to framesets, empty, file list, form, moved or redirected pages. It can contain photo pages as well in case the page doesn’t contain relevant textual information.
3 | dutch/department | This type-group contains department pages in Dutch.
4 | english/reference | “English/reference” group most likely refers to pages containing e-books or manual pages for different systems or programs. It can be a manual for an operating system or an API reference for a programming language. It contains pages written in English.
5 | english/activity | This group most likely refers to pages containing invitations for official or free time activities. Among these events can be science conferences, exhibitions, concerts, trips for international students or any other happening which is in connection with the University. The group contains pages written in English.
6 | english/department | This category contains department pages in English.
7 | english/project | This type most likely refers to research projects of the computer science department written in English.
8 | english/person/faculty | This group most likely refers to pages of the faculty members. They are usually very formal and they mostly consist of fields of research, professional background, research projects and other information related to the member’s research area or department. It contains pages written in English.
9 | english/person/student | This group most likely refers to student pages. Student pages mostly contain personal information (e.g., hobby, lyrics, etc.) and links to pages of friends and courses. The group contains pages written in English.
10 | english/person/faculty/publication | “English/person/faculty/publication” category most likely refers to pages containing publications of faculty members comprising at least the abstracts. It contains pages written in English.
11 | english/course | This group most likely refers to course pages. They mostly contain the description of the course, lecture slides, recommended literature and set assignments in English.
12 | dutch/course | The same as the english/course group, but containing pages written in Dutch.
13 | dutch/person/student | The same as the english/person/student group, but containing pages written in Dutch.
14 | dutch/person/faculty | The same as the english/person/faculty group, but containing pages written in Dutch.
15 | other_language | This type-group contains pages written in other languages than English or Dutch.
16 | dutch/project | The same as the english/project group, but containing pages written in Dutch.
17 | dutch/activity | The same as the english/activity group, but containing pages written in Dutch.
18 | documents | This group contains documents in Adobe Acrobat (pdf) or Postscript (ps) format. They are most likely to be scientific papers, publications, e-books, etc.
19 | other documents | “Other documents” contains documents in Microsoft Word (doc), Microsoft PowerPoint (ppt), Microsoft Excel (xls), Rich text (rtf) or plain text (txt) format. They are most likely to be administrative papers, forms, course materials etc.
Table 6: Description of the content-types
The labelling algorithm (supported by a human) provided only an approximate categorization of
pages. Roughly 74% of the pages received the right labels; see [3] for details.
To shorten type names, in some places we use the letters “E” and “D” to refer to the English
and Dutch groups (e.g., “E/department” refers to “english/department”). For the file structure
of the mapping table accepted by the webmining package, refer to APPENDIX B3.
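The enrichment step that uses the mapping table can be pictured as a simple lookup. The sketch
below is an assumption-laden illustration: the dictionaries URL_TYPES and EXTENSION_TYPES and
the function content_type are hypothetical, the two URL entries are invented, and the real file
format is the one given in APPENDIX B3. The extension list for categories 18 and 19 follows the
descriptions in Table 6.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical excerpt of the URL/content-type table (entries invented for the example).
URL_TYPES = {"/index.html": 6, "/~staff/pubs.html": 10}

# Categories 18 and 19 are assumed to be recognised by file extension only.
EXTENSION_TYPES = {".pdf": 18, ".ps": 18, ".doc": 19, ".ppt": 19,
                   ".xls": 19, ".rtf": 19, ".txt": 19}

def content_type(url):
    """Return the content-type id for a URL, or None for a mapping error."""
    path = urlparse(url).path
    if path in URL_TYPES:
        return URL_TYPES[path]
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSION_TYPES.get(extension)

print(content_type("http://www.cs.vu.nl/index.html"))
print(content_type("/papers/thesis.PDF"))
```

A return value of None corresponds to a page that is missing from the mapping table.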
6.1.3 Experiments on data preparation
The following table contains statistical results of the access log data filtering and content type
data integration.
Statistics of the access log data filtering and content type data integration
Filtering method | Number of filtered (bad) records | Percentage of the total nr. of entries
Unsupported extensions | 5 062 579 | 71,04%
Spider transactions | 2 779 464 | 39,00%
Dynamic pages | 205 046 | 2,88%
Unsupported HTTP request methods | 55 343 | 0,78%
Corrupt escape characters | 10 305 | 0,14%
Unsuccessful requests | 381 454 | 5,35%
Domain filter | 728 | 0,01%
Path completion or anchor stripping | 778 079 | 10,92%
All methods (valid transactions) | 795 987 | 11,17%
Mapping errors (on valid transactions) | 348 048 | 4,88%
Total transactions stored (valid transactions with content type) | 447 939 | 6,29%
Table 7
Each number in the table was measured against the total number of records, so a record can be
counted by more than one filter and the percentages do not sum to 100%. Most of the filtering
methods above are needed to obtain more exact results on user behaviour. The elimination of
dynamic pages is the exception: it is not required for that reason, but the analysis of dynamic
pages would need a much more sophisticated system. Since the targeted domain does not contain a
significant amount of dynamic pages (2,88% of total accesses), and it can be assumed that static
and dynamic pages are mostly not mixed within a single user session, the corresponding filter
simply ignores all dynamic pages appearing in the log files.
Not surprisingly, the statistics show that most of the entries are requests for unsupported file
types. A vast number of transactions are also generated by spiders. The mapping table does not
contain content type entries for 43,73% of the valid transactions. We assume this results from
frequent changes to the pages of www.cs.vu.nl. Mapping errors occur when a page referred to by
a log entry is missing from the page collection (and therefore from the mapping table). The
total number of valid transactions after mapping the content-types to log entries is
447 939. These were stored in a database.
For the implementation details on data cleaning and data integration, refer to APPENDIX D1
and APPENDIX D2.
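The kind of filtering summarized in Table 7 can be sketched as a chain of simple checks. The
code below is a hypothetical illustration, not the implementation of APPENDIX D1: the accepted
extensions, the spider hints, the restriction to GET, the 2xx status test and the query-string
test for dynamic pages are all assumptions. Unlike Table 7, which counts every filter against
the full log, this sketch reports only the first filter that rejects an entry.

```python
import posixpath
from urllib.parse import urlparse

# Assumed configuration; the thesis does not list these values in this section.
SUPPORTED_EXTENSIONS = {".htm", ".html", ".pdf", ".ps", ".doc",
                        ".ppt", ".xls", ".rtf", ".txt"}
SPIDER_HINTS = ("bot", "crawler", "spider")

def rejection_reason(method, url, status, user_agent):
    """Return the name of the first filter that rejects the entry, or None."""
    parsed = urlparse(url)
    if method != "GET":
        return "unsupported HTTP request method"
    if not 200 <= status < 300:
        return "unsuccessful request"
    if any(hint in user_agent.lower() for hint in SPIDER_HINTS):
        return "spider transaction"
    if parsed.query:
        return "dynamic page"
    extension = posixpath.splitext(parsed.path)[1].lower()
    if extension and extension not in SUPPORTED_EXTENSIONS:
        return "unsupported extension"
    return None

print(rejection_reason("GET", "/index.html", 200, "Mozilla/4.0"))
print(rejection_reason("GET", "/logo.gif", 200, "Mozilla/4.0"))
```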
6.2 Distribution of content-types within the VU-pages and
access log entries
The following table shows the frequencies of content-types within the VU-pages and the access
log entries of the www.cs.vu.nl.
Distribution of content-types within the VU-pages and access log entries
id | category | VU-pages: frequency | VU-pages: percentage | log entries: frequency | log entries: percent. of (1-17) | log entries: percent. of total
1 | photo | 2 432 | 18,68% | 34671 | 9,49% | 7,74%
2 | miscellaneous | 2 727 | 20,95% | 67149 | 18,38% | 14,99%
3 | dutch/department | 3 | 0,02% | 17698 | 4,84% | 3,95%
4 | english/reference | 966 | 7,42% | 13240 | 3,62% | 2,96%
5 | english/activity | 32 | 0,25% | 640 | 0,18% | 0,14%
6 | english/department | 269 | 2,07% | 23138 | 6,33% | 5,17%
7 | english/project | 441 | 3,39% | 14984 | 4,10% | 3,35%
8 | english/person/faculty | 1 084 | 8,33% | 69334 | 18,97% | 15,48%
9 | english/person/student | 549 | 4,22% | 17203 | 4,71% | 3,84%
10 | english/person/faculty/publications | 1 901 | 14,60% | 39208 | 10,73% | 8,75%
11 | english/course | 111 | 0,85% | 19588 | 5,36% | 4,37%
12 | dutch/course | 806 | 6,19% | 11755 | 3,22% | 2,62%
13 | dutch/person/student | 1 210 | 9,29% | 31709 | 8,68% | 7,08%
14 | dutch/person/faculty | 10 | 0,08% | 260 | 0,07% | 0,06%
15 | other_language | 417 | 3,20% | 1718 | 0,47% | 0,38%
16 | dutch/project | 27 | 0,21% | 212 | 0,06% | 0,05%
17 | dutch/activity | 26 | 0,20% | 2901 | 0,79% | 0,65%
18 | documents* | - | - | 66801 | - | 14,91%
19 | other documents* | - | - | 15730 | - | 3,51%
- | total without 18-19 | 13 011 | 100,00% | 365408 | 100% | -
- | total with all | - | - | 447939 | - | 100,00%
* The mapping table entries contain only the extensions for categories 18 and 19.
Table 8
The two distributions of content types (table 8) reveal relevant information about user
behaviour. Relative to their large share of the HTML page collection, the categories photo,
E/reference and D/course are under-represented among user visits. One would not expect the
relatively low proportion of course visits (Dutch and English together 8,58%). Furthermore,
the high proportion of visits from the Netherlands (refer to figure 5 in section 6.3.1) may
indicate that students visit course pages mostly from home. Publications
(E/person/faculty/publication) are mostly visited from foreign countries, as one may expect. On
the other hand, the E/person/faculty, E/department and D/department categories have higher rates
among the log entries than among the pages. The high proportion of the E/department category can
be explained by the fact that pages in this class sit at the top level of the VU-pages’
hierarchy and thus provide the links for reaching other pages. In addition, many users within
the VU set department pages as their starting page. The E/reference category is mostly visited
from other countries, and documents are also mostly downloaded from foreign countries, as can be
seen in figure 5. The summed proportion of English pages within the VU-pages is 41,13% against
15,99% for Dutch pages. Among the log entries, 54% belong to English categories while 17,66%
belong to Dutch page requests.
6.3 Experiments on data structuring
This section provides details on data structuring. It first presents the user groups and their
related statistics, and then gives details on session identification.
6.3.1 The user groups formed for the users of www.cs.vu.nl
The remotehost field of a log entry is given either as an IP address or as a domain name. The
IP address is required for grouping users by network ranges, while the domain name is needed
for the geographical grouping. The UpdateDBIPAddresses program was used to replace every
remotehost field of the cslog table (refer to table 3 in section 3.4 Storing the log entries)
that was given as a domain name with the corresponding IP address. The next step is to select
users into the users table using the updated remotehost fields and the user_agent fields from
the cslog table. In the following step, the UpdateDBHostNames program fills in the host_name
field for every corresponding remotehost address in the users table. While processing the
domain names, it also determines their top level domain (TLD) and fills in the TLD field. For
details on UpdateDBIPAddresses and UpdateDBHostNames refer to APPENDIX D1.
A total of 118,141 users were identified from the log entries based on unique
remotehost/user_agent pairs. The groups described below distinguish among these users.
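The two operations just described, identifying users by distinct remotehost/user_agent pairs
and reading the TLD from a host name, can be illustrated as follows. The function names and the
sample entries are hypothetical; this is not the code of UpdateDBHostNames, and the actual DNS
lookups are left out.

```python
import re

IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def top_level_domain(host_name):
    """Return the TLD of a host name, or None for bare IP addresses or empty names."""
    if not host_name or IPV4.match(host_name):
        return None
    return host_name.rstrip(".").rsplit(".", 1)[-1].lower()

def identify_users(entries):
    """Collect the distinct (remotehost, user_agent) pairs."""
    return {(e["remotehost"], e["user_agent"]) for e in entries}

# Invented entries for the example.
entries = [
    {"remotehost": "192.0.2.7", "user_agent": "Mozilla/4.0"},
    {"remotehost": "192.0.2.7", "user_agent": "Mozilla/4.0"},
    {"remotehost": "192.0.2.7", "user_agent": "Opera/7.0"},
]
print(len(identify_users(entries)))
print(top_level_domain("few.vu.nl"))
```

Users whose address cannot be resolved to a host name get no TLD, which corresponds to the
"without geographical information" row of the TLD table.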
After identifying all the available IP addresses and domain names, the following demographic
data can be obtained from the users’ TLD field. The table below lists the 20 most frequent TLDs,
ranked by frequency; the last rows give a summarized count for all other top level domains and
the number of users without geographical information. A table containing the details of all
TLDs can be found in APPENDIX C3.
The 20 most frequent top level domains
rank | TLD | count | country
1 | nl | 19248 | Netherlands
2 | net | 18299 | network infrastructure
3 | com | 11457 | commercial
4 | fr | 3125 | France
5 | be | 3058 | Belgium
6 | de | 3001 | Germany
7 | ca | 2133 | Canada
8 | it | 2038 | Italy
9 | uk | 1903 | United Kingdom
10 | au | 1852 | Australia
11 | edu | 1803 | educational establishments
12 | jp | 1532 | Japan
13 | br | 1485 | Brazil
14 | ch | 963 | Switzerland
15 | mx | 935 | Mexico
16 | pl | 878 | Poland
17 | at | 635 | Austria
18 | fi | 610 | Finland
19 | dk | 553 | Denmark
20 | se | 531 | Sweden
- | - | 8642 | sum of all other countries
- | - | 33460 | number of users without geographical information
Table 9
Not surprisingly, the table shows that most user visits come from the Netherlands. Besides the
home country, users from nine other countries show a keen interest in the computer science
pages of the VU. Three of them are neighbouring (or nearby) countries, namely France, Belgium
and Germany. Visitors from these countries are probably students looking for further studies or
fellow researchers interested in project or member details. The other six countries are spread
worldwide. A total of 33,460 users have no geographical information, because no domain name
could be looked up for their IP addresses.
The following user groups were formed according to the available geographical and
organisational information.
Geographical groups
Groups formed by geographical information (by TLD acronyms) are described in the table
below:
The description of the geographical groups
group name | description | nr of users
nl | Contains users identified by the “nl” top level domain. | 19248
other | All the other countries and organizations, that is, those that differ from the “nl” TLD. Note that we didn’t eliminate com, org, net and edu TLDs from this category despite their “undeterminable” geographical origin. They form the basis and the most frequent part of this group, thus eliminating them would result in a loss of many valuable user sessions. However, during the analysis we have to consider that a significant part of this group may belong to the “nl” group. | 65433
Table 10
Organizational groups
In the Computer Science Department of the Vrije Universiteit there are separate network ranges
for user groups such as staff, students, administration, etc. The groups identified by their
address ranges are described in the table below:
The description of the organizational groups
group name | description | nr of users
staff | Contains users identified by the subnet network range addresses for teachers of the Computer Science department. | 274
student | Contains users identified by the subnet network range addresses for student machines of the Computer Science department. | 567
Table 11
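Assigning a user to an organizational group amounts to testing whether the IP address falls
into a given subnet. A minimal sketch using Python's standard ipaddress module is shown below.
The network ranges are placeholders taken from the address blocks reserved for documentation;
the real subnets of the department are not given in this section, and GROUP_NETWORKS and
organizational_group are hypothetical names.

```python
import ipaddress

# Placeholder ranges (documentation networks), NOT the real VU subnets.
GROUP_NETWORKS = {
    "staff": ipaddress.ip_network("192.0.2.0/24"),
    "student": ipaddress.ip_network("198.51.100.0/24"),
}

def organizational_group(remotehost):
    """Return the group whose subnet contains the address, or None."""
    try:
        address = ipaddress.ip_address(remotehost)
    except ValueError:
        return None  # not an IP address, e.g. an unresolved host name
    for group, network in GROUP_NETWORKS.items():
        if address in network:
            return group
    return None

print(organizational_group("192.0.2.15"))
print(organizational_group("203.0.113.9"))
```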
Figure 5 shows the distributions of content-types for the user groups. The most visited group
was the geographical “other” group followed by the “nl” group (with proportions of 58,74% and
41,26% of all geographically labelled transactions). Not surprisingly, the organizational
groups have a much lower visit rate than the geographical groups, since they contain far fewer
users (the proportions of the “staff” and “student” groups are almost identical, 52,42% and
47,58%). For more details on figure 5 refer to the analysis of table 8 (section 6.2).
[Chart: content-type identifiers (1-19) against visiting frequency (0 to 35000) for the nl,
other, staff and student groups.]
Figure 5: Distribution of content-types among user groups
6.3.2 Session identification
Two session identification methods were described earlier in this thesis. The following table
shows statistics on sessions identified by time frame (TF) for all user groups. The timeout
parameter was set to the “standard” [13] length of 30 minutes.
User group session statistics for time frame identification
group name | total nr. of sessions | session length: min | session length: avg | session length: max | session length: std. deviation
all users | 165 778 | 1 | 2.7 | 2 299 | 9.2
Geographical groups
nl | 39 671 | 1 | 3.39 | 275 | 7.29
others | 79 750 | 1 | 2.4 | 352 | 4.74
Subnet network ranges
staff | 2 795 | 1 | 5.5 | 193 | 11.41
students | 3 123 | 1 | 4.47 | 134 | 6.36
Table 12
The table above shows that users visit around 3-5 pages on average within a single session.
The statistics also show that users within the VU tend to visit more pages per session than the
average. The surprisingly large maximal length within “all users” would suggest a spider
transaction. However, the raw transaction data show no signs of spider activity: the user_agent
field does not contain any spider pattern, and the requested pages do not show a systematic
download. This may be because some spiders, for various reasons, pretend to be real users.
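Time frame identification with a 30 minute timeout can be sketched as below. The sketch
assumes that the timeout applies to the gap between two consecutive requests of the same user;
the exact definition is the one given earlier in the thesis, and the function name and sample
data are made up for the example.

```python
from datetime import datetime, timedelta

def time_frame_sessions(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) pairs into sessions."""
    sessions = []
    previous = None
    for timestamp, page in requests:
        # Assumption: a gap longer than the timeout starts a new session.
        if previous is None or timestamp - previous > timeout:
            sessions.append([])
        sessions[-1].append(page)
        previous = timestamp
    return sessions

start = datetime(2004, 5, 30, 10, 0)
requests = [
    (start, "E/department"),
    (start + timedelta(minutes=10), "E/person/faculty"),
    (start + timedelta(minutes=50), "E/course"),
]
print(time_frame_sessions(requests))
```

In the example the third request arrives 40 minutes after the second one, so it opens a second
session.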
The table below contains statistics on sessions identified by maximal forward reference (MFR)
for all user groups. It shows that all groups contain many more sessions than with TF
identification. This follows from the nature of MFR identification, which breaks the session
whenever a page has already occurred in it.
User group session statistics for maximal forward reference identification
group name | total nr. of sessions | session length: min | session length: avg | session length: max | session length: std. deviation
all users | 247407 | 1 | 1.81 | 705 | 4.23
Geographical groups
nl | 63478 | 1 | 2.12 | 705 | 5.04
other | 115764 | 1 | 1.65 | 487 | 2.87
Subnet network ranges
staff | 6245 | 1 | 2.46 | 108 | 5.7
students | 5306 | 1 | 2.63 | 66 | 2.42
Table 13
In both tables the geographical groups do not sum to the total number of sessions of all users.
This is because the database contains many IP addresses with missing domain names (host names
cannot be looked up for them).
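The effect of MFR identification on the number of sessions can be illustrated with a
simplified sketch. It implements only the rule stated above, namely that a session is closed
as soon as a page occurs that is already part of it; it is not a full maximal forward reference
algorithm with backtracking, and the function name is hypothetical.

```python
def forward_reference_sessions(pages):
    """Cut a user's page sequence whenever a page repeats within the current session."""
    sessions, current, seen = [], [], set()
    for page in pages:
        if page in seen:
            # Backward reference: close the current session and start a new one.
            sessions.append(current)
            current, seen = [], set()
        current.append(page)
        seen.add(page)
    if current:
        sessions.append(current)
    return sessions

print(forward_reference_sessions(["a", "b", "c", "b", "d"]))
```

The five requests of the example form a single time frame session but two sessions under this
rule, which is in line with the larger session counts and shorter average lengths in Table 13.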
Based on our observations, we use the time frame identified session data for the further
experiments. TF identification seemed more realistic on the examined database entries, and most
researchers also apply this method, for example [30]. All the following experiments are based
on session data instead of raw log entries.
6.4 Mining frequent itemsets
This section provides information on frequent page sets and association rules for all the user
groups. The AR implementation used by this project for data analysis is an Apriori-T (Apriori
Total) algorithm, developed by the LUCS-KDD research team, which makes use of a "reverse"
set enumeration tree where each level of the tree is defined in terms of an array (i.e. the
T-tree data structure is a form of Trie) [12] (see footnote 7). For further details on the
implementation refer to APPENDIX D4.
The support and confidence threshold values for the association rule mining algorithm were
tuned to yield as many important patterns as possible while keeping the share of useless
information low.
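To make the notion of support concrete, the sketch below counts itemset supports by brute force
over a handful of invented sessions. It is not the Apriori-T algorithm used in the thesis, which
avoids enumerating all combinations; the function names are hypothetical. The preparation step
follows the description of the input data: duplicate types are removed, types are sorted in
ascending order and one-page sessions are dropped.

```python
from collections import Counter
from itertools import combinations

def prepare(sessions):
    """Deduplicate and sort each session; drop sessions with a single page type."""
    prepared = [tuple(sorted(set(session))) for session in sessions]
    return [session for session in prepared if len(session) > 1]

def itemset_supports(sessions, size):
    """Relative support of every itemset of the given size (brute force)."""
    if not sessions:
        return {}
    counts = Counter()
    for session in sessions:
        counts.update(combinations(session, size))
    return {items: n / len(sessions) for items, n in counts.items()}

# Invented sessions given as lists of content-type ids.
raw = [[8, 10, 8], [6, 8], [8], [10, 8, 6]]
sessions = prepare(raw)
supports = itemset_supports(sessions, 2)
print(sessions)
print(supports[(6, 8)])
```

A support threshold then simply keeps the itemsets whose value is at least the chosen minimum.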
The analysis of all sessions presents an overall picture of all the user sessions retrieved
from the database. A more detailed characterisation follows in the analysis of the
geographical and organizational groups.
6.4.1 The analysis of all visits
The analysis of frequent itemsets within sessions of all users gives an overall picture of user
behaviour on the www.cs.vu.nl domain. Frequent one-itemsets with their supports are presented
in the table below:
Frequent one-itemsets of all visits
nr | items (content-type labels and category names) | support
1 | (8) E/person/faculty | 51,10%
2 | (10) E/person/faculty/publication | 35,45%
3 | (2) miscellaneous | 32,81%
4 | (6) E/department | 18,87%
5 | (11) E/course | 16,89%
6 | (13) D/person/student | 16,54%
7 | (4) E/reference | 16,00%
8 | (18) documents | 15,56%
9 | (1) photo | 14,65%
10 | (7) E/project | 12,45%
11 | (19) other documents | 9,64%
12 | (3) D/department | 9,59%
13 | (12) D/course | 9,52%
14 | (9) E/person/student | 9,51%
Table 14
Item (1) shows that more than half of the sessions contain pages of faculty members (in
English) and 35,45% of them include publication pages of faculty members (2). The high
support of miscellaneous pages (3) does not indicate any special custom. It probably shows that
a great proportion of the pages contain frames (footnote 8). Department pages were used in
24,05% of the transactions, probably as starting points of user visits (footnote 9). Course
pages were visited in approximately 26% of the sessions (in this case the co-occurrence of the
two categories is negligible, approximately 0,5%). English course pages were almost twice as
popular as Dutch course pages. Dutch student pages are more popular than English ones. The
combined support of English and Dutch student pages is 23,09% based on the same calculation.
Footnote 7: The input data for the algorithm contain sessions with redundant elements removed
and types in ascending order. Trivial sessions that contain only one page are also stripped out.
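The combined figures quoted above follow from the inclusion-exclusion rule for two categories
A and B, written here in our own notation with supp for the support of an itemset:

```latex
\[
\mathrm{supp}(A \lor B) \;=\; \mathrm{supp}(\{A\}) + \mathrm{supp}(\{B\}) - \mathrm{supp}(\{A, B\})
\]
\[
\text{department pages: } 18{,}87\% + 9{,}59\% - 4{,}41\% = 24{,}05\%
\qquad
\text{student pages: } 16{,}54\% + 9{,}51\% - 2{,}96\% = 23{,}09\%
\]
```

The single supports are taken from table 14 and the co-occurrence supports from the two-itemsets
of all visits (rows 18 and 26 of table 15).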
Table 15 shows the selected frequent two-itemsets.
Frequent two-itemsets of all visits
nr | items (content-type labels and category names) | support
1 | (10) E/person/faculty/publication, (8) E/person/faculty | 19,44%
2 | (8) E/person/faculty, (6) E/department | 12,69%
3 | (8) E/person/faculty, (2) miscellaneous | 11,72%
4 | (8) E/person/faculty, (1) photo | 9,19%
5 | (10) E/person/faculty/publication, (2) miscellaneous | 9,12%
6 | (18) documents, (8) E/person/faculty | 8,26%
7 | (11) E/course, (8) E/person/faculty | 8,13%
8 | (8) E/person/faculty, (4) E/reference | 7,99%
9 | (13) D/person/student, (2) miscellaneous | 7,83%
10 | (18) documents, (10) E/person/faculty/publication | 7,63%
11 | (10) E/person/faculty/publication, (6) E/department | 7,27%
12 | (10) E/person/faculty/publication, (4) E/reference | 6,96%
13 | (11) E/course, (10) E/person/faculty/publication | 6,81%
14 | (13) D/person/student, (8) E/person/faculty | 6,43%
15 | (8) E/person/faculty, (7) E/project | 6,17%
16 | (10) E/person/faculty/publication, (7) E/project | 5,16%
17 | (9) E/person/student, (8) E/person/faculty | 4,61%
18 | (6) E/department, (3) D/department | 4,41%
19 | (13) D/person/student, (1) photo | 4,08%
20 | (8) E/person/faculty, (3) D/department | 3,84%
21 | (13) D/person/student, (12) D/course | 3,70%
22 | (7) E/project, (6) E/department | 3,26%
23 | (19) other documents, (11) E/course | 3,23%
24 | (10) E/person/faculty/publication, (3) D/department | 3,07%
25 | (19) other documents, (10) E/person/faculty/publication | 2,97%
26 | (13) D/person/student, (9) E/person/student | 2,96%
Table 15
We can set up some rough custom models based on the two-itemsets. 19,44% of the visits show
interest in information on faculty members and their research. Itemsets do not provide
sequential information, but presumably visits belonging to (1) consist of an entry page for a
faculty member followed by a publication page of that person. (2), (6), (10) and (24) may also
belong to this “custom group”. (2) and (24) suggest that such visits start on the department
pages and then go on to faculty member pages. Itemsets (6) and (10) show that many users
download scientific material from the pages of faculty members. (8), (12), (15) and (16) show
special interest in faculty member pages for project information and references. Itemsets that
contain the miscellaneous type indicate that pages are probably structured in framesets, such as
pages in content categories 8, 10 and 13 in itemsets (3), (5) and (9). Itemset (4) can be
interpreted as a primitive model for “free time” or “photo viewer” activities. It contains page
visits to photo galleries of faculty members. These galleries mostly contain personal photos
such as travel images. (19) also relates to this custom group, with the difference that it
contains student photo gallery pages. (7), (13) and (23) form a “study” custom group. Many
members of the scientific staff present all their professional information on a single web page.
In this case the content classifier algorithm will probably choose the content-type that refers
to the largest topic on the page. This presumably resulted in the strange combination of itemset
(13). (7) and (13) basically indicate the same thing: course pages (in English) are visited from
faculty member pages. In all likelihood this member is the teacher of the course or is strongly
related to it. (23) shows that 3,23% of the visits result in the download of course materials.
Footnote 8: Frame pages, as described earlier, mostly do not contain valuable information for
the content classifier algorithm.
Footnote 9: This result is the sum of the supports of the English and Dutch pages minus the
support of their co-occurrence; refer to table 15 of two-itemsets.
Frequent three-itemsets of all visits
nr | items (content-type labels and category names) | support
1 | (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department | 4,96%
2 | (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty | 4,21%
3 | (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project | 3,54%
4 | (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty | 3,05%
5 | (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference | 2,68%
Table 16
The table above contains the frequent three-itemsets. Itemsets (1), (2), (3) and (5) form the
previously described “faculty member” or “research” custom group. A possible scenario for a
user visit based on these sets is that a user starts the visit on the department pages, then
goes to a faculty member page and visits the member’s publication page. In the meantime the
user downloads materials from the member’s pages, and would also very probably visit project or
reference pages from faculty member pages. (4) represents the “study” custom group. Such visits
start from a faculty member’s page or publication page (which probably also has a mixed type of
content) and end on course pages related to the member.
Association rules of all visits
nr | premise | conclusion | confidence
1 | (7) E/project, (10) E/person/faculty/publication | (8) E/person/faculty | 68.7%
2 | (6) E/department, (10) E/person/faculty/publication | (8) E/person/faculty | 68.22%
3 | (6) E/department | (8) E/person/faculty | 67.26%
4 | (7) E/project, (8) E/person/faculty | (10) E/person/faculty/publication | 57.39%
5 | (10) E/person/faculty/publication, (18) documents | (8) E/person/faculty | 55.1%
6 | (10) E/person/faculty/publication | (8) E/person/faculty | 54.82%
7 | (18) documents | (8) E/person/faculty | 53.04%
8 | (8) E/person/faculty, (18) documents | (10) E/person/faculty/publication | 50.94%
Table 17
Association rules provide more information on frequent itemsets. Table 17 contains the rules
that have a confidence higher than 50%. Rule (1) indicates that if a user visits project and
publication pages, he will also visit faculty member pages with 68,7% confidence, and so on.
All the rules in the table belong to the “research” custom group. This underlines the
importance of this type of behaviour and indicates that it is the most significant among the
visiting behaviour types.
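The confidence of a rule is the support of premise and conclusion together divided by the
support of the premise alone. The short sketch below recomputes two of the rules from the
rounded supports published in tables 14 and 15; the function name is ours. Because the inputs
are rounded, the results agree with table 17 only to within a few hundredths of a percent.

```python
def confidence(support_both, support_premise):
    """conf(premise -> conclusion) = supp(premise and conclusion) / supp(premise)."""
    return support_both / support_premise

# Supports in percent, taken from tables 14 and 15.
rule_3 = confidence(12.69, 18.87)  # (6) E/department -> (8) E/person/faculty
rule_6 = confidence(19.44, 35.45)  # (10) E/person/faculty/publication -> (8) E/person/faculty

print(f"{rule_3:.2%}", f"{rule_6:.2%}")
```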
6.4.2 The analysis of the geographical groups
Table 18 shows the selected frequent one-itemsets of the “nl” and “other” geographical groups.
Frequent one-itemsets of the geographical groups
nr | items (content-type labels and category names) | support “nl” group | support “other” group
1 | (8) E/person/faculty | 44,57% | 53,34%
2 | (13) D/person/student | 36,13% | 5,94%
3 | (10) E/person/faculty/publication | 22,81% | 41,32%
4 | (1) photo | 20,17% | 11,58%
5 | (12) D/course | 19,86% | 3,84%
6 | (6) E/department | 17,90% | 19,81%
7 | (9) E/person/student | 12,69% | 7,51%
8 | (18) documents | 11,70% | 16,74%
9 | (7) E/project | 11,38% | 15,63%
10 | (11) E/course | 8,54% | 21,49%
11 | (4) E/reference | 8,35% | 18,63%
Table 18
The “research” behaviour type is significant in both groups, but the summed support values of
itemsets (1), (3), (8), (9) and (11) show that research pages have an almost 50% higher visit rate
within the “other” group: the summed supports of content categories 8, 10, 4, 18 and 7 are
145,66% for the “other” and 98,81% for the “nl” group. “Free time” behaviour is more frequent
in the “nl” group, judging by the support values of the student and photo categories. The
support of the photo category is 20,17% in the “nl” group but only 11,58% in the “other” group.
Student pages are also visited frequently within the “nl” group: the summed support of Dutch
and English student pages is 42,92% (after subtracting their co-occurrence), while the same
value in the “other” group is approximately 13,45% (their co-occurrence within this group is
negligible). Not surprisingly, the “study” behaviour type is also more frequent in the “nl” than
in the “other” group. Dutch and English course pages have a summed support of 28,40% in the
“nl” group, against 25,33% in the “other” group. For the “nl” group this indicates that many
students probably study at home and therefore visit course pages from there. The “other” group
contains very few Dutch course visits, which is the second most frequent category among the
“nl” visits, but has a surprisingly large number of visits to English course pages. This indicates
that English course pages contain useful information for foreign visitors.
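The summed supports quoted above can be reproduced with simple arithmetic on the values of Tables 18 and 19. The following is a minimal sketch; the dictionary and function names are illustrative assumptions and not part of the original analysis, and the 5,90% co-occurrence of Dutch and English student pages is the “nl” support of two-itemset (12) in Table 19.

```python
# Supports in percent, copied from Table 18; keys are content-type labels.
NL = {8: 44.57, 13: 36.13, 10: 22.81, 1: 20.17, 12: 19.86, 6: 17.90,
      9: 12.69, 18: 11.70, 7: 11.38, 11: 8.54, 4: 8.35}
OTHER = {8: 53.34, 13: 5.94, 10: 41.32, 1: 11.58, 12: 3.84, 6: 19.81,
         9: 7.51, 18: 16.74, 7: 15.63, 11: 21.49, 4: 18.63}

RESEARCH = [8, 10, 4, 18, 7]   # faculty, publication, reference, documents, project
COURSE = [12, 11]              # Dutch and English course pages


def summed_support(group, labels):
    """Plain sum of one-itemset supports (may exceed 100%)."""
    return round(sum(group[label] for label in labels), 2)


def union_support(group, a, b, joint):
    """Support of 'a or b' by inclusion-exclusion: s(a) + s(b) - s(a and b)."""
    return round(group[a] + group[b] - joint, 2)


print(summed_support(OTHER, RESEARCH))   # 145.66
print(summed_support(NL, RESEARCH))      # 98.81
print(union_support(NL, 13, 9, 5.90))    # 42.92
print(summed_support(NL, COURSE))        # 28.4
print(summed_support(OTHER, COURSE))     # 25.33
```

The ratio 145,66 / 98,81 is roughly 1,47, which is where the "almost 50% higher" visit rate comes from.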
Table 19 contains frequent two-itemsets of the geographical groups.
Frequent two-itemsets of the geographical groups

     items (content-type labels and category names)              support       support
                                                                 “nl” group    “other” group
 1   (10) E/person/faculty/publication, (8) E/person/faculty     14,17%        22,11%
 2   (13) D/person/student, (8) E/person/faculty                 12,04%         3,04%
 3   (8) E/person/faculty, (6) E/department                       9,42%        14,91%
 4   (13) D/person/student, (1) photo                             8,59%        -*
 5   (9) E/person/student, (8) E/person/faculty                   8,55%        -*
 6   (8) E/person/faculty, (7) E/project                          8,43%         5,31%
 7   (13) D/person/student, (12) D/course                         8,36%        -*
 8   (8) E/person/faculty, (1) photo                              8,24%         9,50%
 9   (10) E/person/faculty/publication, (6) E/department          7,02%         7,37%
10   (8) E/person/faculty, (4) E/reference                        6,17%         7,85%
11   (18) documents, (8) E/person/faculty                         5,91%         9,15%
12   (13) D/person/student, (9) E/person/student                  5,90%        -*
13   (12) D/course, (8) E/person/faculty                          5,60%        -*
14   (9) E/person/student, (1) photo                              4,24%        -*
15   (10) E/person/faculty/publication, (4) E/reference           4,17%         8,18%
16   (18) documents, (10) E/person/faculty/publication            4,11%         9,00%
17   (11) E/course, (8) E/person/faculty                          3,56%        10,43%
18   (11) E/course, (10) E/person/faculty/publication             3,33%         8,43%

* Not supported by the set support threshold value.
Table 19
The table above shows that the “other” group indeed contains mostly “official” visits, such as
itemsets (1), (3), (6), (9), (10), (11), (15) and (16). Visitors in this group most likely start on
English department pages, go on to faculty member pages and from there navigate to the
members’ publication pages. A large percentage of them also visit reference and project pages
by following links from faculty members’ pages. Many users within this group download
documents from faculty members. “Official” visits are also frequent in the “nl” group, but in
contrast with the “other” group it also contains a great number of “study” visits. Itemsets (7),
(13), (17) and (18) support the assumption that most of the “study” visits start on faculty
member pages and then go on to course pages. It is interesting to note that Dutch and English
pages do not mix within sessions. This is probably because the Vrije Universiteit provides
bachelor’s and master’s degrees: the official language of bachelor education is Dutch, while
most of the master’s courses are taught in English. The “other” group also contains a large
number of English course pages visited from faculty members’ pages. Such visits may be
generated by interested teachers and students from abroad. Itemsets (4), (8), (12) and (14) show
“free time” visits. (4) and (14) contain visits to Dutch and English student pages and to the
photo pages linked from them. (8) contains the same type of sessions for faculty member pages.
(2), (5) and (7) indicate a mixed activity of “free time” and other behaviour types.
Frequent three-itemsets of the geographical groups are presented in table 20.
Frequent three-itemsets of the geographical groups

     items (content-type labels and category names)                                support       support
                                                                                   “nl” group    “other” group
 1   (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department     4,49%         5,13%
 2   (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project        4,41%         3,25%
 3   (13) D/person/student, (8) E/person/faculty, (1) photo                        4,41%         -*
 4   (13) D/person/student, (9) E/person/student, (8) E/person/faculty             4,37%         -*
 5   (13) D/person/student, (9) E/person/student, (1) photo                        3,68%         -*
 6   (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference      3,30%         -*
 7   (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty       2,63%         4,93%
 8   (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty        -*            3,59%

* Not supported by the set support threshold value.
Table 20
(1) shows the “classic” research visit. Users start on department pages, navigate to faculty
members’ pages and to the members’ publication pages. (2), (6), (7) and probably (8) also
belong to the research custom. (3), (4) and (5) present mostly “free time” visits. The “study”
behaviour type is missing from the three-itemsets. An explanation can be that students of the
University know the URLs of study pages exactly and go directly there instead of starting from
department pages and following through the links.
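The support values in the tables of this section are fractions of sessions that contain all items of an itemset. The following is a minimal sketch of that computation on toy data. It enumerates itemsets by brute force rather than with the association rules algorithm used in the thesis, and the function name and example sessions are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(sessions, k, min_support):
    """Return {itemset: support} for all k-itemsets reaching min_support.

    The support of an itemset is the fraction of sessions that contain
    every one of its items; order and repeated visits are ignored.
    """
    counts = Counter()
    for session in sessions:
        items = sorted(set(session))
        counts.update(combinations(items, k))
    total = len(sessions)
    return {itemset: count / total
            for itemset, count in counts.items()
            if count / total >= min_support}


# Hypothetical sessions given as sequences of content-type labels.
toy_sessions = [[6, 8, 10], [8, 10], [13, 1], [8, 8, 10, 18]]
print(itemset_supports(toy_sessions, k=2, min_support=0.5))
# {(8, 10): 0.75}
```

Itemsets marked with -* in the tables are those that a filter like `min_support` would drop for the group in question.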
6.4.3 The analysis of the organizational groups
Table 21 shows the frequent one-itemsets of the “staff” and “student” organizational groups.
Frequent one-itemsets of the organizational groups

     items (content-type labels        support          support
     and category names)               “staff” group    “student” group
 1   (8) E/person/faculty              68,36%           59,41%
 2   (10) E/person/faculty/publication 31,75%           36,88%
 3   (3) D/department                  31,23%           16,31%
 4   (6) E/department                  26,58%           18,61%
 5   (13) D/person/student             22,13%           16,38%
 6   (1) photo                         16,75%            5,88%
 7   (18) documents                    15,62%           18,40%
 8   (12) D/course                     13,34%           16,59%
 9   (7) E/project                     12,82%           41,50%
10   (4) E/reference                    7,65%           19,03%
11   (9) E/person/student               6,83%            6,16%
12   (11) E/course                     -*               10,36%

* Not supported by the set support threshold value.
Table 21
One would expect larger differences between the support values of the “staff” and “student”
organizational groups than those presented in table 21. One may think that categories like
student, photo and course pages are visited at a significantly higher rate in the “student” than
in the “staff” group. The opposite can be observed for itemsets (5), (6) and (11). This shows
that, for some reason, teachers are more interested in student and photo pages than students
are. A possible explanation for this phenomenon is that Ph.D. students within the “staff” group
visit the pages of their fellow students. The table also shows that both groups are interested in
“research” pages and that visits of the “staff” group to English course pages do not reach the
support threshold. (9) and (10) show that students are much more interested in project and
reference pages than teachers. Two- and three-itemsets will probably provide more information
on the aforementioned discrepancies.
Frequent two-itemsets are presented in the table below.
Frequent two-itemsets for the organizational groups

     items (content-type labels and category names)              support          support
                                                                 “staff” group    “student” group
 1   (10) E/person/faculty/publication, (8) E/person/faculty     22,34%           27,08%
 2   (8) E/person/faculty, (3) D/department                      19,23%            5,95%
 3   (8) E/person/faculty, (6) E/department                      14,99%            9,45%
 4   (13) D/person/student, (8) E/person/faculty                 14,89%            5,11%
 5   (8) E/person/faculty, (1) photo                             10,55%           -*
 6   (18) documents, (8) E/person/faculty                        10,55%            7,07%
 7   (10) E/person/faculty/publication, (6) E/department          9,72%            7,84%
 8   (8) E/person/faculty, (7) E/project                          8,79%           37,16%
 9   (8) E/person/faculty, (4) E/reference                        5,89%           18,05%
10   (7) E/project, (4) E/reference                              -*               15,75%

* Not supported by the set support threshold value.
Table 22
(2) and (3) show that the “research” behaviour type is more common within the “staff” group
than in the “student” group. Teachers may look for contact information (e.g., telephone number
or email address) of their colleagues via faculty member pages. It is interesting that photo
galleries of faculty members (5) are only popular among teachers. (8) and (9) indicate that
project and reference pages mostly contain study material for students. These pages probably
cover useful information for course assignments that students have to do in groups. This would
explain why a great proportion of students use the University’s infrastructure to visit such pages.
Table 23 contains frequent three-itemsets of the organizational groups:
Frequent three-itemsets of the organizational groups

     items (content-type labels and category names)                                support          support
                                                                                   “staff” group    “student” group
 1   (8) E/person/faculty, (6) E/department, (3) D/department                      6,83%             2,52%
 2   (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department     6,10%             4,48%
 3   (13) D/person/student, (8) E/person/faculty, (10) E/person/faculty/publication 5,79%           -*
 4   (10) E/person/faculty/publication, (8) E/person/faculty, (3) D/department     5,17%            -*
 5   (10) E/person/faculty/publication, (6) E/department, (3) D/department         5,07%            -*
 6   (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project        4,45%            18,61%
 7   (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty       3,72%             3,43%
 8   (8) E/person/faculty, (7) E/project, (4) E/reference                          -*               15,68%
 9   (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference      -*               12,32%
10   (10) E/person/faculty/publication, (7) E/project, (4) E/reference             -*               11,41%
11   (12) D/course, (8) E/person/faculty, (6) E/department                         -*                2,52%

* Not supported by the set support threshold value.
Table 23
(2) and (4) show the “classic” “research” visit. Teachers tend to visit faculty pages in a sequence
of department, faculty member and faculty member’s publication pages. Itemsets (1) and (5)
indicate that in most cases users change the language of department pages within their sessions.
The “study” behaviour type is popular among student visits. They consist of pages in sequence
of faculty member, publication and project or reference categories, such as (6), (8), (9) and (10).
6.4.4 Conclusion
The study of frequent itemsets indicated that the most significant behaviour type is “research” in
almost every user group. Research visits make up more than 50% of all sessions. Among the
geographical groups “other” also has more than 50% of support for research pages, while among
the organizational groups “staff” has the higher visit rate for this type of pages. The “study”
behaviour has a relatively low base among all visits. However, the geographical “nl” and the
organizational “student” groups have a high visit rate for the “study” custom. The “free time”
behaviour type has a base of approximately 20% within all sessions. This high visit rate is also
typical of the “nl” geographical group but is not significantly apparent among sessions
belonging to the organizational groups. However, the “staff” group has a relatively large visit
rate for photo galleries of faculty members.
6.5 The mixture model
We drew the basic inferences from the frequent itemsets in the previous section. In this section
we try to refine the established custom characteristics with an analysis of mixture models (MM)
for each session group. The mixture model implementation used for modelling session data was
developed in a different project [17].
A mixture model can be viewed as a clustering of all the users. Each cluster is characterized by
a vector of frequencies (thetas) with which members of the cluster visit specific pages. These
frequencies can be visualized in a bar chart, a kind of "group profile". Additionally, the
parameter alpha can be interpreted as the cluster size. To interpret the charts in the following
sections it is necessary to look at the legend (refer to table 6: Description of the content-types in
section 6.1.2 Mapping table).
In our experiments we ran the MM algorithm with 10 different settings for the number of
mixture components (from 1 up to 10 components) to build models on each data set. We set the
algorithm to iterate through the model building process 10 times for each setting and to choose
the most probable model for each number of components. We use log-probability scores (“logp
scores”) to evaluate the predictive power of the models. Logp scores are calculated based on the
formula of Notion 5.5 (in chapter 5) transformed to the logarithm of the expression. Higher logp
scores mean that the model is evaluated to be more probable on the dataset. In most cases we
include only the figure of the most probable mixture model in the thesis.
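The procedure described above (fit a mixture for each number of components, restart several times, keep the model with the highest logp score) can be sketched as follows. This is not the implementation of [17], whose details are not given here: the sketch assumes a mixture of multinomials over per-session category counts fitted with EM, adds a small smoothing constant `eps`, and omits the multinomial coefficient from the score because it does not depend on the parameters. All function names and the toy data are hypothetical.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_joint(counts, alpha, theta):
    """log(alpha_j) + sum_c counts[c] * log(theta[j][c]) for each component j."""
    return [math.log(a) + sum(n * math.log(t) for n, t in zip(counts, row))
            for a, row in zip(alpha, theta)]


def logp(data, alpha, theta):
    """Log-probability score of the data under the mixture."""
    return sum(log_sum_exp(log_joint(x, alpha, theta)) for x in data)


def fit_mixture(data, k, n_iter=50, seed=0, eps=1e-3):
    """EM for a k-component mixture; returns (alpha, theta, logp score)."""
    rng = random.Random(seed)
    m = len(data[0])
    alpha = [1.0 / k] * k
    theta = []
    for _ in range(k):
        row = [rng.random() + eps for _ in range(m)]
        theta.append([v / sum(row) for v in row])
    for _ in range(n_iter):
        # E-step: responsibility of every component for every session.
        resp = []
        for x in data:
            lj = log_joint(x, alpha, theta)
            norm = log_sum_exp(lj)
            resp.append([math.exp(v - norm) for v in lj])
        # M-step: re-estimate cluster sizes (alpha) and profiles (theta).
        for j in range(k):
            weight = sum(r[j] for r in resp)
            alpha[j] = (weight + eps) / (len(data) + k * eps)
            row = [eps + sum(r[j] * x[c] for r, x in zip(resp, data))
                   for c in range(m)]
            theta[j] = [v / sum(row) for v in row]
    return alpha, theta, logp(data, alpha, theta)


def best_of_restarts(data, k, restarts=10):
    """Fit the model several times and keep the most probable one."""
    fits = (fit_mixture(data, k, seed=s) for s in range(restarts))
    return max(fits, key=lambda fit: fit[2])


# Hypothetical sessions as visit counts over three content categories.
toy = [[3, 1, 0], [4, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 4], [0, 0, 2]]
for k in (1, 2):
    alpha, theta, score = best_of_restarts(toy, k)
    print(k, round(score, 2), [round(a, 2) for a in alpha])
```

In this sketch `alpha` plays the role of the cluster sizes and each row of `theta` is one "group profile" of the kind shown in the bar charts.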
6.5.1 The analysis of all visits
The figure below presents the logp scores of all the 10 mixture components settings.
[Plot: LogLikelihood(#iterations, #clusters). Horizontal axis: number of iterations (1 to 10);
vertical axis: LogL (x 10^5, from -10.5 to -7.5); one curve per number of clusters (1 to 10).]
Figure 6: Logp scores of the 10 mixture component settings of all visits
The mixture model with the trivial one component shows only data statistics similar to frequent
one-itemsets which we already discussed in the previous section. We choose the maximal
number of mixture components heuristically. The logp scores for models with a number of
components higher than 6 or 7 tend to have the same characteristics and are “close” to each
other. Therefore we chose the most probable model by the analysis of 2 to 7 components
models.
[Bar charts of the two mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -952781.3177; component sizes: α = 0.58 (first component), α = 0.42 (second component).]
Figure 7: Two-component mixture model of all visits
The histograms in the mixture model above present clusters of similar users. Alphas are their
sizes and the levels of histogram values present “interests” of members of these clusters. The
first component of figure 7 shows the “research” and “study” behaviour types and the second
presents a mixture of “free time” and “study” activities. These mixtures within the base
components indicate that the number of components is probably higher than two.
The analysis of all figures resulted in choosing the model with six mixture components as the
most probable (figure 8). The first component refers to a “research” behaviour and has a very
high (0,27) probability. The second component also has a high probability and shows a “study”
behaviour type with visits to faculty member, publication, reference and course pages and
downloads of course materials. The third component refers to the “student page visit” custom
with visits to English and Dutch student pages. It also has some visits to Dutch course and
faculty member pages. Component four appears in almost all mixture models of all visits that
have more than two components, regardless of the number of components, and presents no
interpretable information, given that the miscellaneous category mostly refers to frameset or
empty pages. Component five refers to a “determined research downloads” behaviour, which
means that users know exactly the URL of the material they want to download. The last
component presents a “free time” visit model in which users visit photo galleries. The visited
photos belong mostly to faculty members.
[Bar charts of the six mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -830947.0322; component sizes from the first to the sixth component:
α = 0.27, 0.24, 0.21, 0.15, 0.1, 0.031.]
Figure 8: Six-component mixture model of all visits
6.5.2 The analysis of the geographical groups
We also chose six as the most probable number of components in the case of the geographical
groups. The first component of the mixture model of the “nl” geographical group in figure 9
presents the “free time” behaviour of visiting Dutch student pages, with a probability of 0.28.
The second most probable component (nr. 2) refers to “study” visits that probably start on
Dutch department pages, go on to Dutch course pages, and finally download course materials.
The “research” component (nr. 3) also has a high probability and refers to the “classical”
sequence of department pages, faculty member pages, and members’ publication pages. In
component four miscellaneous pages are combined with department, faculty member, and
student pages. This could mean that the structures of these pages are based on frames, but no
visiting characteristics can be observed. The “determined research downloads” habit is
presented in component five and accounts for 7,6% of all sessions. The last component is a
mixture of “free time” visits. It contains visits to faculty members’ and students’ photo pages as
well as to activity pages.
[Bar charts of the six mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -240571.6526; component sizes from the first to the sixth component:
α = 0.28, 0.22, 0.21, 0.16, 0.076, 0.054.]
Figure 9: Six-component mixture model of the geographical “nl” group
The probability of the “research” behaviour is much higher within the geographical “other”
group than in the “nl” model. The first and second components of figure 10 together account
for more than 50% of probability for research pages. The first component can also model the
“study” custom for foreign students. Component three models the interest in student pages. This
component also has a relatively large probability. “Determined research downloads” account for
approximately 10% of the sessions in this group. The “photo viewing” habit is presented with a
low probability in the last component.
[Bar charts of the six mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -354766.0415; component sizes from the first to the sixth component:
α = 0.27, 0.25, 0.17, 0.16, 0.1, 0.042.]
Figure 10: Six-component mixture model of the geographical “other” group
6.5.3 The analysis of the organizational groups
Figure 11 shows the six-component mixture model of the organizational “staff” group. The high
presence of English and Dutch department pages in the first component (from the top) without
any other significant category may imply that the web browsers on staff members’ machines
are set to show department pages as start pages. Component two also shows such a habit, with
the difference that teachers may set their own home page as start page, with a probability of
0,29. The third component shows interest in faculty member and research pages. Teachers may
look for colleagues’ contact information. Component four refers to a “determined student page
visits” behaviour and five to a “determined download” habit. The last component also shows
that photo pages are visited by direct requests for the pages, but the “photo viewing” behaviour
is not popular within this group. Most of the components don’t contain department pages. This
indicates that most of the users within this group know the URLs of the required resources.
[Bar charts of the six mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -19836.6606; component sizes from the first to the sixth component:
α = 0.32, 0.29, 0.18, 0.089, 0.08, 0.039.]
Figure 11: Six-component mixture model of the organizational “staff” group
Components one and five most probably refer to a “study” habit in the six-component mixture
model of the “student” group (figure 12). The third component implies the classic “research”
sequence. Component four represents the “free time” visit behaviour for Dutch student page
visits. This group also contains the “determined download” habit represented by component
five. The last component also shows some kind of “free time” activity with visits to activity
pages and possible downloads of registration forms for free time events.
[Bar charts of the six mixture components over the content-type labels (horizontal axis, 0 to 20).
LogL = -26581.5718; component sizes from the first to the sixth component:
α = 0.25, 0.23, 0.19, 0.15, 0.11, 0.068.]
Figure 12: Six-component mixture model of the organizational “student” group
6.5.4 Conclusion
The results of the mixture model analysis show the same major characteristics as the results of
frequent itemset mining. The “research” behaviour type is the most probable visit activity
among all visits, followed by the “study” and “free time” habits. The geographical “nl” group
contains more “free time” visits, while the “other” group has a higher visit rate for “research”
pages. Sessions of the “staff” group are more likely to consist of research or start-up
(department) pages, whilst the “student” group contains more visits of behaviour types like
“study”, “research” and “free time”.
6.6 The global tree model
In contrast with the previous models, the global tree model (GTM) is based on the sequential
information of sessions (in terms of consecutive page visits). The tree model provides frequent
navigational paths and a tree-like visualization of relevant patterns.
The analysis of all raw sessions of the user groups would result in large, plain and only slightly
informative trees. Since we want to analyse “complex” user navigational paths we strip out
one-length sessions. One-length sessions are generated mostly by users following links from
search result pages, by start page settings of web clients and by direct visits. Either way these
items shift the whole characteristics of user behaviours. We also eliminate consecutive
redundant elements within sessions (e.g., we analyse the sequence 6 8 10 instead of 6 8 8 10).
This transformation gathers up all sessions with the same characteristics and preserves the
ordering information. In the following experiments we include only partial trees or trees
referring to sessions with a relatively high support rate. Furthermore we refer to the “total”
trees of each group in APPENDIX C4 to C8. The CD-ROM contains additional tree visualization
figures in high resolution.
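The two session transformations described above can be sketched as follows. The function names are hypothetical, and the order of the two steps is an assumption: repeats are collapsed first, so that a session such as 8 8 is also treated as a one-length session and removed.

```python
from itertools import groupby


def collapse_repeats(session):
    """Remove consecutive duplicates, e.g. [6, 8, 8, 10] -> [6, 8, 10]."""
    return [page for page, _ in groupby(session)]


def prepare_sessions(sessions, min_length=2):
    """Collapse repeats, then drop sessions shorter than min_length."""
    collapsed = (collapse_repeats(s) for s in sessions)
    return [s for s in collapsed if len(s) >= min_length]


raw = [[6, 8, 8, 10], [8], [8, 8], [13, 2, 13]]
print(prepare_sessions(raw))
# [[6, 8, 10], [13, 2, 13]]
```

Only consecutive repeats are removed, so a session that returns to an earlier page (13 2 13) keeps its full ordering.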
6.6.1 The analysis of all visits
Figure 13 refers to the tree visualization of all visits by 3% of support threshold:
Figure 13: The tree model of all visits (3% of support threshold)
This partial tree shows that “research” is the most important behaviour type among all visits.
29% of the sessions start with faculty member pages and go on to publication pages. Table 24
(and the figure in APPENDIX C4) shows that, surprisingly, only a relatively low number of
sessions start on the department pages. Most of the users go directly to faculty members’ pages
and browse members’ publication pages from there. If a user starts from department pages he
mostly continues on faculty member pages, as shown in session type (8). It is interesting that
16% of the sessions that start on faculty members’ pages go on to the department pages. This
is the opposite of what one might expect. A relatively large proportion of sessions start directly
on publication pages. 17% of these sessions end with the downloading of documents, whilst 19%
of them end with a visit to reference pages.
Frequent sessions of all visits by 1% of support threshold

     session                                                          frequency   percentage
 1   (8) E/person/faculty, (10) E/person/faculty/publication          1344        3%
 2   (2) miscellaneous, (7) E/project, (2) miscellaneous,              887        2%
     (7) E/project, (2) miscellaneous, (7) E/project,
     (2) miscellaneous, (7) E/project
 3   (8) E/person/faculty, (6) E/department                            705        2%
 4   (8) E/person/faculty, (4) E/reference                             600        1%
 5   (8) E/person/faculty, (11) E/course                               582        1%
 6   (8) E/person/faculty, (1) photo                                   571        1%
 7   (10) E/person/faculty/publication, (18) documents                 505        1%
 8   (6) E/department, (8) E/person/faculty                            478        1%
 9   (13) D/person/student, (2) miscellaneous, (13) D/person/student   454        1%
10   (8) E/person/faculty, (18) documents                              444        1%
11   (10) E/person/faculty/publication, (4) E/reference                439        1%
12   (11) E/course, (19) other documents                               397        1%

Table 24
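The percentages read from the tree (share of all sessions that follow a path, and share of the sessions on a path that continue to a given next page) can be illustrated with a simple prefix count. This is only a simplified stand-in for the global tree model defined earlier in the thesis; the function names and the toy sessions are assumptions made for illustration.

```python
from collections import Counter


def prefix_counts(sessions):
    """Count, for every prefix, how many sessions start with it.

    The empty prefix () is counted too, so counts[()] == len(sessions).
    """
    counts = Counter()
    for session in sessions:
        for end in range(len(session) + 1):
            counts[tuple(session[:end])] += 1
    return counts


def frequent_paths(counts, min_support):
    """Non-empty prefixes followed by at least min_support of all sessions."""
    total = counts[()]
    return {path: count / total
            for path, count in counts.items()
            if path and count / total >= min_support}


def continuation_share(counts, path):
    """Share of the sessions following path[:-1] that continue to path[-1]."""
    return counts[path] / counts[path[:-1]]


# Hypothetical, already preprocessed sessions of content-type labels.
toy = [(8, 10), (8, 10, 18), (8, 6), (6, 8), (10, 18)]
counts = prefix_counts(toy)
print(frequent_paths(counts, min_support=0.4))
# {(8,): 0.6, (8, 10): 0.4}
print(continuation_share(counts, (8, 6)))
```

In this toy example one third of the sessions that start on label 8 continue to label 6, which is the kind of conditional percentage quoted for the branches of the tree.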
6.6.2 The analysis of the geographical groups
Figures 14 and 15 (and the figures in APPENDIX C5 and C6) contain the most frequent
navigational paths of the geographical groups. As we stated earlier, users within the “other”
group tend to visit “research” pages more frequently than users within the “nl” group. 34% of
their visits start on faculty member pages and the majority of these go on to publication pages.
Some of these visits end with the downloading of documents or with a return to member pages.
The “nl” group also contains a large proportion (18%) of “research” pages. However, the
corresponding branch in the “nl” tree is a mixture of contents: it contains student and photo
pages as well. 11% of the visitors in the “other” group use faculty member pages to visit
E/course materials. No such behaviour can be observed in the “nl” group; on the contrary, the
“nl” group doesn’t contain E/course pages at all among its frequent navigational paths. 13% of
the visitors of faculty member pages go on to see photo pages of the members, twice the
proportion found in the “nl” group. Visitors in the “other” group use the department pages
more frequently. Most of these visits are likely to go on to faculty members’ pages in both the
“other” (55% of them) and the “nl” (26% of them) groups. In the “other” group, publication and
project pages are also frequent destinations from the department pages. The “nl” group tends to
have more “free time” visits than the “other” group. 19% of the visits in the “nl” group contain
student pages and 20% of them include photo galleries. The “study” behaviour type does not
appear as an individual (sub)branch of the “nl” tree; instead, study pages are spread around the
tree.
Figure 14: The tree model of the “nl” group
Figure 15: The tree model of the “other” group
6.6.3 The analysis of the organizational groups
Figures 16 and 17 (and the figures in APPENDIX C7 and C8) contain the most frequent
navigational paths of the organizational groups. Visits of the “staff” users start mostly on
faculty member pages. They then navigate to publication pages, download materials or simply
go to department pages. The most relevant session structure of the “student” group starts on
miscellaneous pages, then goes to faculty member pages followed by project pages and finally
ends either on publication or on reference pages. Reference, project, faculty member, and
publication pages are spread over the whole “student” tree, mixed with other components. Both
the “staff” and the “student” trees are mixtures: they don’t contain clear user behaviour types,
whereas the trees of the geographical groups and of all sessions do. The reason is probably that
the organizational groups contain far fewer sessions.
Figure 16: The tree model of the “staff” group
Figure 17: The tree model of the “student” group
6.6.4 The similarity of tree models
Table 25 contains the similarity measures for all tree model pairs of all groups. We “equalized”
all the session data pairs before measuring them. This means that we randomly stripped out
sessions from the greater data set to make the number of sessions equal in all pairs. The
diagonal from the upper left to the lower right corner contains 100% similarity since they refer
to similarities of the same groups. The similarity matrix is symmetrical because of the
commutative property of the similarity measure.
Similarity measures for tree models of all user groups

group                       all        geog. –    geog. –    org. –     org. –
                                       nl         other      staff      student
all                         100%       40.36%     70.46%     20.29%     21.11%
geographical – nl           40.36%     100%       27.76%     20.08%     27.78%
geographical – other        70.46%     27.76%     100%       19.18%     16.59%
organizational – staff      20.29%     20.08%     19.18%     100%       23.75%
organizational – student    21.11%     27.78%     16.59%     23.75%     100%

Table 25
According to these figures the “other” group is the most similar to the “all sessions” group. This
is not surprising since the “other” group makes up the largest part of all sessions. Comparing
the “nl” and “other” groups results in 27,76% of similarity, while comparing the “staff” and
“student” groups yields 23,75% of similarity.
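The “equalizing” step and the symmetric layout of the similarity matrix in this section can be sketched as follows. The thesis uses its own tree similarity measure, which is not reproduced here; `jaccard_sessions` below is only a placeholder so that the sketch runs, and all names and data are hypothetical.

```python
import random


def equalize(sessions_a, sessions_b, seed=0):
    """Randomly drop sessions from the larger set so both have equal size."""
    rng = random.Random(seed)
    size = min(len(sessions_a), len(sessions_b))
    return rng.sample(sessions_a, size), rng.sample(sessions_b, size)


def jaccard_sessions(sessions_a, sessions_b):
    """Placeholder measure: shared distinct sessions / all distinct sessions."""
    a = {tuple(s) for s in sessions_a}
    b = {tuple(s) for s in sessions_b}
    return len(a & b) / len(a | b) if (a | b) else 1.0


def similarity_matrix(groups, similarity=jaccard_sessions, seed=0):
    """Pairwise similarities; each pair is equalized before it is measured."""
    names = list(groups)
    matrix = {name: {} for name in names}
    for i, x in enumerate(names):
        matrix[x][x] = 1.0  # a group is identical to itself
        for y in names[i + 1:]:
            a, b = equalize(groups[x], groups[y], seed)
            matrix[x][y] = matrix[y][x] = similarity(a, b)
    return matrix


groups = {
    "nl": [[8, 10], [13, 1], [12, 19], [13, 1]],
    "other": [[8, 10], [6, 8]],
}
print(similarity_matrix(groups))
```

Each pair is measured once and written to both cells, which reflects the symmetry of the matrix; the diagonal is fixed at full similarity.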
6.6.5 Conclusion
The analysis of the tree models mostly confirms our preliminary assumptions (made in the AR
and MM sections; for the details refer to sections 6.4 and 6.5) about the ordering of pages in
typical sessions. However, in some cases it turned out that the expected orders are not realistic.
In most of the groups the subsequence of faculty member pages followed by department pages
is more frequent than that of department pages followed by faculty member pages. One would
expect the opposite.
7 Conclusion and future work
In our work we have presented a methodology for web usage mining. We discussed data
preprocessing and data enrichment processes of access log entries of web servers. Data
enrichment is about integrating content types of documents with access log entries. The
enriched data is structured into user navigational sequences in the next step. With the
application of geographical and organizational data we have set up user groups for users and
their related sessions. We presented three data mining models for exploring user behaviours
among groups of users: the association rules mining algorithm was used to explore frequent
itemsets and rules on them. The mixture model presented a clustering of users by similar
collections of pages they visit. Thirdly the global tree model was proposed for mining frequent
navigational paths with the preservation of sequential information of page visits. Visualization
of the tree models facilitates human perception and in this manner helped to obtain the most
important patterns.
Finally we applied all the discussed techniques to the web site of the Computer Science
Department of Vrije Universiteit, The Netherlands (www.cs.vu.nl domain).
We discovered three significant types of user behaviours analysing the experimental results.
These types are “research”, “study” and “free time”. Sessions belonging to the “research”
behaviour type consist mostly of faculty members’ pages, their publication pages, reference and
project pages. They include department pages for navigations and downloads of (scientific)
documents. The “study” custom mostly refers to Dutch and English course pages but also
contains reference and project pages in large numbers. The “free time” visits consist mostly of
photo pages, activity pages and Dutch and English student pages. Other minor behaviour types
are described within the analysis of the models.
In general the “research” custom is by far the most popular among all sessions and among most
of the session groups. Study pages are not as popular as research pages but have a significant
base within all sessions. The “free time” habit is the least popular among the behaviour types
but it still has a relatively large support among all sessions.
We categorized all the user sessions into the four subcategories of geographical and
organizational group categories. Geographical categories are the “nl” and the “other” groups,
where “nl” refers to user sessions related to users from the Netherlands and “other” consists of
sessions for users from all the other countries. Organizational categories are the “staff” and
“student” categories referring to the sessions of the staff and student users of the university.
The “research” custom is the most frequent among users of the “other” group. Approximately
half of the sessions within this group relate to this custom. “Staff” has more research sessions
within organizational groups. The “study” custom is the most frequent in the geographical “nl”
and the organizational “student” groups. The “other” group also contains a large number of
visits for English course pages. Free time visits are the most popular among “nl” sessions and
have a significant base within “student” visits. The “staff” group also contains a significant
proportion of sessions containing pages of faculty members’ photos.
Surprisingly, department pages are infrequent among the starting pages of user sessions. This
indicates that most of the users don’t use department pages for page lookups. However, a
popular scenario of user visits is to start the visiting sequence on faculty member pages and then
go on to department pages to navigate to the next destination. Another conclusion of the
results is that course pages are not so popular among students. This may indicate that students
visit course pages mostly from home. However, a significant proportion of students visit
reference and project pages from the labs of the University. This may imply that they mostly use
the facilities of the University for solving group assignments. It should be noted that the Vrije
Universiteit has a dedicated Intranet system (the Blackboard) for managing all the information
about courses. Although most of the courses have informational pages within the VU-pages,
this system may provide extra information for users. User visits to the Blackboard system were
not tracked within this project. Another important pattern is that users from abroad tend to visit
English course pages. These sessions can be generated either by students looking for course
materials for their studies or by foreign teachers reading up on course information.
The analysis covers only a short period (one month: June) and the observed patterns certainly
change over time. This may explain some “extreme” patterns found by the data analysis. To
avoid “periodical” patterns it would be interesting to perform the data analysis automatically,
e.g., once a month.
We tried to develop algorithms that are as accurate as possible, but some internal and external
limitations influence, or are influenced by, the experimental results. Web logs of public web
domains provide insufficient information on users. Some of the identified users and their
sessions may contain incorrect data despite the heuristics applied in the identification
process. The accuracy can only be improved by using cookies or other external identification
techniques (refer to section 6.3 Data structuring). The problem does not arise when analysing
an Intranet application that requires login, because users are then identified automatically.
Another problem is the high number of mapping errors, which occur either because some requests
refer to a deeper level in the HTML page structure than the depth that was set, or because the
requested pages were removed in the meantime. The number of mapping errors can therefore partly
be reduced by downloading the VU-pages to deeper levels. The exponential growth in the number of
pages would overload the content classifier algorithm, though. A much more sophisticated
solution would be to build a separate content retrieval system in which every page of a website
has at least one entry consisting of a URL, a content type and a timestamp. Each time a page
changes, a new entry would record its new content type (if it differs from the previous one).
During the analysis of access log files, the timestamp of each log entry would be compared with
the timestamps of the content entries and the suitable content label would be chosen.
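The proposed content retrieval system can be sketched as a per-URL history of (timestamp, content type) entries that is queried with the timestamp of a log entry. This is only an illustration of the idea: the class and method names are hypothetical, the use of a TreeMap is our assumption, and no such system was built in this project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: history of content types per URL, keyed by change time. */
final class ContentHistory {
    // URL -> (timestamp in epoch seconds -> content type valid from that moment on)
    private final Map<String, TreeMap<Long, Integer>> history = new HashMap<>();

    /**
     * Records a change. Assumes changes arrive in chronological order and stores
     * a new entry only if the content type differs from the latest known one.
     */
    void recordChange(String url, long timestamp, int contentType) {
        TreeMap<Long, Integer> entries = history.computeIfAbsent(url, k -> new TreeMap<>());
        Map.Entry<Long, Integer> latest = entries.lastEntry();
        if (latest == null || latest.getValue() != contentType) {
            entries.put(timestamp, contentType);
        }
    }

    /** Returns the content type valid at requestTime, or -1 if none is known. */
    int labelAt(String url, long requestTime) {
        TreeMap<Long, Integer> entries = history.get(url);
        if (entries == null) {
            return -1; // unknown URL: would be counted as a mapping error
        }
        // floorEntry finds the most recent entry at or before the request time
        Map.Entry<Long, Integer> entry = entries.floorEntry(requestTime);
        return entry == null ? -1 : entry.getValue();
    }
}
```

With such a structure, a request logged after a page was relabelled receives the new label, while older requests keep the label that was valid when they were made.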
The accuracy of content labelling is the most critical part of the whole process. The current
average accuracy of 74% for content categories supports the reliability of the major user
characteristics that were observed. However, increasing the accuracy would reduce the “noise”
in the data and make the experiments even more reliable.
The web usage mining system proposed in this thesis was built primarily for “static” analysis.
This means that the access log files, the VU-pages and all the other input data were evaluated
“offline”, independently of the time they were generated. A potential improvement of the system
would be to process access log entries and to attach labels to the requested documents “online”,
at the moment they are generated. This would allow us to analyse systems dynamically.
I will continue my research within the DIANA project (http://www.cs.vu.nl/ci/DataMine/
DIANA/), focusing on the real time analysis of dynamic systems and on the development of
new, adaptive algorithms for mining data streams.
Acknowledgements
I wish to express sincere appreciation to Dr. Wojtek Kowalczyk for his assistance and insight
throughout the development of this project. I would also like to express sincere thanks to Dr.
Elena Marchiori for her valuable advice and feedback.
In addition, special thanks to Dr. Frits Daalmans, for technical and non-technical advice,
support, and editing. I also thank Krisztián Balog for the fruitful cooperation during the
project and for providing the content label data for the VU-pages.
I thank Dr. Elisabeth Hornung, my Mom, for her advice and suggestions, and many thanks to
my family for supporting me during the one year in the Netherlands. I would also like to say a
special thanks to my lovely girlfriend Maya, for her patience and for pushing me over the final
finishing line.
I would like to express my special thanks to the Vrije Universiteit, Amsterdam for the
opportunity to participate in the one year International Master Program.
Bibliography
1. Agrawal, R., Imielinski, T., and Swami, A. (1993), Mining association rules between sets of
items in large databases. In Proceedings of the ACM SIGMOD Conference on
Management of Data, pages 207–216
2. Baglioni, M., Ferrara, U., Romei, A., Ruggieri, S., and Turini, F. (2003), Preprocessing and
Mining Web Log Data for Web Personalization. 8th Italian Conf. on Artificial Intelligence
vol. 2829 of LNCS, p.237-249
3. Balog, K. (2004), An Intelligent Support System for Developing Text Classifiers. MSc.
Thesis, Vrije Universiteit of Amsterdam, The Netherlands
4. Cadez, I. V., Heckerman, D., Meek, C., Smyth, P., and White, S. (2003), Model-Based
Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and
Knowledge Discovery, vol.7 n.4, p.399-424
5. Cadez, I.V., Smyth, P., Ip, E., and Mannila, H. (2001), Predictive Profiles for Transaction
Data using Finite Mixture Models. Technical Report No. 01–67, Information and
Computer Science Department, University of California, Irvine
6. Chen, Z., Fu, A., and Tong, F. (2002), Optimal Algorithms for Finding User Access Sessions
from Very Large Web Logs. Proceedings of the Sixth Pacific-Asia Conference on
Knowledge Discovery and Data Mining, (PAKDD), Taipei
7. Chen, M., Park, J.S., and Yu, P.S. (1998), Efficient data mining for path traversal patterns.
IEEE Transactions on Knowledge and Data Engineering, vol.10 n.2, p.209-221
8. Chevalier, K., Bothorel, C., and Corruble, V. (2003), Discovering rich navigation patterns on
a web site. Proceedings of the 6th International Conference on Discovery Science
Hokkaido University Conference Hall, Sapporo, Japan
9. Cho, Y.H., and Kim, J.K. (2004), Application of Web usage mining and product taxonomy to
collaborative recommendations in e-commerce. Expert Systems with Applications vol.26,
p.233–246
10. Cho, Y.H., Kim, J.K., Kim, S.H. (2002), A personalized recommender system based on web
usage mining and decision tree induction. Expert Systems with Applications vol.23, p.329–
342
11. ClickTracks. Retrieved February 12, 2004 from http://www.clicktracks.com/
12. Coenen, F. (2004), The LUCS-KDD Apriori-T Association Rule Mining Algorithm,
http://www.cxc.liv.ac.uk/~frans/KDD/Software/Apriori_T/aprioriT.html, Department of
Computer Science, The University of Liverpool, UK.
13. Cooley, R., Mobasher, B., Srivastava, J. (1999), Data Preparation for Mining World Wide
Web Browsing Patterns. In Knowledge and Information Systems, vol.1(1), p.5-32
14. Hay B., Wets, G., and Vanhoof K. (2003), Segmentation of visiting patterns on websites
using a sequence alignment method. Journal of Retailing and Consumer Services vol.10,
p.145–153
15. Jacobs, N., Heylighen, A., and Blockeel, H. (2001), Dynamic Website Mining. Proceedings
of the European Symposium on Intelligent Technologies, Hybrid Systems and their
implementation on Smart Adaptive Systems, Tenerife (Spain)
16. Jenamani, M., Mohapatra, P.K.J., and Ghose, S. (2003), A stochastic model of e-customer
behaviour. Electronic Commerce Research and Applications vol.2, p.81–94
17. Mixture model implementation within the DIANA project
http://www.cs.vu.nl/ci/DataMine/DIANA/
18. Mobasher, B., Jain, N., Han, E., and Srivastava, J. (1996), Web Mining: Pattern discovery
from World Wide Web transactions. Technical Report TR 96-050, University of
Minnesota, Dept. of Computer Science, Minneapolis
19. Zaki, M. J. (2002), Efficiently Mining Frequent Trees in a Forest. SIGKDD ’02 Edmonton,
Alberta, Canada
20. Nanopoulos, A., and Manolopoulos, Y. (2000), Finding Generalized Path Patterns for Web
Log Data Mining. J. Stuller et al. (Eds.): ADBIS-DASFAA, LNCS 1884, p.215-228
21. Nanopoulos A., Manolopoulos Y. (2001), Mining patterns from graph traversals. Data and
Knowledge Engineering No. 37, p.243-266
22. OneStat.com. Retrieved February 12, 2004 from http://www.onestat.com/
23. Pei, J., Han, J., Mortazavi-asl, B., and Zhu, H. (2000), Mining Access Patterns Efficiently
from Web Logs. Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery
and Data Mining, p.396-407
24. Punin, J.R., Krishnamoorthy, M.S., and Zaki, M.J. (2001), LOGML: Log Markup Language
for Web Usage Mining. Proceedings in WEBKDD Workshop 2001: Mining Log Data
Across All Customer TouchPoints (with SIGKDD01), San Francisco
25. Fielding R., Gettys, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T. Hypertext
Transfer Protocol - HTTP/1.1. Network Working Group. RFC 2616.
26. Berners-Lee, T., Fielding, R. and H. Frystyk. Hypertext Transfer Protocol - HTTP/1.0.
Network Working Group. RFC 1945.
27. Runkler, T.A., and Bezdek, J.C. (2003), Web mining with relational clustering.
International Journal of Approximate Reasoning vol.32, p.217–236
28. Smith, K.A., and Ng, A. (2003), Web page clustering using a self-organizing map of user
navigation patterns. Decision Support Systems vol.35, p.245– 256
29. Spider pattern lists were verified on the services listed below. Retrieved March 17, 2004
from http://www.robotstxt.org (list of well known robots - not up to date) and
http://www.spy-bot.net/list_adware.asp (list of spiders), and using the
http://www.google.com search engine for missing or uncertain spiders.
30. Xing, D., and Shen, J. (2004), Efficient data mining for web navigation patterns.
Information and Software Technology vol.46, p.55–63
31. Yang, Q., Li T.I., and Wang K. (2003), Web-log Cleaning for Constructing Sequential
Classifiers. Applied Artificial Intelligence vol. 17, iss. 5-6, p.431-441
32. Yao, Y., Hamilton, H.J., and Wang, X.W. (2000), PagePrompter: An Intelligent Agent for
Web Navigation Created Using Data Mining Techniques. Technical report, Department of
Computer Science, University of Regina Regina, Saskatchewan, Canada
33. Youssefi, A.H., Duke, D.J., Zaki, M.J., and Glinert, E.P. (2003), Towards Visual Web
Mining. In Proceeding of Visual Data Mining at IEEE Intl Conference on Data Mining
(ICDM), Florida
34. Luotonen, A. (1995), The Common Logfile Format,
http://www.w3.org/pub/WWW/Daemon/User/Config/Logging.html
35. Extended Log File Format - W3C WD-logfile-960221, W3C Working Draft
WD-logfile-960221. http://www.w3.org/TR/WD-logfile.html
36. GNU Software Foundation (1999), Wget. Available at
http://www.gnu.org/software/wget/wget.html.
37. Webtrends, Retrieved February 12, 2004 from http://www.netiq.com/products/log
APPENDIX
APPENDIX A. The uniform resource locator (URL)
Uniform resource locators (URLs) identify resources on the World Wide Web. The syntax of an
HTTP URL is 'http://' host.domain [':'port] [ path ['?' query]] where
- host.domain is the name of the web service (server)
- port is optional (default is 80)
- path is the absolute location of the requested resource on the server (consists of path +
file name + extension with delimiter fields)
- query is a collection of parameters in case of dynamic pages
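The decomposition above can be illustrated with the standard java.net.URI class. This is a minimal sketch, not code from the webmining package; the class name UrlParts and the example URL are assumptions made for illustration.

```java
import java.net.URI;

/** Hypothetical helper that splits an HTTP URL into the parts named above. */
final class UrlParts {
    /** Returns { host, port, path, query }; absent parts get their defaults. */
    static String[] split(String url) {
        URI uri = URI.create(url);
        // getPort() returns -1 when no port is given; HTTP then defaults to 80
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // no path means the root of the server
        }
        String query = uri.getQuery() == null ? "" : uri.getQuery();
        return new String[] { uri.getHost(), String.valueOf(port), path, query };
    }
}
```

For example, UrlParts.split("http://www.cs.vu.nl:8080/~user/index.html?lang=en") yields the host www.cs.vu.nl, the port 8080, the path /~user/index.html and the query lang=en.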
APPENDIX B. Input file structures
1. The structure of the properties file
The properties file contains the main adjustable settings of the webmining package as
key/value pairs, one pair per line. Pairs are delimited by the ‘=’ character. Supported
properties are described in the table below:

Supported properties of the properties file

Database properties
  JDBC_driver_name              Name of the JDBC driver for database connection. e.g., com.mysql.jdbc.Driver
  connection_name               Name of the database connection. e.g., jdbc:mysql://localhost/test
  user_name                     Name of the user for the specified database. e.g., TEST
  user_password                 Password for the user. e.g., test
  log_table_name                Name of the access log table. e.g., cslog
  log_users_table_name          Name of the users table. e.g., users

Properties for data handling
  access_log_path               Path and file name of the (merged) access log file. e.g., c:\log.txt

Properties for transaction filtering
  default_page_name             Name of the default HTML page. e.g., index.html
  accepted_extensions_list      Path and file name of the extension list file.
  spider_engines_list           Path and file name of the spider list file.

Properties for session identification
  time_frame_intervall          Length of time frame (in minutes) for time frame identification. e.g., 30
  group_selector_type           Type of group selector. Possible values: all, subnets, country. They refer to
                                all sessions (all), only sessions generated by a user specified by the file
                                given by the network_range_file_name key (subnets) and sessions generated by
                                a user specified by the file given by the country_list_file_name key (country).
  network_range_file_name       Path and file name for the file specifying a subnet group.
  country_list_file_name        Path and file name for the file specifying a country group.

Properties for data integration
  mapping_table_path            Path and file name for the content mapping table file.
  generated_mapping_table_path  Path and file name for the artificial content mapping table file to be generated.

Properties for geographical statistics
  country_codes_file_name       Path and file name for the country codes file containing the names and short
                                names for most of the countries in the World.

Table 26: Supported properties of the properties file
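A file of this shape can be read with the standard java.util.Properties class. The sketch below is an assumption about how such a file might be loaded, not the code of the package; the class WebminingConfig and its methods are hypothetical, while the keys are those of Table 26.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical loader for a key=value settings file such as webmining.prop. */
final class WebminingConfig {
    /** Reads all key/value pairs; lines starting with '#' are comments. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the session time frame in minutes (30 if the key is missing). */
    static int timeFrameMinutes(Properties props) {
        return Integer.parseInt(props.getProperty("time_frame_intervall", "30").trim());
    }
}
```

One caveat if java.util.Properties is used: it treats the backslash as an escape character, so a Windows path such as c:\log.txt would have to be written as c:\\log.txt or c:/log.txt in the file.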
2. The structure of access log files
The Apache web server of the www.cs.vu.nl domain uses the extended log file format [35] and
writes the following fields into the log files, in order of appearance: remotehost, rfc931,
authuser, date, request, status, bytes, referrer, user_agent. Log files can contain an arbitrary
number of entries, each containing all the fields described above. Each line of the file should
contain exactly one entry, and fields are separated by one or more white spaces. The syntax of
an entry is the following:

Syntax of access log file

  remotehost  character string (e.g., 81.69.10.150)
  rfc931      character string (e.g., -)
  authuser    character string (e.g., -)
  date        given in [dd/MMM/yyyy:HH:mm:ss Z] format (e.g., [30/May/2004:03:30:10 +0200])
  request     character string (e.g., "GET /~fbenmba/straatremixes/images/home.png HTTP/1.1")
  status      integer (e.g., 200)
  bytes       integer (e.g., 12193)
  referrer    character string (e.g., http://www.google.com.br/search?hl=pt-BR&ie=UTF-8&q=Andrew&meta=)
  user_agent  character string (e.g., Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1))

Table 27
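As an illustration of the entry syntax, the following sketch splits one log line into its nine fields with a regular expression. It is not the LogParser class of the package. The class name is hypothetical, and we assume that the referrer and user_agent fields are enclosed in double quotes in the raw file, as the request field is.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for one access log entry (nine fields, see Table 27). */
final class LogLineParser {
    // remotehost rfc931 authuser [date] "request" status bytes "referrer" "user_agent"
    private static final Pattern ENTRY = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+\"([^\"]*)\"\\s+(\\d{3})\\s+(\\d+|-)"
        + "\\s+\"([^\"]*)\"\\s+\"([^\"]*)\"\\s*$");

    // English month names are required for values such as "May"
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Returns the nine fields in order of appearance, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[9];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    /** Parses the date field (without the surrounding brackets). */
    static OffsetDateTime parseDate(String date) {
        return OffsetDateTime.parse(date, DATE);
    }
}
```

Lines that do not match are reported as null so that a caller can count and skip malformed entries instead of aborting the whole load.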
3. The structure of the content types mapping table file
A content type mapping table contains URL/content type pairs. Three types of entries can occur
in a mapping file according to its structure:
Description of the file structure of the content types mapping table

  number of content types   The first non-comment line should contain the number of distinct content
                            types (n). Content types are treated as integers in the range 0 .. n (both
                            ends inclusive).
  URL / content type pairs  An arbitrary number of URL/content type pair entries may follow, one pair
                            per line. The two parts of a pair should be delimited by a white space.
  textual entries           The file may contain comment lines anywhere in the structure, provided that
                            a hash mark (‘#’) stands as the first character of the line.

Table 28
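The three entry types can be read with a few lines of Java. The sketch below is only an illustration of the file structure, not the MappingTable class of the package; the class name is hypothetical and skipping blank lines is our assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reader for a content types mapping table file. */
final class MappingTableReader {
    /** Returns a URL -> content type map built from the given file content. */
    static Map<String, Integer> read(Reader source) {
        Map<String, Integer> table = new HashMap<>();
        int typeCount = -1; // -1 means the header line has not been read yet
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("#") || line.trim().isEmpty()) {
                    continue; // comment line (or blank line, an assumption)
                }
                if (typeCount < 0) {
                    typeCount = Integer.parseInt(line.trim()); // number of content types
                    continue;
                }
                String[] pair = line.trim().split("\\s+");
                if (pair.length != 2) {
                    throw new IllegalArgumentException("Malformed entry: " + line);
                }
                int type = Integer.parseInt(pair[1]);
                if (type < 0 || type > typeCount) {
                    throw new IllegalArgumentException("Content type out of range: " + line);
                }
                table.put(pair[0], type);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```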
4. The structure of the extension filter list file
The extension filter list file contains the extension entries used for filtering out file types
that are not allowed. The structure of such a file is as follows:

Description of the file structure of the extension filter list file

  extension entries  Each non-comment line of the file should contain a specific file extension
                     (without any dots or special marks, e.g., ‘html’).
  textual entries    The file may contain comment lines anywhere in the structure, provided that
                     a hash mark (‘#’) stands as the first character of the line.

Table 29
5. The structure of the spider filter pattern list file
Spider filtering is based on the spider pattern list file, which contains recognizable patterns
for known spiders. The file was compiled by pre-examining the log files of the cs.vu.nl web
server and picking out suspicious user agents and unusual patterns. These patterns were checked
against the pages of spider list providers such as [29].
The file contains spider pattern entries, each of them on a separate line. The structure of such
a file is as follows:

Description of the file structure of the spider pattern list file

  spider pattern entries  Each non-comment line of the file should contain a specific spider pattern.
  textual entries         The file may contain comment lines anywhere in the structure, provided
                          that a hash mark (‘#’) stands as the first character of the line.

Table 30
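A filter based on such a pattern list could look like the sketch below. It assumes that a request is treated as a spider request when its user agent contains one of the patterns, compared case-insensitively. This matching rule and the class name are our assumptions for illustration, not a description of the TransactionFilter class.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical spider filter driven by the lines of a spider pattern list file. */
final class SpiderFilter {
    private final List<String> patterns; // stored in upper case

    /** Builds the filter from raw file lines, skipping comments and blank lines. */
    SpiderFilter(List<String> lines) {
        this.patterns = lines.stream()
            .filter(line -> !line.startsWith("#") && !line.trim().isEmpty())
            .map(line -> line.trim().toUpperCase(Locale.ROOT))
            .collect(Collectors.toList());
    }

    /** Returns true if the user agent contains any known spider pattern. */
    boolean isSpider(String userAgent) {
        String agent = userAgent.toUpperCase(Locale.ROOT);
        return patterns.stream().anyMatch(agent::contains);
    }
}
```

Substring matching is simple but coarse: a short pattern such as BOT also matches any user agent that merely contains these letters, so the pattern list has to be chosen with care.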
APPENDIX C. Experimental details
1. Spider pattern list and rank
Table 31 contains spider patterns ranked by frequency counts during the www.cs.vu.nl access
log file analysis.
Spider pattern list and rank

rank  pattern                  frequency
   1  NET CLR                    1228717
   2  BOT                         576740
   3  WGET                        220265
   4  FUNWEBPRODUCTS              197637
   5  DIGEXT                      194020
   6  CRAWLER                     101244
   7  SLURP                        90338
   8  HOTBAR                       48969
   9  JEEVES                       48511
  10  HTTRACK                      21432
  11  IA_ARCHIVER                  17058
  12  GRUB-CLIENT                   9599
  13  YCOMP                         7246
  14  LIBWWW-PERL                   4476
  15  APPIE                         3705
  16  AVSEARCH                      2563
  17  SPIDER                        1432
  18  TELEPORT                      1316
  19  WEBCOPIER                      794
  20  WEBCOLLAGE                     528
  21  DVD OWNER                      524
  22  FREESURF                       490
  23  LYCOS                          467
  24  ROADRUNNER                     461
  25  YAHOO                          389
  26  SCOOTER                        194
  27  FLASHGET                       187
  28  INFOSEEK                        87
  29  WEBSEARCH                       29
  30  PITA                            23
  31  T-H-U-N-D-E-R-S-T-O-N-E         14
  32  FLUFFY                           4
  33  NETNEWSWIRE                      3
  34  WEBDUP                           2
  35  WEBVAC                           0
  36  VIAS                             0
  37  ZYBORG                           0
  38  TEOMAAGENT                       0
  39  GULLIVER                         0
  40  ARCHITEXT                        0
  41  MERCATOR                         0
  42  ULTRASEEK                        0
  43  MANTRAAGENT                      0
  44  MOGET                            0
  45  MUSCATFERRET                     0
  46  SLEEK                            0
  47  KIT_FIREBALL                     0

Table 31: Spider pattern list and rank
2. Extension list and rank
Table 32 contains extensions ranked by frequency counts during the analysis of www.cs.vu.nl
access log files. We listed here the top 100 most frequent extensions leaving out some unknown
and infrequent items.
Extension list and frequency

rank  extension  freq.
   1  gif        1919367
   2  jpg        1785872
   3  html       1705041
   4  js          353113
   5  png         199905
   6  pdf         138413
   7  css         127857
   8  php         111405
   9  htm         105481
  10  ico          58851
  11  pac          50358
  12  txt          48777
  13  ps           45563
  14  mp3          31337
  15  php3         27459
  16  gz           25263
  17  zip          19123
  18  1            13988
  19  doc          11046
  20  wrl           9831
  21  bmp           9283
  22  3             8254
  23  ppt           7931
  24  2             7851
  25  taz           6833
  26  tar           6772
  27  class         5822
  28  tgz           4860
  29  z             4494
  30  shtml         4470
  31  swf           3945
  32  misc          3699
  33  jpeg          3584
  34  xml           3471
  35  pl            3334
  36  asp           3333
  37  wma           3265
  38  mid           3211
  39  eps           3154
  40  java          2973
  41  tex           2839
  42  c             2485
  43  rdf           2373
  44  jar           2336
  45  8             2332
  46  dcr           2328
  47  5             2108
  48  cgi           2017
  49  9             2011
  50  wmv           2009
  51  xbm           1943
  52  mnx           1935
  53  hs            1909
  54  pas           1893
  55  announce      1878
  56  wav           1858
  57  4             1575
  58  xls           1516
  59  exe           1373
  60  tab           1332
  61  h             1189
  62  spf           1160
  63  xhtml         1083
  64  dvi           1058
  65  1             1013
  66  imp            845
  67  bib            830
  68  xsl            793
  69  stdout         788
  70  rss            770
  71  bak            767
  72  idx            710
  73  smi            699
  74  m              696
  75  dtd            652
  76  pn             592
  77  avi            587
  78  cur            570
  79  nl             530
  80  readme         493
  81  wmz            484
  82  old            428
  83  fst            424
  84  cpp            420
  85  mpg            362
  86  log            344
  87  ref            338
  88  owl            331
  89  rtf            304
  90  au             282
  91  com            233
  92  pps            205
  93  fla            183
  94  sgml           171
  95  aux            166
  96  ram            146
  97  srt            132
  98  mpeg           128
  99  rar            122
 100  bat            119

Table 32: Extension list and frequency
3. Geographical distribution of users visiting www.cs.vu.nl
The table below contains the geographical distribution of users visiting www.cs.vu.nl during the
observed period.
Geographical distribution of users

rank  TLD   count  country
   1  nl    19248  netherlands
   2  net   18299  network infrastructure
   3  com   11457  commercial
   4  fr     3125  france
   5  be     3058  belgium
   6  de     3001  germany
   7  ca     2133  canada
   8  it     2038  italy
   9  uk     1903  united kingdom
  10  au     1852  australia
  11  edu    1803  educational establishments (primarily us)
  12  jp     1532  japan
  13  br     1485  brazil
  14  ch      963  switzerland
  15  mx      935  mexico
  16  pl      878  poland
  17  at      635  austria
  18  fi      610  finland
  19  dk      553  denmark
  20  se      531  sweden
  21  ar      498  argentina
  22  es      491  spain
  23  gr      471  greece
  24  hu      443  hungary
  25  no      393  norway
  26  us      374  united states
  27  org     352  other organizations not clearly falling within the other gtlds
  28  nz      341  new zealand
  29  il      313  israel
  30  ru      302  russian federation
  31  pt      301  portugal
  32  sg      275  singapore
  33  mil     273  us military
  34  cz      261  czech republic
  35  gov     235  us government
  36  tr      233  turkey
  37  cl      228  chile
  38  tw      226  taiwan, province of china
  39  ro      210  romania
  40  in      151  india
  41  hr      128  croatia
  42  sk      116  slovakia
  43  za      114  south africa
  44  ma      104  morocco
  45  lt      102  lithuania
  46  hk      101  hong kong
  47  uy       94  uruguay
  48  ie       93  ireland
  49  th       90  thailand
  50  ee       86  estonia
  51  sa       83  saudi arabia
  52  co       79  colombia
  53  do       79  dominican republic
  54  my       67  malaysia
  55  id       65  indonesia
  56  kr       60  korea, republic of
  57  ua       54  ukraine
  58  si       54  slovenia
  59  yu       49  yugoslavia (now serbia and montenegro, iso code has changed to cs)
  60  ph       49  philippines
  61  is       47  iceland
  62  cy       38  cyprus
  63  bg       34  bulgaria
  64  ve       32  venezuela
  65  lu       28  luxembourg
  66  mu       27  mauritius
  67  int      26  null
  68  cn       24  china
  69  tt       22  trinidad and tobago
  70  lv       19  latvia
  71  py       19  paraguay
  72  ec       19  ecuador
  73  cr       16  costa rica
  74  np       14  nepal
  75  pk       13  pakistan
  76  pe       12  peru
  77  lb       12  lebanon
  78  md       11  moldova, republic of
  79  nu       10  niue
  80  by        9  belarus
  81  fj        9  fiji
  82  ni        9  nicaragua
  83  ke        9  kenya
  84  aw        8  aruba
  85  mz        7  mozambique
  86  mt        6  malta
  87  jo        6  jordan
  88  bn        5  brunei darussalam
  89  bw        5  botswana
  90  arpa      5  address and routing parameter area
  91  cu        5  cuba
  92  qa        5  qatar
  93  na        5  namibia
  94  zw        4  zimbabwe
  95  aero      4  null
  96  kh        4  cambodia
  97  bm        4  bermuda
  98  su        4  null
  99  mk        3  macedonia, the former yugoslav republic of
 100  kz        3  kazakhstan
 101  fo        3  faroe islands
 102  ir        3  iran, islamic republic of
 103  tz        3  tanzania, united republic of
 104  tv        3  tuvalu
 105  to        3  tonga
 106  tg        2  togo
 107  biz       2  null
 108  sv        2  el salvador
 109  al        2  albania
 110  uz        2  uzbekistan
 111  ad        2  andorra
 112  lk        2  sri lanka
 113  om        2  oman
 114  gl        2  greenland
 115  jm        2  jamaica
 116  cc        2  cocos (keeling) islands
 117  mg        2  madagascar
 118  sr        2  suriname
 119  ba        1  bosnia and herzegovina
 120  cx        1  christmas island
 121  nc        1  new caledonia
 122  am        1  armenia
 123  sz        1  swaziland
 124  pa        1  panama
 125  vn        1  viet nam
 126  ls        1  lesotho
 127  ge        1  georgia
 128  ae        1  united arab emirates
 129  pg        1  papua new guinea
 130  rw        1  rwanda
 131  bs        1  bahamas
 132  ao        1  angola
 133  ky        1  cayman islands
 134  sm        1  san marino
 135  bt        1  bhutan
 136  ug        1  uganda
 137  st        1  sao tome and principe
 138  zm        1  zambia
 139  az        1  azerbaijan

Table 33: Geographical distribution of users
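A distribution of this kind can be derived by taking the top-level domain (TLD) of each resolved host name and counting occurrences. The sketch below shows one possible way to do this; the class name and the example host names are hypothetical, and hosts that are still plain IP addresses are simply skipped, which is our assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical counter of users per top-level domain. */
final class TldCounter {
    /** Returns the last dot-separated label in lower case, or "" if it is not alphabetic. */
    static String tld(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return "";
        }
        String last = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        // a numeric last label means an unresolved IP address, not a TLD
        return last.chars().allMatch(Character::isLetter) ? last : "";
    }

    /** Counts hosts per TLD; hosts without a usable TLD are ignored. */
    static Map<String, Integer> count(List<String> hosts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String host : hosts) {
            String tld = tld(host);
            if (!tld.isEmpty()) {
                counts.merge(tld, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Sorting the resulting map by count in descending order and joining it with the country codes file then gives a ranked list like the one in Table 33.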
4. Global tree model of “all visits” by s = 1,3 support threshold
Figure 18
5. Global tree model of “nl” group by s = 1,0 support threshold
Figure 19
6. Global tree model of “other” group by s = 1,5 support threshold
Figure 20
7. Global tree model of “staff” group by s = 1,0 support threshold
Figure 21
8. Global tree model of “student” group by s = 0,8 support threshold
Figure 22
APPENDIX D. Implementation details
All the algorithms required for the tasks described in the thesis were implemented in the Java
language. We used a MySQL database server for data storage and retrieval. Details on the
implementation and database are listed in the table below.
Technical details

Implementation
  language      Java
  package name  webmining
  note          the package is database independent

Database
  name          MySQL
  version       4.0.18
  note          MySQL doesn’t support stored procedures up to version 5.0 (which, while this
                thesis was written, was only in beta stage and as such unstable). This makes
                data processing a bit more difficult and less efficient, because some processing
                steps would ideally run directly inside the database. However, all the tasks
                could be carried out with adequate efficiency.

Table 34
Our webmining package contains six major subpackages: datahandling, dataintegration,
sessionidentification, patterndiscovery, stats and visualization. All the main classes belonging to
these packages are listed below with brief descriptions.
1. Data preparation (cleaning, filtering, loading) – webmining.datahandling package

Main objects of the webmining.datahandling package
  DatabaseConnection   Handles database connection (based on the properties file).
  HostNameLookup       Provides methods for IP address to domain name lookup.
  LoadLog              The main object which manages the cleaning and loading process.
  Log2Database         Loads the prepared transactions into the database.
  LogParser            The parser object which parses the input raw log file into useful Transactions.
  Transaction          This object stores all information of a log entry in parsed format.
  TransactionFilter    Filter object that can filter out useless or not supported transactions.
  TransactionSimple    Simplified transaction object for log information retrieval from the database.
  UpdateDBHostNames    Updates the users table with host names for corresponding remotehost fields.
  UpdateDBIPAddresses  Updates the cslog table’s remotehost fields with IP addresses in case they
                       contain host names.

Data files used by the package
  cslog.txt            Text file containing log entries in raw format.
  webmining.prop       Properties file that contains all the properties needed for the process
                       (e.g., database properties, file paths and file names, etc.).
  extension.flt        This file contains all the file extensions for request URLs that are
                       supported by this project.
  spider.flt           This file contains all known spider engine names or spider patterns for
                       filtering out spider transactions.

Table 35
2. Data preparation (integration) – webmining.dataintegration package

Main objects of the webmining.dataintegration package
  GenerateAMT                     The main process for generating an artificial mapping table using the
                                  GenerateArtificialMappingTable object.
  GenerateArtificialMappingTable  Generates artificial mapping data from the specified access log file,
                                  with randomly added content types in the given interval.
  MappingTable                    Representation of the mapping table. It reads the mapping information,
                                  (URL, content type) entries, from the specified text file and stores
                                  them in an efficiently searchable HashTable.

Data files used by the package
  mapping_table.mtd               Text file containing mapping entries for the specific collection of
                                  documents (HTML pages).
  webmining.prop                  The properties file which contains all the properties needed for the
                                  process (e.g., database properties, file paths and names).

Table 36
3. Data structuring – webmining.sessionidentification package

Main objects of the webmining.sessionidentification package
  GetSessions                The main object which manages the identification process.
  Identifier                 Interface for all identifier objects. Describes that an identifier should
                             make sessions from a given set of user page access entries (from an Array
                             of TransactionSimple objects).
  MFRIdentifier              This is an identifier object, which identifies sessions by the maximal
                             forward reference method.
  SessionFormatPrinter       Provides methods for printing identified user sessions in different
                             output formats into the specified output file.
  TimeFrameIdentifier        This is an identifier object, which identifies sessions using the time
                             frame identification method.
  TransactionDBIterator      This object retrieves user page accesses for every user separately and
  (deprecated class)         invokes the specified identifier on the collected data. As a result it
                             gives back the identified sessions. (It is much slower than the memory
                             iterator, thus this class is no longer used.)
  TransactionMemoryIterator  This object retrieves all the page accesses (or rather their content
                             types) for every user into the memory and invokes the specified identifier
                             with the collected data. As a result it gives back the identified sessions.

Data files used by the package
  webmining.prop             The properties file which contains all the properties needed for the
                             processes (e.g., database properties, file paths and names).

Table 37
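To give an impression of what an identifier does, the following sketch splits the requests of one user into sessions by a time frame. It assumes one common time-oriented heuristic: a new session starts when the gap between two consecutive requests exceeds the time frame. The exact rule used by TimeFrameIdentifier is defined in section 6.3 and may differ; the class name and the use of epoch seconds are also assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical time frame based session splitter for a single user. */
final class TimeFrameSketch {
    /**
     * Splits request times (epoch seconds, ascending) into sessions.
     * A gap larger than timeFrameMinutes closes the current session.
     */
    static List<List<Long>> split(List<Long> requestTimes, int timeFrameMinutes) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long limit = timeFrameMinutes * 60L;
        for (Long time : requestTimes) {
            boolean gapTooLarge = !current.isEmpty()
                && time - current.get(current.size() - 1) > limit;
            if (gapTooLarge) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(time);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }
}
```

With a time frame of 30 minutes, requests at 0, 600, 1200, 4000 and 4100 seconds form two sessions, because the gap between 1200 and 4000 seconds is longer than 1800 seconds.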
4. Profile mining models – webmining.patterndiscovery package
This package contains two subpackages for association rules mining (assoc) and global tree
model (gtm) implementations.
Main objects of the webmining.patterndiscovery.assoc package
  note                     Note that the LUCS-KDD Apriori-T Association Rule Mining Algorithm
                           implemented by Coenen, F. (2004) [12] was put into the webmining package
                           structure without any modification. The following class descriptions are
                           mainly from the documentation of the program.
  AprioriTapp              Fundamental Apriori-T application.
  AprioriTsortedApp        Apriori-T application with input data preprocessed so that it is ordered
                           according to frequency of single items; this serves to reduce the
                           computation time.
  AprioriTsortedPrunedApp  Apriori-T application with data ordered according to frequency of single
                           items and columns representing unsupported 1-itemsets removed; again this
                           serves to enhance computational efficiency.
  AssocRuleMining          Set of general ARM utility methods to allow: (i) data input and input
                           error checking, (ii) data preprocessing, (iii) manipulation of records
                           (e.g., operations such as subset, member, union etc.) and (iv) data and
                           parameter output.
  RuleList                 Set of methods that allow the creation and manipulation (e.g., ordering,
                           etc.) of a list of ARs.
  TotalSupportTree         Methods to implement the "Apriori-T" algorithm using the "Total support"
                           tree data structure (T-tree).
  TtreeNode                Methods concerned with the structure of Ttree nodes. Arrays of these
                           structures are used to store nodes at the same level in any sub-branch of
                           the T-tree. Note this is a separate class to the other T-tree classes,
                           which are arranged in a class hierarchy.

Data files used by the package
  input_session.txt        Plain text file containing the input user sessions in a special format.
                           Page content types within a session are in ascending order with redundant
                           pages removed.

Table 38
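The preparation of a session for association rule mining, i.e. ascending order with redundant pages removed, can be sketched as follows. The class name is hypothetical, and writing the content types separated by single spaces is our assumption about the input format, which is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical conversion of a session into an item set for association rule mining. */
final class AssocInput {
    /** Removes repeated content types and sorts the rest in ascending order. */
    static List<Integer> toItemset(List<Integer> session) {
        // a TreeSet drops duplicates and keeps its elements sorted
        return new ArrayList<>(new TreeSet<>(session));
    }

    /** Formats the item set as one line of space-separated content types (assumed format). */
    static String toLine(List<Integer> session) {
        return toItemset(session).stream()
            .map(String::valueOf)
            .collect(Collectors.joining(" "));
    }
}
```

Note that this step discards the order of visits, which is acceptable for association rules but not for the global tree model, where the order of pages matters.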
Main objects of the webmining.patterndiscovery.gtm package
  GlobalTreeModel    This class provides the representation for the tree model.
  LoadGTM            Initializes the tree model and loads all the user sessions into it. It is also
                     responsible for managing tree visualization.
  SessionTree        SessionTree is a tree structure containing all the sessions for a specific
                     starting page. The whole model consists of as many SessionTrees as there are
                     distinct content types.
  TreeNode           Contains information for one node such as parent and children references,
                     content type and frequency of the node, etc.

Data files used by the package
  input_session.txt  Plain text file containing the input user sessions.

Table 39
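The data structure described in Table 39 can be pictured as a set of prefix trees, one per starting content type, in which every node counts the sessions passing through it. The sketch below only illustrates this idea under that assumption; it is not the code of the gtm package and all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical prefix-tree model of sessions with per-node frequencies. */
final class TreeSketch {
    /** One node: content type, parent and children references, frequency. */
    static final class Node {
        final int contentType;
        final Node parent;
        final Map<Integer, Node> children = new HashMap<>();
        int frequency;

        Node(int contentType, Node parent) {
            this.contentType = contentType;
            this.parent = parent;
        }
    }

    // one tree per starting content type
    private final Map<Integer, Node> roots = new HashMap<>();

    /** Adds a session (sequence of content types) and updates the frequencies on its path. */
    void add(List<Integer> session) {
        if (session.isEmpty()) {
            return;
        }
        Node node = roots.computeIfAbsent(session.get(0), type -> new Node(type, null));
        node.frequency++;
        for (int i = 1; i < session.size(); i++) {
            Node parent = node;
            node = parent.children.computeIfAbsent(session.get(i), type -> new Node(type, parent));
            node.frequency++;
        }
    }

    /** Returns how many sessions start with the given prefix (0 if none). */
    int frequency(List<Integer> prefix) {
        if (prefix.isEmpty()) {
            return 0;
        }
        Node node = roots.get(prefix.get(0));
        for (int i = 1; node != null && i < prefix.size(); i++) {
            node = node.children.get(prefix.get(i));
        }
        return node == null ? 0 : node.frequency;
    }
}
```

Dividing a node frequency by the total number of sessions gives the support of the corresponding navigational sequence, which is the quantity compared against the support threshold s when the trees are visualized.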
APPENDIX E. Content of the CD-ROM
The CD-ROM accompanying this master thesis contains all the input and data files as well as all
the important results produced during the project. It also contains the source and binary code
of the whole webmining package and this master thesis in electronic format. To make browsing
easier we made an HTML user interface for the provided content. It is accessible from
the root of the CD by opening the “index.html” file.