Mining Web Access Logs of an On-line
Newspaper
Paulo Batista and Mário J. Silva
Departamento de Informática,
Faculdade de Ciências – Universidade de Lisboa
Campo Grande
1749-016 Lisboa Portugal
{pb,mjs}@di.fc.ul.pt
Abstract. With the explosive growth of data available on the Internet,
personalization of this information space has become a necessity. An important component of web personalization is the automatic knowledge
extraction from web log files. However, analysis of large web log files is
a complex task not fully addressed by existing web access analyzers. Using commercial software, we applied well-known data mining techniques
(association rules and clustering) to analyze access log records collected
on a web newspaper. This paper identifies several reading patterns and
discusses approaches for mining this data.
1 Introduction
The evolution of the Internet has led to an enormous proliferation of the available information, and the personalization of this information space has become a
necessity. The knowledge obtained by learning web users’ preferences can be used
to improve the effectiveness of web sites by adapting the web information
structure to the users’ behavior. Automatic knowledge extraction from web log
files can be useful for identifying such reading patterns and inferring user profiles.
However, it is hard to find appropriate tools for analyzing raw web log data to
retrieve significant and useful information. There are several commercially available web log analysis tools [1], but most of them are disliked by their users and
considered too slow, inflexible, expensive, difficult to maintain or very limited in
the results they can provide [18].
Recently, the advent of data mining techniques for discovering usage patterns
from web data (a.k.a. web log mining or web usage mining) made it possible to
mine typical user profiles from the vast amount of access logs. Web usage mining
can be viewed as the extraction of usage patterns from access log data containing
the behavior characteristics of users.
Web usage mining can help in addressing some of the shortcomings of the
standard approaches for web personalization. Although the discovery of patterns
from usage data is not by itself sufficient for performing the personalization tasks,
it is an important component for the effective derivation of “actionable” user profiles from web access patterns [11]. The learned knowledge
can also be used for other applications, such as improving site usability, business
intelligence, and usage characterization.
In this paper, we present initial work on web usage mining, describing the
use of data mining techniques to analyze web log records collected from Publico
On-Line [15], a web newspaper. Using commercial data mining software (SPSS
Clementine [7] and IBM Intelligent Miner [9]), we have identified several web
access patterns by applying association rules induction and clustering techniques
to the access logs of this digital publication.
The remainder of this paper is organized as follows: in section 2, we present
the web usage mining concept. Our access logs processing architecture is presented in section 3. Then, in section 4, we present and analyze the results of the
data mining work on Publico On-Line access data. Finally, section 5 summarizes
our conclusions and presents directions for future work.
2 Mining Web Usage Data
Data mining efforts associated with the Web, called Web Mining, can be broadly
categorized into three areas of interest based on which part of the Web to mine:
web content mining, web structure mining, and web usage mining [10]. Web
content mining focuses on techniques for searching the web for documents whose
content meets web users’ queries. Web structure mining is related to the analysis
of the link structure of the web, in order to identify relevant documents. Web
usage mining is defined as the process of applying data mining techniques to the
discovery of usage patterns from web log data, to identify web users’ behavior
[16]. Web content and structure mining are beyond the scope of this work, although
we used a form of structure mining in our preprocessing phase.
In Web mining, data can be collected at the server-side, client-side, proxy
servers, or a consolidated web/business database. In [16], the authors present a
more detailed description of these data sources. To summarize, (i) Web server
logs explicitly record the browsing behavior of site visitors, (ii) client-side data
collection can be implemented by using a remote agent or by modifying the
source code of an existing browser, and (iii) Web proxies act as an intermediate
level of caching between client browsers and Web servers.
The information provided by the data sources described above can be used
to construct several data abstractions, namely users, page-views, click-streams,
and server sessions [17]. A user is defined as a single individual that is accessing
files from one or more Web servers through a browser. In practice, it is very difficult to uniquely
and repeatedly identify users. A user may access the Web through different
machines, or use more than one browser at one time. A page-view consists of
every file that contributes to the display on a user’s browser at one time and is
usually associated with a single user action such as a mouse-click. A click-stream
is a sequential series of page-view requests. Note that any page-view accessed
through a client or proxy-level cache will not be recorded on the server side. A
server session (or visit) is the click-stream for a single user for a particular Web
site. The end of a server session is defined as the point when the user’s browsing
session at that site has ended.
The process of Web usage mining can be divided into three phases: preprocessing, pattern discovery, and pattern analysis [16].
Preprocessing consists of converting usage information contained in the various available data sources into the data abstractions necessary for pattern discovery. Another task is the treatment of outliers, errors, and incomplete data
that can easily occur due to reasons inherent to web browsing. The data recorded
in server logs reflects the (possibly concurrent) access of a Web site by multiple
users, and only the IP address, agent, and server side click-stream are available
to identify users and server sessions. However, it is important to note that the
data collected by server logs may not be entirely reliable because some page-views
may be cached by the user’s browser or by a proxy server. In a Web server
log, all requests from a proxy server have the same identifier, even though the requests potentially represent more than one user. Also, due to proxy server level
caching, the response to a single request to the server could actually be viewed
by multiple users over an extended period of time. The Web server can also store other kinds
of usage information such as cookies, which are markers generated by the Web
server for individual client browsers to automatically track the site visitors.
After each user has been identified (through cookies, logins, or IP/agent
analysis), the click-stream for each user must be divided into sessions. As we
cannot know when the user has left the Web site, a timeout is often used as the
default method of breaking a user’s click-stream into sessions.
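The timeout rule can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the function name `split_into_sessions` is hypothetical, the 30-minute default is an assumed value, and the input is assumed to be the request timestamps of one already-identified user.

```python
from datetime import datetime, timedelta

def split_into_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Split one user's request timestamps into sessions at gaps >= timeout."""
    sessions = []
    for ts in sorted(timestamps):
        # Open a new session on the first request or after a long enough gap.
        if not sessions or ts - sessions[-1][-1] >= timeout:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions

# Made-up click-stream for a single user.
clicks = [
    datetime(2000, 5, 1, 9, 0),
    datetime(2000, 5, 1, 9, 10),
    datetime(2000, 5, 1, 9, 50),  # 40 minutes after the previous request
    datetime(2000, 5, 1, 9, 55),
]
sessions = split_into_sessions(clicks)
print([len(s) for s in sessions])
```

A shorter timeout splits the click-stream into more sessions; the appropriate value depends on the site and is a tuning choice.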
The next phase is the pattern discovery phase. Methods and algorithms used
in this phase have been developed from several fields such as statistics, machine
learning, and databases. This phase of Web usage mining has three main operations of interest: association (i.e. which pages tend to be accessed together),
clustering (i.e. finding groups of users, transactions, pages, etc.), and sequential
analysis (the order in which web pages tend to be accessed). The first two are
the focus of our ongoing work.
Pattern analysis is the last phase in the overall process of Web usage mining.
In this phase the motivation is to filter out uninteresting rules or patterns found
in the previous phase. Visualization techniques are useful to help application
domain experts analyze the discovered patterns.
3 Access Logs Processing Architecture
Publico On-Line is a daily online newspaper. Each edition is constructed by a
generation program that collects all articles, applies formats and constructs a
navigable Web structure with articles grouped in thematic sections.
We have defined a general architecture for web access mining (see Figure 1),
using the site’s Web server logs as data source.
Fig. 1. Overview of a general architecture for Web Access Mining. [Diagram: site
files and server logs feed the data preprocessing stage (data cleaning, user and
session identification, format transformation), which fills the access repository
and the session files; the usage mining stage (statistical analysis with data
distribution estimation, frequent itemset discovery, and clustering into session
clusters) then produces Web access patterns and user profiles.]
The preprocessing phase includes initial preparation tasks that are performed
by a processing agent system [13]. This system performs the following tasks:
noise filtering (i.e. removing irrelevant data such as access errors or image requests),
session identification, and storage in a repository. Session identification consists
of grouping all page-view records from a given IP address collected during user
activity periods (we define inactivity as a period of 30 minutes or longer for
which we have no registered accesses to the Web server). For each valid page-view
(a news article) the agent assigns the corresponding news section based on
site structure information present in the page’s URL. The conceptual schema
of the repository is illustrated in Figure 2. Each article (artigo) is associated with
one section (seccao) and a reader (cliente) accesses one or more articles during
a session.
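The noise filtering and section assignment steps might look like the sketch below. Everything specific in it is an assumption made for illustration: the record fields (`ip`, `url`, `status`), the list of image suffixes, and above all the URL layout `/<section>/<article>.html`, which is not the documented structure of the Publico On-Line site.

```python
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # assumed list

def is_noise(record):
    """True for access errors (status >= 400) and image requests."""
    path = urlparse(record["url"]).path.lower()
    return record["status"] >= 400 or path.endswith(IMAGE_SUFFIXES)

def section_of(url):
    """Return the first path component, assumed here to name the section."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[0].upper() if len(parts) >= 2 else None

# Made-up log records.
records = [
    {"ip": "10.0.0.1", "url": "/desporto/a123.html", "status": 200},
    {"ip": "10.0.0.1", "url": "/img/logo.gif", "status": 200},
    {"ip": "10.0.0.2", "url": "/politica/a456.html", "status": 404},
    {"ip": "10.0.0.2", "url": "/ciencias/a789.html", "status": 200},
]
page_views = [(r["ip"], section_of(r["url"])) for r in records if not is_noise(r)]
print(page_views)
```

Only the two successful article requests survive the filter; each is tagged with the section derived from its URL.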
The sessions identified by the processing agent as described above are called
short sessions. When allowed by the user agent, the web server also registers a
cookie that is accepted by that agent. We call the set of short sessions that share
the same cookie a long session (the accumulation of the user access transactions
grouped by cookie).
To adapt this data to the data structures of the data mining algorithms
used, we transformed log access tables into numerical and Boolean matrices,
where each column corresponds to a newspaper section and each row represents
a session. In numerical matrices, each matrix cell contains the number of articles
accessed for that (session, section) pair; in Boolean matrices a cell is True when
at least one article is accessed in that (session, section) pair.
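As a small illustration of this transformation, the sketch below builds both matrices from a list of (session, section) page-views. The function name `build_matrices`, the session identifiers, and the three-section subset are hypothetical.

```python
from collections import Counter

SECTIONS = ["CIENCIAS", "DESPORTO", "POLITICA"]  # subset, for illustration only

def build_matrices(page_views, sections=SECTIONS):
    """page_views: iterable of (session_id, section), one pair per article read."""
    counts = Counter(page_views)
    session_ids = sorted({sid for sid, _ in counts})
    # One row per session, one column per section.
    numeric = [[counts[(sid, sec)] for sec in sections] for sid in session_ids]
    boolean = [[n > 0 for n in row] for row in numeric]
    return session_ids, numeric, boolean

views = [("s1", "DESPORTO"), ("s1", "DESPORTO"), ("s1", "POLITICA"),
         ("s2", "CIENCIAS")]
ids, numeric, boolean = build_matrices(views)
print(numeric)
print(boolean)
```

The Boolean matrix discards how many articles were read in a section and keeps only whether the section was visited.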
We examined the aggregated data matrices through a set of basic statistical
functions that help in obtaining a preliminary view about the data. For numeric variables we have observed the maximum, minimum, mean, and standard
deviation; for Boolean variables we obtained the frequencies (see Figure 3).
Fig. 2. Conceptual schema of the access record repository: readers (cliente) access
articles (artigo) during a session (sessao); each article belongs to a section (seccao).
These statistics show that the matrices are very sparse, that is, for each session we have a small number of articles and a small number of sections accessed.
For example, 82.8% of the sessions do not have any accessed articles from the
Science section, and in the remaining 17.2% we have an average of 2.3 accessed
articles.
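The per-section summaries can be reproduced on a toy column as follows. This sketch assumes, as the Science example suggests, that the numeric statistics are taken only over sessions with at least one access to the section; it also uses the sample standard deviation, which may differ from what the tools compute. The function names and the data are made up.

```python
from statistics import mean, stdev

def numeric_summary(column):
    """Min, max, mean and sample standard deviation over non-zero cells."""
    nonzero = [v for v in column if v > 0]
    return min(nonzero), max(nonzero), mean(nonzero), stdev(nonzero)

def boolean_mode(column):
    """Most frequent Boolean value and its share of sessions, in percent."""
    false_share = 100.0 * sum(1 for v in column if not v) / len(column)
    if false_share >= 50:
        return False, false_share
    return True, 100.0 - false_share

science_counts = [0, 0, 0, 3, 0, 1, 0, 0, 0, 0]  # made-up column, ten sessions
lo, hi, avg, sd = numeric_summary(science_counts)
mode, share = boolean_mode([v > 0 for v in science_counts])
print(lo, hi, avg, round(sd, 3), mode, share)
```

In this toy column the modal Boolean value is False with a share of 80%, which is the kind of sparsity described above.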
Numerical matrix (left panel):

Name           Minimum  Maximum  Mean     Standard
               Value    Value             Deviation
CIENCIAS       1        97       2.30343  2.81841
CULTURA        1        208      3.78779  5.97421
DESPORTO       1        318      5.69846  10.836
ECONOMIA       1        258      3.93347  7.23418
INTERNACIONAL  1        208      3.38226  5.55397
LOCAL_LISBOA   1        460      5.68833  11.5647
LOCAL_PORTO    1        256      7.59835  13.2351
POLITICA       1        208      3.35767  5.41012
SOCIEDADE      1        367      4.26733  7.9853
EDUCACAO       1        90       2.64958  3.29088

Boolean matrix (right panel):

Name           Modal  Modal
               Value  Frequency (%)
CIENCIAS       F      82.80
CULTURA        F      83.12
DESPORTO       F      67.87
ECONOMIA       F      76.84
EDUCACAO       F      84.98
INTERNACIONAL  F      69.59
LOCAL_LISBOA   F      77.81
LOCAL_PORTO    F      86.47
POLITICA       F      70.86
SOCIEDADE      F      71.70

Fig. 3. Analysis of short sessions. The dominant values of both the numerical matrix
(left) and the Boolean matrix (right) show that most users access a very small number
of articles. Identical results were obtained for long sessions.
4 Mining Publico On-Line Access Data
To study the identification of associations between sections, we used typical
data mining modelling operations. We view our problem of analyzing patterns
of access to groups of news sections as a Market Basket Analysis problem [4].
Discovery of frequent itemsets is one of the techniques used in this kind of
problem. Its aim is to find groups of items that frequently occur together
in transactions. In our problem, transactions are the web accesses and items the
news sections.
4.1 Discovering Frequent Itemsets
Groups of items occurring frequently together in many transactions are referred
to as frequent itemsets [2]. Generally, a support threshold is specified before
mining and is used by the algorithm for pruning the search space. The itemsets
returned by the algorithm satisfy this minimum support threshold.
We have identified frequent sets on Boolean data, defining weak associations
as those below 5% of the total number of occurrences, and strong associations as
those above 10%. We have chosen these values based on a previous study [6].
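The role of the support threshold can be illustrated with a small level-wise search in the spirit of [2]. This is a sketch only and not the algorithm implemented in the commercial tools used here; the function name `frequent_itemsets`, the toy sessions, and the 40% threshold are illustrative assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Level-wise search: candidates of size k are built from items that occur
    in frequent itemsets of size k-1, and a candidate is counted only if all
    of its (k-1)-subsets are frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        alive = sorted({i for s in level for i in s})
        current = [
            frozenset(c) for c in combinations(alive, k)
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

# Made-up sessions, each given as the set of sections it touched.
sessions = [
    {"POLITICA", "SOCIEDADE"},
    {"POLITICA", "SOCIEDADE", "INTERNACIONAL"},
    {"DESPORTO"},
    {"POLITICA", "INTERNACIONAL"},
    {"DESPORTO", "ECONOMIA"},
]
result = frequent_itemsets(sessions, min_support=0.4)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Pruning by support is what keeps the search tractable: the three-section candidate is never counted because one of its pairs is already below the threshold.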
Fig. 4. Discovery of frequent sets applied to short sessions. Strong associations have a
heavy line.
Analysis of the results shows that strong associations on short sessions also
exist on long sessions. This is an expected result, as long sessions accumulate
accesses made in short sessions. For example, we have identified strong associations between Politics (Politica) and Society (Sociedade), Politics (Politica) and
International News (Internacional), and between Society (Sociedade) and International News (Internacional), among other strong associations (see Figure 4).
4.2 Cluster Identification
Groups of sections obtained by frequent itemset analysis give us some interesting associations. However, they show “dependencies” among news sections independently of the type of users’ preferences. Identification of groups of users with
identical preferences requires the extraction of a different kind of access pattern.
We searched for groups of sessions (clusters) that were similar in the sections accessed. Two clustering approaches were available in the data mining
tools we used: demographic clustering (based on Euclidean similarity metrics) and neural
clustering (namely Kohonen self-organizing maps) [9].
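To make the idea of grouping sessions by the sections accessed concrete, the sketch below uses plain k-means with squared Euclidean distance. This is a stand-in chosen for brevity: it is neither the demographic clustering nor the Kohonen map of the tools cited above. The function name `kmeans`, the deterministic initialization, and the toy rows are all assumptions.

```python
from collections import Counter

def kmeans(rows, k, iterations=20):
    """Plain k-means; centroids start at the first k distinct rows."""
    centroids = []
    for row in rows:
        if list(row) not in centroids:
            centroids.append(list(row))
        if len(centroids) == k:
            break

    def nearest(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
        return dists.index(min(dists))

    for _ in range(iterations):
        labels = [nearest(row) for row in rows]
        for j in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(row) for row in rows], centroids

# Boolean session rows (1 = section accessed); columns are
# DESPORTO, INTERNACIONAL, POLITICA. Made-up data.
rows = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
labels, centroids = kmeans(rows, k=2)
shares = {c: 100.0 * n / len(rows) for c, n in Counter(labels).items()}
print(labels, centroids, shares)
```

Reporting each cluster as a share of all sessions, together with the sections that dominate its centroid, gives a summary of the same kind as a "largest clusters" table.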
Figure 5 shows the largest clusters obtained in the analysis of numeric and
Boolean short and long sessions, using both approaches.
Session  Data       Demographic clustering         Neural clustering
type     type
short    numerical  61% all sections except        15% Economy, Science,
                    International                  LocalPorto, Education
short    boolean    14% Sports; 13% International  15% Sports; 12% International
long     numerical  75% all sections except        14% Sports; 13% International
                    Science
long     boolean    13% International; 12% Sports  12% Sports; 13% International
Fig. 5. Largest clusters for short and long sessions in Publico access log data.
Clustering on numerical data shows no evidence of clear reading patterns. We
suspect that we have a very significant number of irregular sessions (outliers)
with sporadic accesses without a defined pattern. This issue will be studied
in future research. Both approaches for clustering Boolean data show similar
reading patterns in short and long sessions. The most frequent clusters are those
that group the accesses to Sports and International sections.
5 Conclusions and Future Work
In this paper we discussed the application of data mining technology to the analysis of access log records collected from a newspaper web site. Using commercial
data mining software systems, we have identified and characterized several reading patterns within the news site. These patterns will define user profiles to be
integrated into a news recommendation system based on web user preferences.
Frequent sets and clustering produce different patterns. Frequent sets show
groups of sections that are frequently accessed together, independently of the user profiles, and clustering shows groups of sections that define similar web usage. Clustering of Boolean and numerical data leads to different results. While for Boolean
data results are similar in both kinds of sessions and clustering approaches, we
obtained different reading patterns for numerical data, or no reading patterns
at all. We detected that a very significant number of sessions consist of a single page-view referred from a site external to the online newspaper. This may
explain why we were able to identify patterns in Boolean data and had more
difficulty when dealing with numerical data. This suggests that to find more
interesting patterns it is necessary to remove these sessions from the repository.
The clustering results on numerical data may also be an outcome of the Euclidean distance-based similarity measures, which are not adequate for mining our
web access data. Previous research indicated that access data to digital libraries
follows a Zipf-like distribution [14]. Commonly used clustering algorithms, such
as K-means, were developed for data samples from Gaussian populations [3]. As
future work, we plan to study more appropriate methods for analyzing web log
data, using different similarity metrics (Minkowski distances, cosine measure and
extended Jaccard similarity), and taking into account the data distribution function.
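For reference, the three candidate measures can be written down in a few lines. The definitions below are the standard ones; the function names and the two session vectors are made up, and the similarity functions assume non-zero vectors (every session has at least one access).

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine(x, y):
    """Cosine of the angle between x and y (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def extended_jaccard(x, y):
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

heavy = [10, 0, 1]  # many articles, mostly in one section
light = [1, 0, 0]   # a single article in that same section
print(minkowski(heavy, light, p=1), minkowski(heavy, light, p=2))
print(cosine(heavy, light), extended_jaccard(heavy, light))
```

The two sessions are far apart under Minkowski distances yet almost identical under the cosine measure, which ignores the number of articles read and compares only the mix of sections; extended Jaccard lies between the two views.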
References
1. Access Log Analyzers, http://www.uu.se/Software/Analyzers/Accessanalyzers.html
2. R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, Proc. of the
20th VLDB Conference, 1994.
3. P. S. Bradley, U. M. Fayyad, Refining Initial Points for K-Means Clustering, Proc. of
the 15th International Conference on Machine Learning, Morgan Kaufmann, 1998.
4. M. Berry, G. Linoff, Data Mining Techniques - For Marketing, Sales and Customer
Support, John Wiley & Sons, 1997.
5. Brian F. J. Manly, Multivariate Statistical Methods, Chapman & Hall, 1986.
6. P. Batista, M. Silva, Prospecção dos Dados de Acesso a um Servidor de Notícias na
Web, 2ª Conferência sobre Redes de Computadores, Portugal, Outubro 1999.
7. Clementine User Guide, Version 5, Integral Solutions Limited, 1998.
8. R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide
Web Browsing Patterns, Knowledge and Information Systems, 1(1), 1999.
9. Using the Intelligent Miner for Data, IBM Corporation, 1998.
10. R. Kosala, H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations,
2(1), July 2000.
11. B. Mobasher, H. Dai, T. Luo, N. Nakagawa, Y. Sun, J. Wiltshire, Discovery of
Aggregate Usage Profiles for Web Personalization, Proc. of the Web Mining for
E-Commerce Workshop (WebKDD’2000), August 2000.
12. B. Mobasher, R. Cooley, J. Srivastava, Automatic Personalization Based on Web
Usage Mining, Communications of the ACM, 43(8), 2000.
13. N. Maria, P. Gaspar, N. Grilo, A. Ferreira, M. Silva, ARIADNE - Digital Library
Architecture, Proc. of the Second European Conference on Research and Advanced
Technology for Digital Libraries, Springer, 1998.
14. J. E. Pitkow, Summary of WWW Characterization, WWW Journal, 2(1), 1999.
15. Publico On-Line, http://www.publico.pt
16. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data, SIGKDD Explorations, 1(2),
Jan 2000.
17. WWW Committee Web Usage Characterization Activity,
http://www.w3.org/WCA, Web Characterization Terminology & Definitions
Sheet, W3C Working Draft, May 1999.
18. O. R. Zaiane, M. Xin, J. Han, Discovering Web Access Patterns and Trends by
Applying OLAP and Data Mining Technology on Web Logs, Proc. of Advances in
Digital Libraries Conference (ADL98), April 1998.