COMP 630H
A Survey on Preprocessing Techniques in
Web Usage Mining
Ke Yiping
Student ID: 03997175
Email: [email protected]
Computer Science Department
The Hong Kong University of Science and Technology
Dec 2003
Abstract
The World Wide Web (WWW) continues to grow at an overwhelming rate, both in the
sheer volume of traffic and in the size and complexity of Web sites. It has therefore
become increasingly necessary, yet difficult, to extract useful information from Web
data in order to understand and better serve the needs of Web-based applications. As a
result, Web usage mining, which combines two prominent research areas, data mining
and the World Wide Web, has become a hot research topic. Web usage mining
consists of three phases, namely preprocessing, pattern discovery, and pattern analysis.
In this survey, I focus on the first phase, data preprocessing, which is essential and must
be performed before data mining algorithms can be applied to the data sources. An
overview of data preprocessing techniques aimed at identifying unique users, user
sessions, and transactions is presented.
Key words: Web Usage Mining, Web Log Mining, Data Preprocessing.
1. Introduction and Background
Web usage mining is the part of Web mining that deals with the extraction of useful
knowledge from the secondary data derived from users' interactions with the Web.
Web usage data include data from Web server access logs, proxy server logs, browser
logs, user profiles, registration data, user sessions or transactions, cookies, user queries,
bookmark data, mouse clicks and scrolls, and any other data resulting from these
interactions. The scope of Web usage mining is local: it spans an individual Web site.
Usage mining tools [2,4,5,6] discover and predict user behavior in order to help the
designer improve the Web site, attract visitors, or give regular users a personalized
and adaptive service. Table 1 summarizes Web usage mining in terms of view of data,
main data, representation, method, mining steps, and applications.
As is true for typical data mining applications, data preprocessing plays a fundamental
role in the whole mining process. As pointed out in [22], this phase is critical for the
success of pattern discovery: errors committed in this phase may render the data
useless for further analysis. In Web usage analysis, data preprocessing includes the
tasks of repairing erroneous data and treating missing values [1, 23]. The preprocessing
of Web usage data, which is mainly Web logs, is usually complex and time consuming;
it can take up to 80% of the time spent analyzing the data.
The tasks for performing preprocessing in Web usage mining described in [1] involve
data cleaning, user identification, session identification, path completion, session
reconstruction, transaction identification, formatting, etc. Other tasks, such as the
construction of conceptual hierarchies and the classification of URLs, as stated in
[2, 3], may be invoked in certain cases. The aim of preprocessing is to offer a
structured, reliable, and integrated data source to the next phase, pattern discovery.
The typical problem is distinguishing among unique users, user sessions, transactions,
etc.
Table 1: Web Usage Mining

View of Data    - Interactivity
Main Data       - Server logs
                - Browser logs
Representation  - Relational table
                - Graph
Method          - Machine learning
                - Statistical
                - (Modified) association rules
Mining Steps    - Data preprocessing
                - Pattern discovery
                - Pattern analysis
Applications    - Site construction, adaptation, and management
                - Marketing
                - User modeling
The rest of this survey presents the tasks and techniques of Web usage mining
preprocessing in detail in Section 2. Section 3 describes some problems in
preprocessing, and Section 4 concludes.
2. Preprocessing Tasks and Techniques
2.1 Data Cleaning
This step removes all the data that are useless for analysis and mining, e.g., requests
for graphical page content (such as jpg and gif images), requests for any other file that
might be embedded in a Web page, and even navigation sessions performed by robots
and Web spiders. The quality of the final results depends strongly on the cleaning
process; appropriate cleaning of the data set has a profound effect on the performance
of Web usage mining. The discovered associations or reported statistics are only useful
if the data represented in the server log give an accurate picture of the user accesses to
the Web site.
The procedures of general data cleaning are as follows:
• First, entries with a status of "error" or "failure" should be removed, which
eliminates noisy data from the data set. This step is straightforward to accomplish.
• Second, access records generated by automatic search engine agents should be
identified and removed from the access log. In particular, log entries created by
so-called crawlers or spiders, which are widely used by Web search engines, must
be identified. Such data contribute nothing to the analysis of user navigation
behavior.
Many crawlers voluntarily declare themselves in the agent field of the access log,
so a simple string match during the data cleaning phase can strip off a significant
amount of agent traffic. In addition, to exclude these accesses, [2] employs
several heuristic methods based on indicators of non-human behavior.
These indicators are:
(i) repeated requests for the same URL from the same host;
(ii) a time interval between requests too short to apprehend the contents of a
page;
(iii) a series of requests from one host, all of whose referrer URLs are empty.
The referrer URL of a request is empty if the URL was typed in,
requested via a bookmark, or requested by a script.
• The last task of data cleaning, which is also disputed, is whether log entries for
image, audio, and video files should be removed. In some systems, such as
WebMiner, all such entries are removed, while in WebLogMiner such data are
kept. Generally, WUM tools applied to user navigation patterns tend to remove
all such entries, because they have little influence on the analysis of users'
navigation behavior even when they are generated by users' clicks.
Typical Web log cleaning methodologies mainly aim at removing image files with
the extensions GIF and JPG (if the analysis does not involve image examination).
Unless the analysis concerns the investigation of media/multimedia files, there are
also masses of other irrelevant files that stay untouched during the whole analysis
process or even have a negative impact on the analysis, such as internal Web
administrator actions and special-purpose files. The complexity of many Web sites
can influence the data cleaning process and impact the final results as well.
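As a minimal sketch of the cleaning rules above (the field layout, status test, and extension list are my own illustrative assumptions rather than the configuration of any surveyed tool):

```python
# Minimal web-log cleaning sketch: drop error/failure entries and
# embedded graphical content, mirroring the rules described above.
# The extension list and field layout are illustrative assumptions.

IRRELEVANT_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}

def keep_entry(status, url):
    """Return True if a log entry should survive cleaning."""
    # Rule 1: remove entries whose status signals an error or failure.
    if status >= 400:
        return False
    # Rule 2 (the disputed one): remove embedded graphical content.
    path = url.split("?")[0].lower()
    if any(path.endswith(ext) for ext in IRRELEVANT_EXTENSIONS):
        return False
    return True

log = [(200, "/index.html"), (200, "/logo.gif"), (404, "/missing.html")]
cleaned = [url for status, url in log if keep_entry(status, url)]
# cleaned == ["/index.html"]
```

In practice the extension list would be tuned to the site and to whether multimedia accesses matter for the analysis at hand.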
In contrast to these typical data cleaning techniques, [9] introduces a novel advanced
cleaning technique. In the case described in [9], frame pages do not have a
characteristic file extension (unlike pictures, with extensions such as jpg and gif), so
after applying the typical data cleaning techniques the results were not meaningful to
the client. The author therefore proposes a technique consisting of two frame
recognition stages: retrieving the HTML code from the Web server, and filtering. The
improved filtering removes pages with no links from other pages. With this
methodology, cleaning covers not just frame pages but all irrelevant pages, leaving
only the essential pages for further analysis and experimental studies.
While requests for graphical contents and files are easy to eliminate, robots' and Web
spiders' navigation patterns must be explicitly identified. This is usually done by
referring to the remote hostname, by referring to the user agent, or by checking
accesses to the robots.txt file. However, some robots send a false user agent in their
HTTP requests. For these cases, [20] proposes an approach that separates robot
sessions from actual users' sessions using standard classification techniques. Highly
accurate models can be obtained after only three requests, using a small set of access
features computed from the Web server logs. A heuristic based on navigational
behavior can also be used to achieve this goal (see [24]).
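The simple agent-field string match and the robots.txt check described above can be sketched as follows (the marker substrings are illustrative assumptions; real crawler lists are much longer):

```python
# Sketch of the two simplest robot-detection heuristics: a declared
# crawler agent string, and an access to robots.txt (which human
# browsers do not fetch). Marker substrings are illustrative only.

KNOWN_BOT_MARKERS = ("crawler", "spider", "bot")

def looks_like_robot(session):
    """session: list of (url, user_agent) requests from one host."""
    for url, agent in session:
        # Heuristic 1: the agent field declares a crawler.
        if any(m in agent.lower() for m in KNOWN_BOT_MARKERS):
            return True
        # Heuristic 2: the host requested the robots.txt file.
        if url == "/robots.txt":
            return True
    return False
```

Sessions flagged by these checks would be stripped during cleaning; robots that falsify their agent field escape both tests, which is exactly the case the classification approach of [20] targets.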
2.2 User Identification
This task is greatly complicated by the existence of local caches, corporate firewalls,
and proxy servers. The data recorded by a Web server are not sufficient to distinguish
among different users or among multiple visits by the same person. The standard
Common Logfile Format (CLF), as well as the Extended CLF (ECLF), records only
the host or proxy IP from which a request originates, so different visitors sharing the
same host or proxy server cannot be distinguished.
Methods such as cookies, user registration, or the remote agent (a cookie-like
mechanism used in [18]) make the identification of a visitor possible. The shortcoming
of such methods is that they rely on users' cooperation, which users often refuse to
provide out of privacy concerns or because they find it troublesome.
In [1], the authors provide some heuristics for user identification. The first heuristic
states that two accesses with the same IP address but different browsers (or browser
versions) or operating systems, both of which are recorded in the agent field, originate
from two different users. The rationale behind this heuristic is that a user navigating a
Web site rarely employs more than one browser, much less more than one OS. This
method, however, causes confusion when a visitor actually does so. The second
heuristic states that when a requested Web page is not reachable by a hyperlink from
any previously visited page, another user with the same IP address exists. This method
introduces a similar confusion when a user types a URL directly or uses a bookmark
to reach pages not connected via links.
Another main obstacle to user identification is the use of proxy servers. Using a
machine name to uniquely identify users can result in several users being erroneously
grouped together as one user. An algorithm presented in [25] checks whether each
incoming request is reachable from the pages already visited; if a page is requested
that is not directly linked to the previous pages, multiple users are assumed to exist on
the same machine. In [17], user session lengths determined automatically from
navigation patterns are used to identify users. Other heuristics use a combination of IP
address, machine name, browser agent, and temporal information to identify users [26].
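The first heuristic of [1], which separates visitors behind a shared IP by their agent field, can be sketched as follows (the log field layout is an assumption for illustration):

```python
# Sketch of agent-based user identification: entries sharing an IP
# address but differing in browser/OS (the agent string) are attributed
# to different users. The field layout is an illustrative assumption.

def identify_users(entries):
    """entries: list of (ip, agent, url); returns {user_key: [urls]}."""
    users = {}
    for ip, agent, url in entries:
        # The (IP, agent) pair stands in for a unique user.
        users.setdefault((ip, agent), []).append(url)
    return users

log = [
    ("1.2.3.4", "Mozilla/4.0 (Win98)", "/a.html"),
    ("1.2.3.4", "Mozilla/4.0 (Win98)", "/b.html"),
    ("1.2.3.4", "Lynx/2.8 (Linux)",   "/a.html"),
]
users = identify_users(log)
# Two distinct users are inferred behind the single IP 1.2.3.4.
```

As noted above, this heuristic misfires when one person genuinely switches browsers, and the topology-based second heuristic would be layered on top in a full implementation.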
2.3 User Session Definition
A user session here is the set of consecutive pages visited by a single user within a
defined duration. For logs that span long periods of time, it is very likely that users
visit the Web site more than once. The goal of session identification is to divide the
page accesses of each user into individual sessions. As with user identification, cookie
and session mechanisms can both be used for session identification, yet the same
problems also exist.
To define user session, two criteria are usually considered:
1. Upper limit of the session duration as a whole;
2. Upper limit on the time spent visiting a page.
Generally the second criterion is more practical and has been used in WUM and
WebMiner. It is implemented through a timeout: if the time between page requests
exceeds a certain limit, the user is assumed to be starting a new session. Many
commercial products use 30 minutes as a default timeout, and [27] established a
timeout of 25.5 minutes based on empirical data. Once a site log has been analyzed
and usage statistics obtained, a timeout appropriate for the specific Web site can be
fed back into the session identification algorithm.
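A minimal sketch of this timeout-based session identification, using the common 30-minute default (timestamps are plain seconds for simplicity):

```python
# Sketch of timeout-based sessionizing: a gap longer than the threshold
# between consecutive requests of one user starts a new session.

def sessionize(requests, timeout=30 * 60):
    """requests: list of (timestamp_seconds, url), sorted by time."""
    sessions = []
    for ts, url in requests:
        # Open a new session on the first request or after a long gap.
        if not sessions or ts - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((ts, url))
    return sessions

reqs = [(0, "/a"), (60, "/b"), (4000, "/c")]  # 4000 - 60 > 1800 s
# sessionize(reqs) splits this into two sessions: {/a, /b} and {/c}.
```

Replacing the default with an empirically derived value, such as the 25.5 minutes of [27], is just a change of the `timeout` argument.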
2.4 Path Completion
In order to reliably identify unique user sessions, it must be determined whether there
are important accesses that are not recorded in the access log. The main cause of such
missing accesses is caching. Mechanisms such as local caches and proxy servers can
severely distort the overall picture of a user's traversal through a Web site. If a user
clicks "back" to revisit a page whose copy is stored in the cache, the browser retrieves
the page directly from the cache. Such a page view never appears in the access log,
leading to the problem of incomplete paths, which need mending.
Current methods for overcoming this problem include the use of cookies, cache
busting, and explicit user registration. As detailed in [26], none of these methods is
without serious drawbacks: cookies can be deleted by the user; cache busting defeats
the speed advantage that caching was created to provide and can be disabled; and user
registration is voluntary, with users often providing false information.
Methods similar to those used for user identification can be used for path completion.
This task requires the referrer log and the site topology, along with temporal
information, to infer missing references. If the referrer URL of a requested page does
not match the last page directly requested, the requested path is incomplete.
Furthermore, if the referred page's URL is in the user's recent request history, we can
assume that the user clicked the "back" button to visit the page; if the referred page is
not in the history, a new user session begins, as stated above. The incomplete path can
then be mended using heuristics based on the referrer and the site topology.
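A rough sketch of this referrer-based path completion, which replays assumed back-button clicks until the referrer is reached (the site-topology checks of the full heuristic are omitted for brevity):

```python
# Sketch of referrer-based path completion. When the referrer of a
# request is not the last page in the path, earlier pages are re-inserted
# as assumed back-button views served from the cache.

def complete_path(requests):
    """requests: list of (url, referrer); returns the inferred full path."""
    path = []
    for url, ref in requests:
        if path and ref is not None and path[-1] != ref:
            # Replay cached back-button views until the referrer appears.
            for page in reversed(path[:-1]):
                path.append(page)
                if page == ref:
                    break
        path.append(url)
    return path

reqs = [("/a", None), ("/b", "/a"), ("/c", "/b"), ("/d", "/a")]
# Inferred path: /a /b /c /b /a /d  (the /b and /a revisits were cached).
```

A full implementation would also verify against the site topology that each inserted hop actually carries a hyperlink to the next request.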
2.5 User Session Reconstruction
This step reconstructs the users' faithful navigation paths within the identified
sessions. Errors in the reconstruction of sessions and incomplete tracing of users'
activities in a site can easily result in invalid patterns and wrong conclusions.
By using additional information about the Web site structure, it is still possible to
reconstruct a consistent path by means of heuristics. Heuristic methods for session
reconstruction must fulfill two tasks: first, all activities performed by the same physical
person should be grouped together; second, all activities belonging to the same visit
should be placed in the same group. Knowledge of a user's identity is not necessary
to fulfill these tasks. However, a mechanism for distinguishing among different users
is indeed needed.
For the achievement of session reconstruction, there are several “proactive” and
“reactive” strategies.
• Proactive strategies include user authentication, the activation of cookies that are
installed at the user’s work area and can thus attach a unique identifier to her
requests, as well as the replacement of a site’s static pages with dynamically
generated pages that are uniquely associated with the browser that invokes them.
These strategies aim at an unambiguous association of each request with an
individual before or during the individual’s interaction with the Web site.
• Reactive strategies exploit background knowledge on user navigational behavior
to assess whether requests registered by the Web server can belong to the same
individual, and whether these requests were performed during the same or
subsequent visits of the individual to the site. These strategies attempt to
associate requests to individuals after the interaction with the Web site, based on
the existing, incomplete records.
[7] evaluates the performance of heuristics employed to reconstruct sessions from
server log data. Such heuristics partition activities first by user and then by the user's
visits to the site, where user identification mechanisms, such as cookies, may or may
not be available.
2.6 Transaction Identification
Before any mining is done on Web usage data, sequences of page references must be
grouped into logical units representing Web transactions. A transaction differs from a user
session in that the size of a transaction can range from a single page reference to all of the
page references in a user session, depending on the criteria used to identify transactions.
Unlike traditional domains for data mining, such as point of sale databases, there is no
convenient method of clustering page references into transactions smaller than an entire
user session. This problem has been addressed in [28] and [17].
How transactions are defined depends on what kind of knowledge we want to mine.
WUM [2, 3] treats the user session described above as a duration-based session and
the transaction as a structure-based session. Since WUM aims at analyzing the
navigation patterns of users, it is reasonable to take sessions as transactions. WebMiner
[1, 17], by contrast, concentrates on association rule and sequential pattern mining, so
it tends to divide a user session into transactions, each a set of semantically related
pages.
In order to divide user sessions into transactions, WebMiner classifies the pages in a
session into auxiliary pages and content pages. Auxiliary pages merely facilitate a
user's browsing while searching for information; content pages are those the user is
interested in and really wants to reach. Using the concept of auxiliary and content
page references, there are two ways to define transactions. The first defines a
transaction as all the auxiliary references up to and including each content reference
for a given user, a so-called auxiliary-content transaction. The second defines a
transaction as all of the content references for a given user, a so-called content-only
transaction. Based on these two definitions, WebMiner employs two methods to
identify transactions: one is reference length; the other is maximal forward reference.
It also uses a time window as a benchmark to evaluate the two methods [1, 17, 16].
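The maximal forward reference method, originally from [28], can be sketched as follows: a transaction is the forward path accumulated until the user first backtracks to an already-visited page.

```python
# Sketch of the maximal-forward-reference method: each transaction is
# the forward path up to the point where the user first backtracks to a
# previously visited page; backtracking truncates the path and a new
# forward path grows from there.

def maximal_forward_references(session):
    """session: list of page URLs in visit order."""
    transactions, path, backtracking = [], [], False
    for page in session:
        if page in path:
            # Backward reference: emit the completed forward path once,
            # then truncate back to the revisited page.
            if not backtracking:
                transactions.append(list(path))
            path = path[: path.index(page) + 1]
            backtracking = True
        else:
            path.append(page)
            backtracking = False
    if path and not backtracking:
        transactions.append(path)
    return transactions

# The traversal A B C B D yields the transactions [A, B, C] and [A, B, D].
```

The reference-length method would instead use the time spent on each page to decide whether it was auxiliary or content; this sketch covers only the structural variant.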
2.7 Formatting
Once the appropriate preprocessing steps have been applied to the server log, a final
preparation module can be used to properly format the sessions or transactions for the
type of data mining to be accomplished. For example, since temporal information is not
needed for the mining of association rules, a final association rule preparation module
would strip out the time for each reference, and do any other formatting of the data
necessary for the specific data mining algorithm to be used.
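As a small illustration of such a formatting module (the input layout is an assumption), timestamps can be stripped and each transaction reduced to a set of pages for association rule mining:

```python
# Sketch of a final formatting step for association-rule mining:
# temporal information is dropped and each transaction becomes a
# sorted, de-duplicated list of page references.

def format_for_association_rules(transactions):
    """transactions: list of lists of (timestamp, url) pairs."""
    return [sorted({url for _, url in t}) for t in transactions]

txns = [[(10, "/a"), (25, "/b"), (40, "/a")]]
# format_for_association_rules(txns) -> [["/a", "/b"]]
```

A sequential-pattern module would instead keep the ordering, showing why formatting is tied to the specific mining algorithm that follows.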
[29] stores data extracted from Web logs in a relational database using a click fact
schema, so as to better support log queries aimed at frequent pattern mining. [30]
introduces a method based on a signature tree to index logs stored in databases for
efficient pattern queries. A tree structure, the WAP-tree, is introduced in [31] to
register access sequences to Web pages; the authors also optimize this structure to
support their sequence mining algorithm.
3. Problems in Preprocessing
Clearly, improved data quality can improve the quality of any analysis performed on
the data. Data quality is a source of major difficulties for Web usage mining, and
particularly for the sessionizing problem. A dedicated server recording all activities
of each user individually has been put forward as a tested and successful solution. In
the absence of such a server, cookies and scripts can be used to distinguish among
unique users; however, they are not always feasible or popular, due to privacy
considerations. A problem inherent in the Web domain is the conflict between the
needs of analysts (who want more detailed usage data collected) and the privacy
needs of users (who want as little data collected as possible). In addition, caching and
proxy servers can make reconstructing reliable sessions difficult.
4. Conclusion
In this survey, I have addressed a typical sequence of Web usage mining preprocessing
tasks. The techniques and methods used in each individual task have also been
presented in detail. One important point is that the tasks rely heavily on one another;
in practical applications, some of them are carried out together and are not clearly
distinguished from each other. Moreover, for specific mining applications, the
preprocessing procedure may vary slightly. For the application of Web personalization
[14], for example, the preprocessing steps may include data selection, cleaning, and
transformation, as well as the identification of users and user sessions.
Acknowledgments
I would like to express my gratitude to Dr. Wilfred Ng for guiding this survey. I also
thank Lu An for the interesting discussions.
References
[1] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data Preparation for
Mining World Wide Web Browsing Patterns. Knowledge and Information Systems,
1(1), 1999.
[2] B. Berendt, M. Spiliopoulou. Analyzing navigation behavior in Web sites integrating
multiple information systems. VLDB Journal, Special Issue on Databases and the Web,
9(1), 2000, 56-75.
[3] M. Spiliopoulou. Web usage mining for Web site evaluation. Communications of the
ACM, 43(8), 2000, 127-134.
[4] R. Cooley, M. Deshpande, J. Srivastava, and P.N. Tan. Web usage mining: Discovery
and applications of usage patterns from web data. ACM SIGKDD Explorations, 1(2),
January 2000.
[5] Kdnuggets. Software for web mining. http://www.kdnuggets.com/software/web.html.
[6] M. Spiliopoulou and L.C. Faulstich. WUM: a Web Utilization Miner. In Proceedings
of the EDBT Workshop WebDB98, volume 1590 of LNCS, pages 109-115, 1998.
[7] B. Berendt, B. Mobasher, M. Spiliopoulou, and M. Nakagawa. A framework for the
evaluation of session reconstruction heuristics in web usage analysis. INFORMS Journal
of Computing, 15(2), 2003.
[8] L.C. Faulstich, et al. WUM: A Tool for Web Utilization Analysis. In EDBT Workshop
WebDB’98. 1998, Spain.
[9] Z. Pabarskaite. Implementing Advanced Cleaning and End-User Interpretability
Technologies in Web Log Mining. In Proceedings of the 24th International Conference ITI
2002.
[10] A. Abraham and Xiaozhe Wang. i-miner: A web usage mining framework using
hierarchical intelligent systems. In The IEEE International Conference on Fuzzy Systems
FUZZ-IEEE'03, 2003.
[11] T. Ohmori, Y. Tsutatani, M. Hoshi. A Novel Datacube Model Supporting Interactive
Web-log Mining. In Proceedings of the First International Symposium on Cyber
Worlds, 2002, 419-427.
[12] R. Kosala, H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations:
Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining,
ACM, 2(1), 2000.
[13] B. Masand and M. Spiliopoulou. Webkdd-99: Workshop on web usage analysis and
user profiling. SIGKDD Explorations, 1(2), 2000.
[14] M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, and F. Turini. Preprocessing and
Mining Web Log Data for Web Personalization. The 8th Italian Conf. on Artificial
Intelligence : 237-249. Vol. 2829 of LNCS, September 2003.
[15] R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern
Discovery on the World Wide Web. In Proceedings of the 9th IEEE International
Conference on Tools With Artificial Intelligence (ICTAI ’97), Newport Beach, CA,
1997.
[16] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide
Web Browsing Patterns. Knowledge and Information Systems, Volume 1, No. 1, 1999.
[17] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Grouping Web Page
References into Transactions for Mining World Wide Web Browsing Patterns (1997), in
Proceedings of the 1997 IEEE Knowledge and Data Engineering Exchange Workshop
(KDEX-97), November 1997.
[18] C. Shahabi, A. M. Zarkesh, J. Adibi, V. Shah. Knowledge discovery from users'
Web-page navigation. In Proceedings of the Seventh International Workshop on
Research Issues in Data Engineering, April 1997, 20-29.
[19] Federico Michele Facca and Pier Luca Lanzi. Recent Developments in Web Usage
Mining Research. Data Warehousing and Knowledge Discovery: 5th International
Conference, DaWaK 2003, 140-150.
[20] Pang-Ning Tan, and Vipin Kumar. Modeling of Web Robot Navigational Patterns. In
WebKDD 2000 - Web Mining for E-commerce - Challenges and Opportunities, Second
International Workshop, August 2000.
[21] Ron Kohavi, et al. Lessons and Challenges from Mining Retail E-Commerce Data.
Machine Learning, 2003.
[22] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, 1999.
[23] R. Cooley, P. Tan, J. Srivastava. 2000. Discovery of interesting usage patterns from
Web data. B. Masand, M. Spiliopoulou, eds. Advances in Web Usage Analysis and User
profiling. LNAI 1836, Springer, Berlin, Germany. 163-182.
[24] Pang-Ning Tan, and Vipin Kumar. Discovery of web robot sessions based on their
navigational patterns. Data Mining and Knowledge Discovery, 6(1): 9-35, 2002.
[25] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures
from the web. In Proceedings of the 1996 Conference on Human Factors in Computing
Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.
[26] J. Pitkow. In search of reliable usage data on the www. In Sixth International World
Wide Web Conference, pages 451-463, Santa Clara, CA, 1997.
[27] L. Catledge, J. Pitkow. Characterizing browsing behaviors on the World Wide Web,
Computer Networks and ISDN Systems 27(6), 1995, pp. 1065-1073.
[28] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web
environment. In Proceedings of the 16th International Conference on Distributed
Computing Systems, pages 385-392, 1996.
[29] Jesper Andersen, Anders Giversen, Allan H. Jensen, Rune S. Larsen, Torben Bach
Pedersen, and Janne Skyt. Analyzing clickstreams using subsessions. In International
Workshop on Data Warehousing and OLAP (DOLAP 2000), 2000.
[30] Alexandros Nanopoulos, Maciej Zakrzewicz, Tadeusz Morzy, and Yannis
Manolopoulos. Indexing web access-logs for pattern queries. In Fourth ACM CIKM
International Workshop on Web Information and Data Management (WIDM’02), 2002.
[31] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu. Mining access patterns
efficiently from web logs. In Pacific-Asia Conference on Knowledge Discovery and Data
Mining, pages 396-407, 2000.