Analysis of Web Usage Mining
Hui Yu, Zhongmin Lu
School of Management,
South-Central University for Nationalities, Wuhan, P.R.China, 430074
Business School
Wuhan Institute of Technology, Wuhan, P.R.China, 430074
Abstract
Web usage mining is an effective means of analysing web usage data and understanding the needs of
Web-based applications. Once the data have been analysed, hidden customers can be found, the hyperlink
structure of a web site can be optimised, adaptive web sites can be created, and so on. The paper presents an
overview of academic research in web mining and then focuses on web usage mining and the related
technologies and tools used to study online consumer behaviour.
Key words: data mining; Web mining; Web usage mining
1 Introduction
The World Wide Web is one of the most heavily used services of the Internet. It is used by a variety of
different communities to meet all kinds of information needs (Wagner, 2001). For example, students
may use the Web to find out more about their university, certain departments or particular lectures and
projects. Another example in the academic context is research teams that use the Web to
find out about related work. Some companies use the Web to find out more about their competitors;
some individuals use it to keep up to date with the latest news, according to their interests.
However, it has been suggested that most designers do not consider tracking web site traffic even after
considerable time, effort and perhaps money have been spent on designing the site and creating the
content. Potentially useful information, such as who comes to the site, where they come from,
when they come and how they find the site, is ignored or lost. Individuals themselves often find
it difficult to recollect their own personal history of Web navigation. An automated system that
records and analyses this information therefore offers considerable advantages and potential.
To provide some answers and insight into this area, the paper introduces some web usage mining
technologies. To place these technologies in context, Web mining as a whole is analysed
first.
2 What is Web Mining?
Web mining is a natural combination of data mining and the WWW; it can be broadly defined as the
discovery and analysis of useful information from the World Wide Web (Cooley, 1997).
Etzioni (1996) suggests decomposing Web mining into the following subtasks:
a. Resource Discovery: the task of retrieving the intended information from the Web.
b. Information Extraction: automatically selecting and pre-processing specific information from the
retrieved Web resources.
c. Generalisation: automatically discovering general patterns both at individual Web sites and across
multiple sites.
d. Analysis: analysing the mined patterns.
Madria, Bhowmick, Ng, and Lim (1999) claim the Web involves three types of data: data on the Web
(content), Web log data (usage) and Web structure data. According to the type of data from which
information is obtained, Web mining can be divided into three major parts: Web Content Mining, Web
Structure Mining, and Web Usage Mining.
• Web Content Mining
Web content mining describes the automatic search of information resources available online. In the Web
mining domain, Web content mining is essentially an analogue of data mining techniques for relational
databases, since it is possible to find similar types of knowledge from the unstructured data residing in
Web documents (Wang, 2000). A Web document usually contains several types of data, such as text,
image, audio, video, metadata and hyperlinks. Some of these data are semi-structured, such as HTML
documents, or more structured, such as the data in tables or database-generated HTML pages, but
most of the data is unstructured text. The unstructured character of Web data forces Web
content mining towards a more complicated approach.
• Web Structure Mining
Most Web information retrieval tools use only the textual information and ignore the link
information, which could be very valuable. The goal of Web structure mining is to generate a structural
summary of the Web site and Web page.
Web structure mining can also take another direction: discovering the structure of the Web document
itself (Madria, Bhowmick, Ng, and Lim, 1999). This type of structure mining can be used to reveal the
structure (schema) of Web pages. This is useful for navigation and makes it possible to
compare and integrate Web page schemas.
Another task of Web structure mining is to discover the nature of the hierarchy or network of hyperlinks
in the Web sites of a particular domain. This may help to generalise the flow of information in Web sites
that represent that domain, so that query processing becomes easier and more
efficient.
Web structure mining has a natural relation with Web content mining, since Web documents are very
likely to contain links and both tasks use the real or primary data on the Web. The two
mining tasks are therefore often combined in one application (Wang, 2000).
• Web Usage Mining
Web usage mining tries to discover useful information from the secondary data derived from the
interactions of users while they surf the Web. It focuses on techniques that can predict user
behaviour while the user interacts with the Web. Spiliopoulou (1999) suggests that the potential strategic
aims in each domain translate into the following mining goals: prediction of the user’s behaviour within
the site; comparison between expected and actual Web site usage; and adjustment of the Web site to the
interests of its users. In the data preparation process of Web usage mining, the Web content and Web
structure are used as information sources.
3 Why Web Usage Mining?
In this paper emphasis is placed on Web usage mining. The reason is simple: with the explosion
of e-commerce, the way in which companies do business has changed. E-commerce, mainly
characterised by electronic transactions through the Internet, has provided a cost-efficient and effective way
of doing business. Unfortunately, to most companies the Web is seemingly nothing more than a
mysterious place where transactions take place. They perhaps do not realise that, as millions of visitors
interact daily with Web sites around the world, massive amounts of data are being generated. It is
arguable that, with the exception of the major players in electronic commerce, most businesses do not
realise the value this information could have for the company in understanding customer
behaviour, improving customer service and relationships, launching targeted marketing campaigns and
measuring the success of marketing efforts.
4 How to Perform Web Usage Mining
Web usage mining is achieved by reporting visitor traffic information based on web server log files.
Web server log files were initially used by web designers or system administrators to find out
“how much traffic they are getting, how many requests fail, and what kind of errors is being generated”.
However, web server log files can also record and trace visitors’ online behaviour (Cooley, 1997).
For example, after some basic traffic analysis, the log files can help designers or administrators to
answer questions such as “from what search engine are visitors coming? What pages are the most
popular? Which browsers and operating systems are most commonly used by visitors?”
After the Web traffic data is obtained, it may be combined with relational databases. Through
data mining techniques such as association rules and path analysis, visitors’ behaviour patterns
can be found and interpreted.
The above is a brief explanation of how web usage mining is done. Most sophisticated systems and
techniques divide the work into three distinct processes: pre-processing, pattern discovery, and pattern
analysis (Galeas, 2001). Each process can be categorised as follows (see Figure 1):
[Figure 1 (diagram). WEB USAGE MINING is divided into:
  Data Pre-processing: Web transactions
  Pattern Discovery Tools: Converting IP addresses to Domain Names; Converting File Names to Page Titles; Statistical Analysis; Association rules; …
  Pattern Analysis Tools: Visualisation Techniques; OLAP Techniques; Data & Knowledge Querying; Usability Analysis]
Figure 1 Research Areas in Web Usage Mining (Galeas, 2001)
4.1 Data Pre-processing
Data preparation involves finding the answers to a number of questions such as: “How do you mine data
that is not in the right form? How do you handle data that is not entirely clean?” To indicate where
possible solutions may be found, Groth (1999) suggests a number of factors to be considered. These
are as follows:
• Data is not always clean. For example, a column containing a list of soft drinks may have the values
“Pepsi”, “Pepsi Cola”, and “Cola”. The values refer to the same drink, but are not known to the
computer as being one and the same. This is a consistency problem.
• Another cleaning issue is stale data. Mailing lists have to be continually updated because people
move and their addresses change. An old address that is no longer correct is often referred to as
stale.
• Another data-cleaning issue is typographical errors. Words are frequently misspelled or typed
incorrectly.
This information needs to be integrated to form a complete data set for data mining. However, before
the data are integrated, web log files need to be cleaned and filtered: the raw data are filtered to
eliminate irrelevant items, and individual page accesses are grouped into logical units.
Web traffic analysis depends on filtering the raw data to eliminate
irrelevant items. Mobasher (1997) suggests that irrelevant items can be eliminated by
checking the suffix of the URL name. For example, requests for embedded graphics, whose suffix is
usually “gif”, “jpeg”, “jpg”, “GIF”, “JPEG” or “JPG”, can be removed from the web log
file.
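The suffix check described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample URLs are hypothetical, and the suffix list contains just the three image types named in the text.

```python
from urllib.parse import urlsplit

# Suffixes treated as irrelevant. Only the image types named in the text are
# listed; a real filter would probably need more (an assumption).
IRRELEVANT_SUFFIXES = ("gif", "jpeg", "jpg")


def is_relevant(url):
    """Return True unless the URL path ends in an irrelevant suffix."""
    path = urlsplit(url).path  # drops any query string or fragment
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return suffix not in IRRELEVANT_SUFFIXES


def clean_requests(urls):
    """Keep only the requested URLs that pass the suffix check."""
    return [url for url in urls if is_relevant(url)]


if __name__ == "__main__":
    sample = ["/index.html", "/images/logo.GIF", "/photo.jpg?size=2", "/about"]
    print(clean_requests(sample))
```

Lower-casing the suffix covers both the lower-case and upper-case forms listed in the text with a single comparison.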
Before any mining is done on Web usage data, sequences of page references must be grouped into
logical units representing Web transactions or user sessions (Kizhakke, 2000). A user session consists of
all the page references made by a user during a single visit to a site. Identifying user sessions is similar
to the problem of identifying individual users. Hence, assumptions are made depending on the final
purpose of the mining process.
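One common assumption of this kind is a time-out: if the same visitor is silent for longer than some threshold, the next request starts a new session. The sketch below illustrates that idea only. The 30-minute threshold, the use of the IP address as visitor identifier and the function name are all assumptions, not something the cited authors prescribe.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds; not taken from the paper.
SESSION_TIMEOUT = 30 * 60


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, unix_time, url) records into per-visitor sessions.

    Returns a list of (visitor_id, [url, ...]) pairs. A new session starts
    whenever the gap since the visitor's previous request exceeds `timeout`.
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp, url in requests:
        by_visitor[visitor].append((timestamp, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()  # order each visitor's requests by time
        current, last_time = [], None
        for timestamp, url in hits:
            if last_time is not None and timestamp - last_time > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(url)
            last_time = timestamp
        sessions.append((visitor, current))
    return sessions


if __name__ == "__main__":
    log = [("10.0.0.1", 0, "/a"), ("10.0.0.1", 60, "/b"),
           ("10.0.0.2", 30, "/a"), ("10.0.0.1", 4000, "/c")]
    for visitor, pages in sessionize(log):
        print(visitor, pages)
```

Identifying a visitor by IP address alone is a simplification, since proxies and shared machines can hide several users behind one address.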
4.2 Pattern Discovery
This is the key component of Web mining (Wang, 2000). Pattern discovery tools apply
techniques from data mining, psychology, and information theory to the Web traffic data collected.
Once user sessions have been identified, several types of access pattern mining can be performed,
depending on the needs of the analyst. Some of these discovery techniques are:
• Converting IP Addresses to Domain Names
Every visitor to a Web site connects to the Internet through an IP address (for example, 157.228.102.1).
An IP address usually has a corresponding domain name (for example, osiris.sunderland.ac.uk), and the
two are linked through the Domain Name System (DNS). DNS converts a domain name that a visitor enters
in a Web browser into the corresponding IP address, and a reverse lookup converts an IP address back
into a domain name.
When the IP address is converted into the domain name, some knowledge can be
discovered. For example, where visitors live can be estimated by looking at the
extension of each visitor’s domain name, such as .ca (Canada), .au (Australia) or
.cn (China).
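A minimal Python sketch of this step is given below. The reverse lookup uses the standard library call `socket.gethostbyaddr`, which needs network access and fails for addresses without a reverse DNS record. The country table holds only the three extensions named in the text, and the function names are hypothetical.

```python
import socket

# Illustrative table containing only the extensions mentioned in the text.
COUNTRY_BY_EXTENSION = {"ca": "Canada", "au": "Australia", "cn": "China"}


def reverse_lookup(ip_address):
    """Return the host name registered for an IP address, or None.

    Requires a working resolver; many addresses have no reverse entry.
    """
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None


def country_from_hostname(hostname):
    """Guess a country from the last label of a host name, or None."""
    if not hostname:
        return None
    extension = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return COUNTRY_BY_EXTENSION.get(extension)


if __name__ == "__main__":
    # "proxy.example.ca" is a made-up host name used only for illustration.
    print(country_from_hostname("proxy.example.ca"))
```

The estimate is rough: generic extensions such as .com or .net carry no country information, so the function returns None for them.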
• Converting File Names to Page Titles
A well-designed site will have a title (between <title> and </title>) for every page. Rather than
simply reporting the file names (URLs) requested, a good system is able to look at these files and
thereby determine their titles. Page titles are much easier to read than URLs, so a good system
should show page titles on reports in addition to URLs.
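As a rough illustration, the title of a page can be read with Python's standard `html.parser` module. The class and function names below are hypothetical, and the sketch assumes the page source is already available as a string; it falls back to the URL when no title is found.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":  # tag names arrive lower-cased
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_title(html_text, fallback):
    """Return the page title with whitespace normalised, else `fallback`."""
    parser = TitleParser()
    parser.feed(html_text)
    title = " ".join("".join(parser.title_parts).split())
    return title or fallback


if __name__ == "__main__":
    html = "<html><head><TITLE> Course  List </TITLE></head><body></body></html>"
    print(page_title(html, "/courses.html"))
```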
• Statistical Analysis
Statistical techniques are the most powerful tools for extracting knowledge about visitors to a Web site
(Srivastava, 2000). Analysts may perform different kinds of descriptive statistical analysis on
different variables (such as page views, viewing time and length of a navigational path) when analysing
the session file. The statistical information contained in the periodic Web system report (such as the
most frequently accessed pages, average view time of a page or average length of a path through a site)
can be useful for improving system performance,
enhancing the security of the system, facilitating site modification, and supporting
marketing decisions.
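The three statistics named above can be computed directly from a session file. The sketch below assumes each session is a list of (URL, seconds viewed) pairs; that layout, the function name and the sample data are illustrative assumptions.

```python
from collections import Counter


def describe_sessions(sessions, top=3):
    """Compute simple descriptive statistics over a list of sessions.

    Each session is a list of (url, seconds_viewed) pairs.
    """
    page_views = Counter()
    view_time = Counter()
    for session in sessions:
        for url, seconds in session:
            page_views[url] += 1
            view_time[url] += seconds

    path_lengths = [len(session) for session in sessions]
    return {
        # most frequently accessed pages
        "most_viewed": page_views.most_common(top),
        # average view time per page, in seconds
        "avg_view_time": {u: view_time[u] / page_views[u] for u in page_views},
        # average number of pages per session
        "avg_path_length": sum(path_lengths) / len(sessions) if sessions else 0.0,
    }


if __name__ == "__main__":
    sample = [[("/", 10), ("/a", 30)], [("/", 20)], [("/", 30), ("/a", 10), ("/b", 5)]]
    print(describe_sessions(sample))
```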
• Association Rules
Srivastava (2000) points out that, in the context of Web usage mining, association rules refer to sets
of pages that are accessed together with a support value exceeding some specified threshold. Web
designers can restructure their Web sites efficiently with the help of the presence or absence of such
association rules. When a page is loaded from a remote site, association rules can also be used as a
trigger for prefetching documents to reduce user-perceived latency.
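To make the notion of support concrete, the sketch below counts how often each pair of pages occurs in the same session and keeps the pairs whose support reaches a threshold. It is limited to pairs for brevity and does not compute rule confidence; the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return {(page_a, page_b): support} for pairs meeting `min_support`.

    Support is the fraction of sessions in which both pages were accessed.
    """
    counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each pair once per session
        counts.update(combinations(pages, 2))

    total = len(sessions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


if __name__ == "__main__":
    sample = [["/", "/a", "/b"], ["/", "/a"], ["/", "/c"], ["/a", "/b"]]
    print(frequent_page_pairs(sample, min_support=0.5))
```

A full association-rule miner such as Apriori extends the same counting to larger page sets and derives rules from them.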
• Clustering
Clustering is a technique for grouping together a set of items that have similar characteristics. In the Web
usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page
clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. On
the other hand, clustering of pages discovers groups of pages with related content. This
information is useful for Internet search engines and Web assistance providers (Cooley, 2000).
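As an illustration of usage clustering, the sketch below groups users whose sets of visited pages overlap strongly. It uses a simple leader-style procedure with Jaccard similarity, chosen here only for brevity and not named in the cited work. The threshold, function names and sample data are assumptions.

```python
def jaccard(pages_a, pages_b):
    """Jaccard similarity of two page collections (1.0 if both are empty)."""
    a, b = set(pages_a), set(pages_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_users(visits, threshold=0.5):
    """Group users by the similarity of the pages they visited.

    `visits` maps a user id to the pages that user accessed. Each user joins
    the first cluster whose founding user is similar enough, otherwise the
    user founds a new cluster. Returns a list of lists of user ids.
    """
    clusters = []  # each entry: (founder's page set, [user ids])
    for user, pages in visits.items():
        for founder_pages, members in clusters:
            if jaccard(pages, founder_pages) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((set(pages), [user]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = {"u1": ["/", "/news", "/sport"], "u2": ["/", "/news"],
              "u3": ["/shop", "/cart"], "u4": ["/shop", "/cart", "/pay"]}
    print(cluster_users(sample))
```

The result of this procedure depends on the order in which users are processed, which is one reason more robust clustering algorithms are usually preferred in practice.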
• Classification
Classification is the technique of mapping a data item into one of several predefined classes. In the Web
domain, a Web master will use this technique to establish a profile of users
belonging to a particular class or category. This requires extraction and selection of the features that best
describe the properties of a given class or category. The classification can be done using supervised
inductive learning algorithms such as decision tree classifiers, k-nearest neighbour classifiers, Support
Vector Machines, etc. (Srivastava, 2000).
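Of the algorithms listed, the k-nearest neighbour classifier is the simplest to sketch. In the example below the two features (pages per session and seconds per session) and the class labels "browser" and "buyer" are invented for illustration; they do not come from the cited work.

```python
import math
from collections import Counter


def knn_classify(sample, labelled, k=3):
    """Classify `sample` by majority vote among its k nearest neighbours.

    `labelled` is a list of (feature_tuple, class_label) pairs and distance
    is plain Euclidean distance on the raw features.
    """
    nearest = sorted(labelled, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # (pages per session, seconds per session) -> hypothetical visitor class
    training = [((1, 20), "browser"), ((2, 35), "browser"),
                ((12, 400), "buyer"), ((15, 380), "buyer"), ((10, 300), "buyer")]
    print(knn_classify((11, 350), training))
```

Because the features are not rescaled, the one with the larger numeric range dominates the distance; real use would normally standardise the features first.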
• Path Analysis
Graph models are most commonly used for path analysis. In a graph model, a graph represents some
relation defined on Web pages (or the Web), and each tree of the graph represents a web site. Each node in
a tree represents a web page (HTML document). Edges between trees represent the links between
web sites, while edges between nodes inside the same tree represent links between documents at a web
site.
When path analysis is applied to the site as a whole, this information can offer valuable insights into
navigational behaviour.
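A simplified version of such a graph can be built from the sessions themselves. The sketch below covers a single site only: pages are nodes, each traversed link is a directed edge weighted by how often it was followed, and the most frequent fixed-length paths are counted. Function names and sample data are hypothetical.

```python
from collections import Counter


def transition_counts(sessions):
    """Count how often each (from_page, to_page) link was followed."""
    edges = Counter()
    for session in sessions:
        edges.update(zip(session, session[1:]))  # consecutive page pairs
    return edges


def most_common_paths(sessions, length=3, top=5):
    """Return the `top` most frequent contiguous paths of `length` pages."""
    paths = Counter()
    for session in sessions:
        for start in range(len(session) - length + 1):
            paths[tuple(session[start:start + length])] += 1
    return paths.most_common(top)


if __name__ == "__main__":
    sample = [["/", "/a", "/b"], ["/", "/a", "/c"], ["/", "/a", "/b"]]
    print(transition_counts(sample))
    print(most_common_paths(sample, length=3, top=1))
```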
• Sequential Patterns
This technique finds inter-session patterns, in which the presence of one set of items is followed by
another item in a time-ordered set of sessions or episodes. It is very helpful for the Web
marketer to be able, up to a point, to predict future trends, which can help to place advertisements
aimed at certain user groups. Sequential pattern analysis also includes other types of temporal analysis
such as trend analysis, change point detection, or similarity analysis (Cooley, 2000).
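The simplest inter-session pattern is a pair: page A appears in one session and page B in a later session of the same user. The sketch below counts the fraction of users who show each such pair. It is restricted to pairs for brevity, and the function name, data layout and sample data are assumptions.

```python
from collections import Counter


def sequential_pairs(user_sessions, min_support):
    """Find (earlier_page, later_page) pairs across a user's sessions.

    `user_sessions` maps a user id to that user's sessions in time order.
    Support is the fraction of users for whom `earlier_page` occurs in some
    session and `later_page` occurs in a later session.
    """
    support = Counter()
    for sessions in user_sessions.values():
        seen_before = set()  # pages from all earlier sessions of this user
        found = set()        # pairs observed for this user, counted once
        for session in sessions:
            pages = set(session)
            found.update((a, b) for a in seen_before for b in pages if a != b)
            seen_before |= pages
        support.update(found)

    total = len(user_sessions)
    return {pair: n / total for pair, n in support.items() if n / total >= min_support}


if __name__ == "__main__":
    sample = {"u1": [["/laptops"], ["/bags"]],
              "u2": [["/laptops", "/mice"], ["/bags"]],
              "u3": [["/bags"], ["/laptops"]]}
    print(sequential_pairs(sample, min_support=0.6))
```

In this sample two of the three users view "/laptops" in one session and "/bags" in a later one, which is the kind of pattern a marketer could use when placing advertisements.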
4.3 Pattern Analysis
Pattern analysis is the final stage of the whole Web usage mining process. The goal of this stage is to
eliminate irrelevant (or unwanted) rules or patterns and to extract the interesting rules or patterns from
the output of the pattern discovery process (Wang, 2000). The output of the earlier stages of web usage
mining is often not in a form suitable for web site administrators. The type of information sought in
this respect is “How are
people using the site? Which pages are being accessed most frequently?” This type of question requires
analysis of the structure of hyperlinks as well as the contents of the pages. This can be done with the
help of analysis methodologies and tools. The common techniques used for pattern analysis are
visualisation techniques, OLAP techniques, data and knowledge querying, and usability analysis.
These are discussed in more detail below:
• Visualisation Techniques
Visualisation has been used very successfully in helping people understand various types of phenomena,
both real and abstract. Hence it is a natural choice for understanding the behaviour of web users. Groth
(1999) argues that visualisation is simply the graphical presentation of data.
• OLAP Techniques
On-line Analytical Processing (OLAP) is emerging as a very powerful paradigm for strategic analysis of
databases in business settings. Some of the key characteristics of strategic analysis include: very large
data volume, explicit support for the temporal dimension, support for various kinds of information
aggregation, and long-range analysis in which overall trends are more important than details of
individual data items.
While OLAP can be performed directly on top of relational databases, industry has developed
specialised tools to make it more efficient and effective (Information Advantage, 1997).
• Data and Knowledge Querying
One of the reasons attributed to the great success of relational database technology has been the
existence of a high-level, declarative, query language, which allows an application to express what
conditions must be satisfied by the data it needs, rather than having to specify how to get the required
data.
Given the large number of patterns that may be mined, there appears to be a definite need for a
mechanism to specify the focus of the analysis. First, constraints may be placed on the database to
restrict the portion of the database that is mined. Second, querying may be performed on the
knowledge that has been extracted by the mining process, in which case a language for querying
knowledge rather than data is needed (Mobasher, 1997a).
• Usability Analysis
The first step undertaken in this method is to develop instrumentation methods that collect data about
software usability. This data is then used to build computerised models and simulations that explain the
data. Finally, various data presentation and visualisation techniques are used to help an analyst
understand the phenomenon. This approach can also be used to model the browsing behaviour of users
on the web (Mobasher, 1997b).
However, most of these techniques are disliked by users because of their slow speed, inflexibility,
difficulty of maintenance and limited functionality. A lot of work therefore remains for both researchers
and developers to produce a more efficient, flexible and powerful set of tools for this
task.
5 Conclusion
Many researchers are studying Web usage mining, but few of them have made great progress. This
paper has examined and discussed Web usage mining, and has described and commented on the three phases
of the Web usage mining process in detail. Due to the massive growth of
the e-commerce industry and associated spin-offs, privacy issues have arguably become one of the most
critical concerns between Web users and e-commerce developers; our future work will focus on this
aspect, and will also include using web usage mining data to create adaptive electronic commerce web sites.
References
[1] Accrue Software Inc (2000) Web Mining White Paper: Driving Business Decisions in Web Time,
www.accrue.com.
[2] Cooley, R., Mobasher, B. and Srivastava, J. (1997) Web Mining: Information and Pattern Discovery
on the World Wide Web, http://wwwusers.cs.umn.edu/~mobasher/webminer/survey/survey.html.
[3] Cooley, R. (2000) Web Usage Mining: Discovery and Application of Interesting Patterns from
Web Data, http://citeseer.nj.nec.com/426030.html.
[4] Etzioni, O. (1996) The World Wide Web: Quagmire or Gold Mine, Communications of the ACM,
volume 39, number 11 (November), pp. 65-68.
[5] Galeas, P. (2001) Web Mining, http://www.galeas.de/webmining.html.
[6] Groth, R. (1999) Data Mining: Building Competitive Advantage, Vanessa Moore, USA, p. 47.
[7] Information Advantage (1997) Decision Suite Users Guide: Online Analytical Processing,
http://www-users.cs.york.ac.uk/~kimble/research/ak/vendors.htm.
[8] Kizhakke, V. P. (2000) MIR: A Tool for Visual Presentation of Web Access Behaviour,
http://citeseer.nj.nec.com/cache/papers/cs/20450/kizhakke00mir.pdf.
[9] Madria, S. K., Bhowmick, S. S., Ng, W. K. and Lim, E. P. (1999) Research Issues in Web Data
Mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International
Conference, DaWaK ’99, pp. 303-312.
[10] Mobasher, B. (1997a) Data & Knowledge Querying,
http://www-users.cs.umn.edu/~mobasher/webminer/survey/node21.html.
[11] Mobasher, B. (1997b) Usability Analysis,
http://www-users.cs.umn.edu/~mobasher/webminer/survey/node22.html.
[12] Spiliopoulou, M. (1999) Data Mining for the Web. In Proceedings of Principles of Data Mining
and Knowledge Discovery, Third European Conference, PKDD ’99, pp. 588-589.
[13] Srivastava, J. (2000) Web Usage Mining: Discovery and Applications of Usage Patterns from Web
Data, http://www.acm.org/sigkdd/explorations/issue1-2/srivastava.pdf.
[14] Wagner, H. (2001) Towards an Integrated Approach to Collaborative Web Usage,
http://www.pms.informatik.uni-muenchen.de/lehre/projekt-diplom-arbeit/navigation-track/doc/thesis.shtml.
[15] Wang, Y. (2000) Web Mining and Knowledge Discovery of Usage Patterns,
http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang.pdf.