Chapter - III
Web Data Mining Analysis
3.1 Data Mining
Data mining has an important place in today's world. It has become an
important research area because a huge amount of data is available in most
applications. This data must be processed in order to extract useful
information and knowledge, since neither is explicit in the raw data. Data
mining is the process of discovering interesting knowledge from large
amounts of data1.
Figure 3.1 Phases of Data Mining
The data source for data mining can be a database, a collection of text
files or other types of files containing different kinds of data. Data mining
is an interdisciplinary research field related to database systems, statistics,
machine learning, information retrieval, etc. Data mining is an iterative
process consisting of the following steps:
• data cleaning
• data integration
• data selection
• data transformation
• data mining
• pattern evaluation
• knowledge presentation
The complete data mining process is given in Figure 3.1.
The data cleaning task handles missing and redundant data in the source
file. Real-world data can be incomplete, inconsistent and corrupted. In this
step, missing values are filled in or removed, noisy values are smoothed,
outliers are identified, and each of these deficiencies is handled by a
different technique.
The data integration process combines data from various sources. The
source data may reside in multiple distinct databases with different data
definitions. In this case, data integration merges these multiple data sources
into a single coherent data store.
In the data selection step, the data relevant to the mining task is retrieved
from the data source.
The data transformation process converts the source data into a format
suitable for mining. Data transformation includes basic data management
tasks such as smoothing, aggregation, generalization, normalization and
attribute construction.
In the data mining step, intelligent methods are applied in order to
extract data patterns.
Pattern evaluation is the task of identifying the interesting patterns
among the extracted pattern set.
Knowledge presentation uses visualization techniques to present the
discovered knowledge to the user.
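As a small illustration of the cleaning and transformation steps described
above, the following Python sketch fills missing values by mean imputation
and applies min-max normalization; the field names and the specific rules
are assumptions made for the example, not part of any particular tool.

    # Minimal sketch of data cleaning and transformation on a small record set.
    # The "age"/"income" fields and the imputation/normalization rules are
    # illustrative assumptions only.
    records = [
        {"age": 25, "income": 30000},
        {"age": None, "income": 52000},   # missing value to be filled
        {"age": 40, "income": None},
        {"age": 33, "income": 41000},
    ]

    def clean(rows, field):
        """Fill missing values of `field` with the mean of the observed values."""
        observed = [r[field] for r in rows if r[field] is not None]
        mean = sum(observed) / len(observed)
        for r in rows:
            if r[field] is None:
                r[field] = mean

    def normalize(rows, field):
        """Min-max normalization of `field` into the range [0, 1]."""
        values = [r[field] for r in rows]
        lo, hi = min(values), max(values)
        for r in rows:
            r[field] = (r[field] - lo) / (hi - lo)

    for field in ("age", "income"):
        clean(records, field)
        normalize(records, field)

    print(records)   # cleaned and transformed tuples, ready for the mining step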
Data mining has various application areas, including banking, biology,
e-commerce, etc. These are the most well-known and classical application
areas. Newer data mining applications include processing spatial data,
multimedia data, time-related data and the World Wide Web.
The World Wide Web is one of the largest and most widely known data
sources. Today, the WWW contains billions of documents edited by millions
of people, with a total size measured in many terabytes. These documents are
distributed over millions of computers connected by telephone lines, optical
fibers and radio modems. The WWW is growing at a very fast rate in terms of
traffic, the number of documents and the complexity of web sites. Due to this
trend, the demand for extracting valuable information from this huge data
source increases every day. This leads to a new area called Web Mining2,
which is the application of data mining techniques to the World Wide Web.
3.2 Web Mining
3.2.1 General Overview of Web Mining
With the rapid and explosive growth of information available over the
Internet, the World Wide Web has become a powerful platform to store,
disseminate and retrieve information as well as to mine useful knowledge.
Because Web data is huge, diverse, dynamic and unstructured, Web data
research has encountered many challenges, such as scalability, multimedia
and temporal issues. As a result, Web users are drowning in an "ocean" of
information and face the problem of information overload when interacting
with the web. The following problems are typically mentioned in Web-related
research and applications.
3.2.1.1 Finding relevant information: To find specific information on the
web, users often either browse Web documents directly or use a search
engine as a search assistant. When a user utilizes a search engine to locate
information, the user enters one or several keywords as a query, and the
search engine returns a list of pages ranked by their relevance to the query.
However, there are usually two major concerns associated with query-based
Web search3, namely low precision and low recall. Low precision is caused
by the many irrelevant pages returned by the search engine, while low recall
is due to the inability to index all Web pages available on the Internet, which
makes it difficult to locate unindexed information that is actually relevant.
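As a small numerical illustration of these two measures, the sketch below
computes precision and recall for a hypothetical result list; the page
identifiers are invented for the example.

    # Precision = relevant retrieved / retrieved; recall = relevant retrieved / all relevant.
    retrieved = {"p1", "p2", "p3", "p4", "p5"}   # pages returned by the engine
    relevant = {"p2", "p5", "p9", "p11"}         # pages actually relevant to the query

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)       # 2 / 5 = 0.40 (low precision)
    recall = len(hits) / len(relevant)           # 2 / 4 = 0.50 (low recall)
    print(precision, recall)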
3.2.1.2 Finding needed information: Most search engines work in a
query-triggered way, mainly on the basis of the one or several keywords
entered. Sometimes the results returned by the search engine do not exactly
match what a user really needs, owing to the ambiguity of words. For
example, when a user with an information technology background wishes to
search for information on the "Python" programming language and enters
only the single word "python" as the query, the user might be presented with
information about the python snake rather than the programming language.
In other words, the semantics of Web data4 is rarely taken into account in the
context of Web search.
3.2.1.3 Learning useful knowledge: With traditional Web search services,
query results relevant to the query input are returned to Web users as a
ranked list of pages. In some cases, users are interested not only in browsing
the returned collection of Web pages, but also in extracting potentially useful
knowledge from them (data mining oriented). More interestingly, a growing
number of studies5-7 have recently examined how to utilize the Web as a
knowledge base for decision making or knowledge discovery.
3.2.1.4 Recommendation/personalization of information: When users
interact with the web, their navigational preferences differ widely, so they
need different contents and presentations of information. Thus, to improve
the Internet service quality and increase the user click rate on a specific
website, it is necessary for a Web developer or designer to know what the
user really wants to do, to predict which pages the user is potentially
interested in, and to present customized Web pages to the user by learning
user navigational pattern knowledge4, 8, 9.
The above problems place the existing search engines and other Web
applications under significant stress. A variety of efforts have been made to
deal with these difficulties by developing advanced computational
intelligence techniques and algorithms from different research domains, such
as databases, data mining, machine learning, information retrieval and
knowledge management.
The challenges listed above lead to research on the effective discovery
and use of resources on the World Wide Web, and this whole scheme results
in a new research area called Web Mining. Indeed, there is no major
difference between data mining and web mining. Web mining can be defined
as the application of data mining techniques to extract knowledge from web
data, including web documents, hyperlinks between documents, usage logs
of websites, etc.10. There are two different approaches to defining Web
mining. The first is a 'process-centric view', which defines Web mining as a
sequence of ordered tasks11. The second is a 'data-centric view', which
defines web mining with respect to the types of web data used in the mining
process12. The data-centric definition has become more widely accepted3, 13, 14.
Web mining can therefore be classified with respect to the data it uses. The
Web involves three types of data13: the actual content on the WWW, the web
log data obtained from the users who browsed the web pages, and the web
structure data. Thus, web mining focuses on three important dimensions: web
structure mining, web content mining and web usage mining. A detailed
overview of the web mining categories is given in Section 3.3.
3.2.2 Types of Web Data
The World Wide Web contains various information sources in different
formats. As stated above, the World Wide Web involves three types of data;
the categorization is given in Figure 3.2.
Web content data is the data that web pages are designed to present to
users. It consists of free text, semi-structured data such as HTML pages, and
more structured data such as automatically generated HTML pages, XML
files or data in tables related to web content. Textual, image, audio and video
data types fall into this category.
The most common web content data on the web is HTML pages.
HTML (Hypertext Markup Language) is designed to specify the logical
organization of documents, with hypertext extensions. HTML was first
implemented by Tim Berners-Lee at CERN and became popular through the
Mosaic browser developed at NCSA; in the 1990s it became widespread with
the growth of the Web. Since then, HTML has been extended in various
ways. The WWW depends on web page authors and vendors sharing the
same HTML conventions. Different browsers may render an HTML
document in different ways; to illustrate, one browser may indent the
beginning of a paragraph, while another may only leave a blank line.
However, the base structure remains the same and the organization of the
document is constant. HTML instructions divide the text of a web page into
blocks called elements. HTML elements fall into two categories: those that
define how the body of the document is to be displayed by the browser, and
those that define information about the document, such as its title or its
relationships to other documents.
Figure 3.2 Types of Web Data
Another common type of web content data is XML documents. XML is a
markup language for documents containing structured information.
Structured information contains both the content and an indication of what
that content includes and stands for. Almost all documents have some
structure. XML has been accepted as a markup language, that is, a
mechanism for identifying structures in a document. The XML specification
defines a standard way to add markup to documents. XML does not specify
semantics or a tag set; in fact, it is a meta-language for describing markup. It
provides mechanisms for defining tags and the structural relationships
between them. The semantics of an XML document are defined either by the
applications that process it or by style sheets.
Dynamic server pages are also an important part of web content data.
Dynamic content is any web content that is processed or compiled by the
web server before the result is sent to the web browser, whereas static
content is sent to the browser without modification. Common forms of
dynamic content are Active Server Pages (ASP), PHP Hypertext
Preprocessor (PHP) pages and Java Server Pages (JSP). Today, several web
servers support more than one type of dynamic server page.
The size of the web graph varies from one domain to another. An
example graph of a particular web domain is given in Figure 3.3.
Figure 3.3 An Example Web Graph for a Particular Web Domain
The edges of the web graph have the following semantics: outgoing arcs
stand for the hypertext links contained in the corresponding page, and
incoming arcs represent the hypertext links through which the corresponding
page is reached. The web graph is used in applications such as web indexing,
detection of web communities and web searching. The whole web graph
grows at an amazing rate. More specifically, in January 2010 it was estimated
that the whole web graph consisted of about 10.2 billion nodes15 and
150 billion edges, since the average node has roughly seven hypertext links
(directed edges) to other pages16. In addition, approximately 20.3 million
nodes are added every day and many nodes are modified or removed, so that
the Web graph may now contain more than 10 billion nodes and about
100 billion edges in all.
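Such a web graph is naturally represented as a directed adjacency list
mapping each page to the pages it links to. The sketch below builds a tiny
graph of this kind and reports the node count, edge count and average
out-degree (the quantity the seven-links-per-page estimate above refers to);
the page names are invented for the illustration.

    # Directed web graph as an adjacency list: page -> list of outgoing links.
    # A real graph of this kind would be built by a crawler.
    web_graph = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": ["a.html", "d.html"],
        "d.html": [],
    }

    nodes = len(web_graph)
    edges = sum(len(out) for out in web_graph.values())
    avg_out_degree = edges / nodes

    # Incoming links are obtained by inverting the outgoing lists.
    in_links = {page: [] for page in web_graph}
    for src, targets in web_graph.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)

    print(nodes, edges, avg_out_degree)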
Web usage data includes web log data from web server access logs,
proxy server logs, browser logs, registration data, cookies and any other data
generated as a result of web users' interactions with web servers. Web log
data is created on the web server. Every Web server has a unique IP address
and a domain name. When a user enters a URL in a browser, the request is
sent to the corresponding web server, which fetches the page and sends it to
the user's browser. Web server data is thus created from the interaction
between a web user and a web site. A web server log, containing Web server
data, is created as a result of the http processes that run on Web servers17.
All types of server activity, such as successes, errors and lack of response,
are logged into a server log file18. Web servers dynamically produce and
update four types of "usage" log files: the access log, agent log, error log and
referrer log.
Web access logs have fields containing web server data, including the
date, time, user's IP address, user action, request method and requested data.
Error logs contain data about specific events such as "file not found",
"document contains no data" or configuration errors, providing the server
administrator with information on problematic and erroneous links on the
server. Aborted transmissions are also recorded in the error log. Agent logs
provide data about the browser, browser version and operating system of the
requesting user.
Generally, Web server logs are stored in the Common Logfile Format
(CLF) or the Extended Logfile Format (ELF). The Common Logfile Format
includes the date, time, client IP, remote log name of the user, bytes
transferred, server name, requested URL and the HTTP status code returned.
The Extended Logfile Format includes bytes sent and received, server name,
IP address, port, request query, requested service name, time elapsed for the
transaction to complete, version of the transfer protocol used, user agent (the
browser program making the request), cookie ID and referrer. Web server
logging tools, also known as Web traffic analyzers, analyze the log files of a
Web server and produce reports from this data source. These data can be
used in planning and optimizing the web site structure.
3.3 Web Mining Categories
Web Mining can be broadly divided into three categories according to the
kind of data to be mined10:
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
Figure 3.4 Taxonomy of Web Mining
Web content mining is the task of extracting knowledge from the
content of documents on the World Wide Web, such as mining the content of
HTML files. Web document text mining, resource discovery based on
concept indexing, and agent-based technology fall into this category.
Web structure mining is the process of extracting knowledge from
the link structure of the World Wide Web.
Web usage mining, also known as Web Log Mining, is the process of
discovering interesting patterns from web access logs on servers. The
Taxonomy of Web Mining is given in Figure 3.4.
3.3.1 Web Content Mining
Web Content Mining is the process of extracting useful information
from the contents of Web documents. Content data is the collection of
information designed to be conveyed to the users. It may consist of text,
images, audio, video, or structured records such as lists and tables. Text
mining and its application to Web content has been the most widely studied
forms of web content mining. Some of the issues including the text mining
are; topic discovery, extracting association patterns, clustering of web
documents and classification of Web Pages. These fields also involve using
techniques from other disciplines such as Information Retrieval (IR) and
Natural Language Processing (NLP). There is also significant body of work
exist for discovering knowledge from images in the fields of image
processing and computer vision. The application of these techniques to Web
content mining has not been very effective yet.
Unstructured documents such as pure text files also fall under web
content mining. Unstructured documents are free texts on the WWW, such as
newspaper pages. Most research in this area uses a bag-of-words
representation of unstructured documents19. Other research in web content
mining includes Latent Semantic Indexing (LSI)20, which transforms the
original document vectors into a lower-dimensional space by analysing the
structure of terms in the document collection. Another important information
resource used in web content mining is the position of words in the
document21, 22, and several studies address the document categorization
problem from this angle. The use of text compression is another new
research direction for the text classification task. Web content mining
applications range from text classification or categorization to finding
patterns or extracting rules. Topic detection and tracking are also research
areas related to web content mining. Text mining methods, together with the
document representations they use, are given in Table 3.1 below3.
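As a small illustration of the bag-of-words representation with TFIDF
weighting listed in the table, the sketch below computes term weights for a
toy document collection; the documents and the simple whitespace tokenizer
are assumptions made for the example.

    import math

    # Toy document collection; real web content mining would strip HTML first.
    docs = [
        "web mining applies data mining to web data",
        "text mining represents documents as a bag of words",
        "hyperlink structure is analysed in web structure mining",
    ]

    tokenized = [d.split() for d in docs]                  # bag-of-words tokens
    vocab = sorted({w for doc in tokenized for w in doc})

    def tfidf(doc_tokens):
        """TFIDF vector for one document over the shared vocabulary."""
        vec = {}
        for w in vocab:
            tf = doc_tokens.count(w) / len(doc_tokens)
            df = sum(1 for d in tokenized if w in d)
            idf = math.log(len(tokenized) / df)
            vec[w] = tf * idf
        return vec

    vectors = [tfidf(d) for d in tokenized]
    print(vectors[0]["web"], vectors[1]["bag"])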
Table 3.1 Text Mining Methods Included in Web Content Mining

Document representations: Bag of Words; Bag of Words with n-grams;
Word positions; Relational representations; Phrases; Concept categories;
Terms; Hyponyms and synonyms; Sentences and clauses; Named entities.

Methods: Episode Rules; TFIDF; Naive Bayes; Bayes Nets; Support Vector
Machines; Hidden Markov Models; Maximum Entropy; Rule-Based
Systems; Boosted Decision Trees; Neural Networks; Logistic Regression;
Clustering Algorithms; K-Nearest Neighbour; Decision Trees;
Self-Organizing Maps; Unsupervised Hierarchical Clustering; Statistical
Analysis; Propositional Rule-Based Systems; Relative Entropy; Association
Rules; Rule Learning; Text Compression.
3.3.2 Web Structure Mining
As stated above, the web graph is composed of web pages as nodes and
hyperlinks as edges, each edge representing the connection between two web
pages. Web structure mining can be defined as the task of discovering
structure information from the web. Its aim is to produce structural
information about a web site and its web pages. Unlike web content mining,
which mainly concentrates on the information within a single document, web
structure mining tries to discover the link structure of the hyperlinks between
documents. By using the topology of the hyperlinks, web structure mining
can classify Web pages and produce results such as the similarity and
relationships between different Web sites.
Web structure mining can be classified into two categories based on the
type of structural data used: link information and document structure. Given
a collection of web pages and their topology, interesting facts related to page
connectivity can be discovered. There has been a detailed study of inter-page
relations and hyperlink analysis23, which provides an up-to-date survey. In
addition, web document content can also be represented in a tree-structured
format, based on the different HTML and XML tags within the page; recent
studies24, 25 have focused on automatically extracting document object model
(DOM) structures from documents.
Interesting facts describing the connectivity within a subset of the Web
can be discovered from a given collection of connected web documents. The
structure information obtained from Web structure mining includes the
following:
• the frequency of local links in the web tuples of a web table
• the frequency of web tuples in a web table containing links within the
same document
• the frequency of web tuples in a web table containing global links, that
is, links pointing towards different web sites
• the frequency of identical web tuples that appear in the web tables.
In general, if a web page is connected to another web page by hyperlinks,
or the web pages are neighbours, the relationship between these web pages
needs to be discovered. The relations between web pages can be categorized
with respect to different properties: they may be related by synonyms or
ontology, they may have similar contents, they may reside on the same
server, or the same person may have created them. Another task of web
structure mining is to discover the nature of the hierarchy or network of
hyperlinks in the web sites of a particular domain. This may help to
generalize the flow of information in the Web sites representing a particular
domain, so that query processing can be performed more easily and
efficiently. Web structure mining has a strong relation to web content
mining, because Web documents contain links and both use the real or
primary data on the Web. These two mining areas are often applied to the
same task.
Web structure data describes the organization of the content. Intra-page
structure information covers the arrangement of the various HTML or XML
tags within a given page. Inter-page structure information consists of the
hyperlinks connecting one page to another. The web graph is constructed
from the hyperlink information of web pages and has been widely adopted as
the core description of the web structure; it is the most widely accepted way
of representing web structure related to web page connectivity (dynamic and
static links).
The Web graph is a representation of the WWW at a given time26. It
stores the link structure and connectivity between the HTML documents on
the WWW. Each node in the graph corresponds to a unique web page or
document, and an edge represents an HTML link from one page to another.
The general properties of web graphs are given below:
• Directed, very large and sparse.
• Highly dynamic
– Nodes and edges are added/deleted very often
– Content of existing nodes is also subject to change
– Pages and hyperlinks created on the fly
• Apart from the primary connected component, there are also smaller
disconnected components
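One well-known technique that exploits this hyperlink topology is the
PageRank algorithm28, 29. The following is a minimal power-iteration sketch
over a small adjacency-list web graph; the graph is invented and the code is
an illustration only, not the implementation used by any production engine.

    def pagerank(graph, damping=0.85, iterations=50):
        """Simple PageRank by power iteration over an adjacency-list web graph."""
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for src, targets in graph.items():
                if targets:
                    share = damping * rank[src] / len(targets)
                    for dst in targets:
                        new_rank[dst] += share
                else:
                    # Dangling node: spread its rank evenly over all pages.
                    for p in pages:
                        new_rank[p] += damping * rank[src] / n
            rank = new_rank
        return rank

    web_graph = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": ["a.html"],
        "d.html": ["c.html"],
    }
    print(pagerank(web_graph))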
3.3.3 Web Usage Mining
Web Usage Mining is the process of applying data mining techniques
to discover interesting patterns from Web usage data. Web usage mining
provides better understanding for serving the needs of Web-based
applications27.
The quality of the patterns discovered in the web usage mining process
depends strongly on the quality of the data used in the mining process. Web
usage data contains information about the Internet addresses of web users
together with their navigational behaviour. The basic information sources for
web usage mining can be classified into three categories:
Application Server Data: Application server software such as WebLogic,
BroadVision and StoryServer, used for e-commerce applications, has
important features built into its structure that allow many e-commerce
applications to be built on top of it. One of the most important features of
application servers is their ability to keep track of several types of business
transactions and record them in application server logs.
Application Level Data: At the application level, the number of event
types increases as one moves to the upper layers. Application-level data can
be logged in order to generate histories of specially defined events. This type
of data is classified into three categories based on the source of the
information: server-side, client-side and proxy-side data. Server-side data
gives information about the behaviour of all users, whereas client-side data
gives information about the user of that particular client. Proxy-side data lies
somewhere between the client-side and server-side data.
Web Server Data: This is the most commonly used data type in web usage
mining applications. It is the data obtained from the user logs kept by a web
server. The basic information source in most web usage mining applications
is the access log file on the server side. When a user agent (e.g., Internet
Explorer, Mozilla, Netscape) requests a URL in a domain, the information
related to that operation is recorded in an access log file.
The access log file on the server side contains log information for each
user that opened a session. These logs include the list of items that a user
agent has accessed. The log format of the file is the Common Log Format
(CLF), which defines a specific record layout. These records have seven
common fields:
1. User's IP address
2. Access date and time
3. Request method (GET or POST)
4. URL of the page accessed
5. Transfer protocol (HTTP/1.0, HTTP/1.1)
6. Return code indicating success or failure
7. Number of bytes transmitted
The information in this record can be used to recover session
information. The attributes that are needed to obtain session information in
these tuples are:
1. User’s IP address
2. Access date and time
3. URL of the page accessed
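Using just these three attributes, sessions can be approximated by grouping
the requests coming from the same IP address and starting a new session
after a period of inactivity. In the sketch below, the 30-minute timeout and
the example records are common but assumed choices, not a fixed standard.

    from datetime import datetime, timedelta

    # Each tuple: (IP address, access timestamp, requested URL), i.e. the three
    # attributes listed above. The records are illustrative.
    requests = [
        ("10.0.0.1", "2011-03-01 10:00:05", "/index.html"),
        ("10.0.0.1", "2011-03-01 10:01:10", "/products.html"),
        ("10.0.0.2", "2011-03-01 10:02:00", "/index.html"),
        ("10.0.0.1", "2011-03-01 11:15:00", "/index.html"),   # gap > 30 min: new session
    ]

    TIMEOUT = timedelta(minutes=30)

    def sessionize(requests):
        """Group requests into sessions per IP, splitting on inactivity gaps."""
        sessions, last_seen, current = [], {}, {}
        for ip, ts, url in sorted(requests, key=lambda r: r[1]):
            t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
            if ip in last_seen and t - last_seen[ip] > TIMEOUT:
                sessions.append(current.pop(ip))          # close the expired session
            current.setdefault(ip, []).append(url)
            last_seen[ip] = t
        sessions.extend(current.values())                  # close the remaining sessions
        return sessions

    print(sessionize(requests))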
Possible application areas of web usage mining are prediction of the
user's behaviour within the site, comparison between expected and actual
Web site usage, and reconstruction of the web site structure based on the
interests of its users.
Much research has been done in the fields of databases, information
retrieval, intelligent agents and topology, which provides the basic
foundation for web content mining and web structure mining. Web usage
mining, however, is a relatively new research area and has received more and
more attention in recent years. After this general introduction to web usage
mining, the phases of web usage mining are given in the next section.
3.4 Characteristics of Web Data
Data on the web has its own distinctive features compared to the data in
conventional database management systems. Web data usually exhibits the
following characteristics:
 The data on the Web is huge in amount. Currently, it is hard to
estimate the exact data volume available on the Internet due to the
exponential growth of Web data every day. For example, in 2009,
one of the first Web search engines, Google, had an index of 14
billion Web pages and Web-accessible documents. As of
November 2010, the top search engines claimed to index from 10
billion to 50 billion Web documents. The enormous volume of data
on the Web makes it difficult to handle Web data via traditional
database techniques.
 The data on the Web is distributed and heterogeneous. Because the
Web is essentially an interconnection of various nodes over the
Internet, Web data is usually distributed across a wide range of
computers or servers located at different places around the world.
 Web data often exhibits an intrinsically multimedia nature: in
addition to textual information, which mostly expresses content as
text messages, many other types of Web data, such as images,
audio files and video clips, are often included in a Web page. This
requires Web data processing techniques that are able to deal with
the heterogeneity of multimedia data.
 The data on the Web is unstructured. There are, so far, no rigid and
uniform data structures or schemas that Web pages must strictly
follow, as is common in conventional database management.
Instead, Web designers can arbitrarily organize related information
on the Web in their own ways, as long as the information
arrangement meets the basic layout requirements of Web
documents, such as the HTML format. Although Web pages in
well-defined HTML format may contain some preliminary
structure, e.g. tags or anchors, these structural components
primarily benefit the presentation quality of Web documents rather
than revealing the semantics contained in them. As a result, there is
an increasing requirement to better deal with the unstructured
nature of Web documents and to extract the mutual relationships
hidden in Web data, in order to help users locate the needed Web
information or service.
 The data on the Web is dynamic. The implicit and explicit structure
of Web data is updated frequently. In particular, owing to the
different applications of web-based data management systems, a
variety of presentations of Web documents are generated as the
contents of the underlying databases are updated, and dangling
links and relocation problems arise when domain or file names
change or disappear. This feature leads to frequent schema
modifications of Web documents, from which traditional
information retrieval often suffers.
The aforementioned features indicate that Web data is a specific type
of data, different from the data residing in traditional database systems. As a
result, there is an increasing demand to develop more advanced techniques to
address Web information search and data management. According to their
aims and purposes, these studies and developments mainly concern two
aspects of Web data management: how to accurately find the needed
information on the Internet, i.e. Web information search, and how to
efficiently and fully manage and utilize the information and knowledge
available from the Internet, i.e. Web data/knowledge management. In
particular, with the recent popularity and development of the Internet, such
as the semantic web, Web 3.0 and so on, more and more advanced Web-data-
based services and applications are emerging that allow Web users to easily
locate the needed information and to share information efficiently in a
collaborative environment.
3.5 Web Data Search
Web search engine technology28, 29 has emerged to cater for the rapid
growth and exponential flux of Web data on the Internet and to help Web
users find desired information; it has resulted in various commercial Web
search engines available online, such as Yahoo!, Google, AltaVista,
TamilHunt and so on. Search engines can be categorized into two types:
general-purpose search engines and specific-purpose search engines.
General-purpose search engines, for example the well-known Google search
engine, aim to retrieve for Web users as many Web pages relevant to the
query as possible. The returned Web pages are ranked according to their
relevance weights with respect to the query, and user satisfaction with the
search results depends on how quickly and how accurately users can find the
desired information.
The specific-purpose search engines, on the other hand, aim at
searching Web pages for a specific task or an identified community. For
example, Google Scholar and DBLP are two representatives of the
specific-purpose search engines.
The former is a search engine for searching academic papers or books,
as well as their citation information, across different disciplines, while the
latter is designed for a specific research community, namely computer
science; it provides various information regarding conferences and journals
in the computer science domain, such as conference homepages and the
abstracts or full text of papers published in computer science journals or
conference proceedings.
DBLP has become a helpful and practical tool for researchers and
engineers in the computer science area to find the needed literature easily or
to assess the track record of a researcher conveniently. Whichever type a
search engine belongs to, it owns a background text database, which is
indexed by a set of keywords extracted from the collected documents.
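Such a background database is usually organized as an inverted index that
maps each keyword to the documents containing it. The following is a
minimal sketch of this idea; the documents are invented, and real engines add
ranking models, stemming and many other refinements.

    from collections import defaultdict

    # Tiny document collection standing in for the engine's background database.
    documents = {
        "doc1": "python programming language tutorial",
        "doc2": "python snake species information",
        "doc3": "java and python programming comparison",
    }

    # Build the inverted index: keyword -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)

    def search(query):
        """Return documents containing every query keyword, ranked by term count."""
        terms = query.split()
        matches = [d for d in documents if all(d in index[t] for t in terms)]
        return sorted(matches,
                      key=lambda d: sum(documents[d].split().count(t) for t in terms),
                      reverse=True)

    print(search("python programming"))   # ['doc1', 'doc3']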
To achieve a higher recall and accuracy rate of search, Web search
engines are required to provide an efficient and effective mechanism to
collect and manage the Web data, together with the capability to match the
user query against the background indexing database quickly and to rank the
returned Web contents efficiently, so that the Web user can locate the desired
Web pages in a short time or by clicking only a few hyperlinks. To achieve
these aims, a variety of algorithms and strategies are involved in handling the
above-mentioned tasks28-34, which has led to a hot and popular topic in the
context of web-based research, i.e. Web data management.
3.6 Web Log Data Collection
Data gathered from web servers is placed into special files called logs
and can be used for web usage mining. Usually this data is called web log
data, as all visitor activities are logged into this file3. In real life, web log
files are a huge source of information. For example, the web log file
generated by one online information site grows to 30-40 MB in a month,
while another advertising company collects a 6 MB file during a single day.
There are many commercial web log analysis tools (Angoss,
Clementine, MINEit, NetGenesis)35-38. Most of them focus on statistical
information such as the largest number of users per time period, the business
type of the users visiting the web site (.edu, .com) or their geographical
location (.in, .uk, .sg), and the popularity of pages calculated from the
number of times they have been visited. However, statistics that do not
describe the relationships between visited pages leave much valuable
information undiscovered38, 39.
This lack of analytic depth has stimulated the web log research area to
expand into an individual research field that is beneficial and vital to
e-business. The main goals that can be achieved by mining web log data are
the following:
 Web log examination makes it possible to restructure the web site
so that clients can access the desired pages with minimum delay.
The problem of identifying usable structures on the WWW is
related to understanding what facilities are available for dealing
with this problem and how to utilize them41.
 Web log inspection allows navigation to be improved. This can
manifest itself in organizing important information into the right
places, managing links to other pages in the correct sequence, and
pre-loading frequently used pages.
 More advertisement capital can be attracted by placing adverts on
the most frequently accessed pages.
 Interesting patterns of customer behaviour can be identified. For
example, valuable information can be gained by discovering the
most popular paths to specific web pages and the paths users take
upon leaving these pages. These findings can be used effectively
for redesigning the web site to better channel users to specific web
pages.
 Non-customers can be turned into customers, increasing profit42.
Analysis should cover both groups, customers and non-customers,
in order to identify characteristic patterns. Such findings help to
review customers' habits and help site maintainers to incorporate
the observed patterns into the site architecture, thereby assisting in
turning non-customers into customers.
 Empirical findings43 show that people tend to revisit pages just
visited and access only a few pages frequently. Humans browse in
small clusters of related pages and generate only short sequences of
repeated URLs. This shows that there is no need to increase the
number of information pages on the web site; it is more important
to concentrate on the efficiency of the material placed there and the
accessibility of these clusters of pages.
General benefits obtained from analysing Web logs include allocating
resources more efficiently, finding new growth opportunities, improving
marketing campaigns, planning new products, increasing customer retention,
discovering cross-selling opportunities and better forecasting.
3.7 The Common Log Format
Various web servers generate differently formatted logs: CERF Net,
Cisco PIX, Gauntlet, IIS Standard/Extended, NCSA Common/Combined,
Netscape Flexible, Open Market Extended, Raptor Eagle. Nevertheless, the
most common log format is the Common Log Format (CLF), which appears
exactly as follows (see Figure 3.5):
Figure 3.5. Example of the Common Log Format: IP address, authentication
(rfcname and logname), date, time, GMT zone, request method, page name,
HTTP version, status of the page retrieval process and number of bytes
transferred.
host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD
/PATH HTTP/1.0" code bytes
 host/ip - the visitor's hostname or IP address.
 rfcname - the user's authentication, obtained by looking up the
specific TCP/IP connection and returning the user name of the process
owning the connection. If no value is present, a "-" is recorded.
 logname - when local authentication and registration are used, the
user's log name appears here; otherwise a "-" is recorded.
 DD/MMM/YYYY:HH:MM:SS -0000 - this part gives the date,
consisting of the day (DD), month (MMM) and year (YYYY), and the
time stamp, defined by hours (HH), minutes (MM) and seconds (SS).
Since web sites can be retrieved at any time of day and the server logs
the user's local time, the last field gives the offset from Greenwich
Mean Time (for example, Indian Standard Time is +05:30).
 method - the methods found in log files are PUT, GET, POST and
HEAD44. PUT allows a user to transfer or send a file to the web
server; by default, PUT is used by web site maintainers having
administrator privileges, for example to upload files through a given
form on the web, and access for purposes other than site maintenance
is forbidden. GET transfers the whole content of the web document to
the user. POST sends information to the web server stating that a new
object is created and linked; the content of the new object is enclosed
as the body of the request, and POST data usually serves as input to
Common Gateway Interface (CGI) programs. HEAD returns only the
header of the "page body" and is usually used to check the
availability of a page.
 path - the path and files retrieved from the web server.
 HTTP/1.0 - the version of the protocol used by the user to retrieve
information from the web server.
 code - identifies the success status. For example, 200 means that the
file was retrieved successfully, 404 that the file was not found, 304
that the file was reloaded from the cache, and 204 that the upload
completed normally.
 bytes - the number of bytes transferred from the web server to the
requesting machine.
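A minimal sketch of parsing one CLF record according to the field layout
just described is given below; the regular expression and the example line
are illustrative and would need adjusting to a particular server configuration.

    import re

    # One CLF record per line: host rfcname logname [date] "request" code bytes
    CLF_PATTERN = re.compile(
        r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
        r'(?P<code>\d{3}) (?P<bytes>\d+|-)'
    )

    line = '192.168.1.10 - - [12/Mar/2011:14:02:31 +0530] "GET /index.html HTTP/1.0" 200 2048'

    match = CLF_PATTERN.match(line)
    if match:
        record = match.groupdict()
        print(record["host"], record["path"], record["code"])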
Figure 3.6. CLF followed by additional data fields: the web page the visitor
came from (referrer page), browser type and cookie information
It is possible to adjust the web server's options to collect additional
information such as REFERER_URL, HTTP_USER_AGENT and
HTTP_COOKIE (Fleishman 1996). REFERER_URL records the URL from
which the visitor came. HTTP_USER_AGENT identifies the browser version
the visitor uses. The HTTP_COOKIE variable is a persistent token which
serves as the visitor's identification number across browsing sessions. The
CLF then takes the form depicted in Figure 3.6.
3.8 Web Log Data Pre-Processing Steps
The web log data pre-processing step is a complex process. It can take up
to 80% of the total KDD time45 and consists of the stages presented in
Figure 3.7. The aim of data pre-processing is to select the essential features,
clean the data of irrelevant records and finally transform the raw data into
sessions. The last step is unique, since session creation applies only to web
log datasets and involves additional work caused by the user identification
problem and the various unsettled views on how sessions should be
identified.
All these stages will be analysed in more detail in order to understand
why pre-processing plays such an important role in the KDD process when
mining complex web log data.
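As a small illustration of the cleaning stage, the sketch below drops the
records usually considered irrelevant for usage mining, such as requests for
embedded images and requests made by known robots; the suffix list and the
user-agent markers are common heuristics, assumed here for the example.

    # Keep only page requests made by (apparently) human visitors.
    IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
    ROBOT_MARKERS = ("bot", "crawler", "spider")

    def clean_log(records):
        """Filter out embedded-resource requests and requests from crawlers."""
        cleaned = []
        for rec in records:
            if rec["path"].lower().endswith(IRRELEVANT_SUFFIXES):
                continue                                    # embedded image, style sheet, script
            if any(marker in rec["agent"].lower() for marker in ROBOT_MARKERS):
                continue                                    # request issued by a robot
            cleaned.append(rec)
        return cleaned

    records = [
        {"path": "/index.html", "agent": "Mozilla/4.0"},
        {"path": "/logo.gif", "agent": "Mozilla/4.0"},      # removed: embedded image
        {"path": "/index.html", "agent": "Googlebot/2.1"},  # removed: robot request
    ]
    print(clean_log(records))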
Figure 3.7. Pre-processing web log data is one of the most complex parts of
the KDD process.
3.9 Web Personalization
Web personalization is the process of customizing a Web site to the
needs of specific users, taking advantage of the knowledge acquired from the
analysis of the user's navigational behavior (usage data) in correlation with
other information collected in the Web context, namely, structure, content,
and user profile data. Due to the explosive growth of the Web, the domain of
Web personalization has gained great momentum both in the research and
commercial areas.
The steps of a Web personalization process include: (a) the collection of
Web data, (b) the modelling and categorization of these data (the
pre-processing phase), (c) the analysis of the collected data and (d) the
determination of the actions that should be performed. The methods
employed to analyse the collected data include content-based filtering,
collaborative filtering, rule-based filtering and Web usage mining. The site is
personalized through the highlighting of existing hyperlinks, the dynamic
insertion of new hyperlinks that seem to be of interest to the current user, or
even the creation of new index pages.
Content-based filtering systems are solely based on individual users’
preferences. The system tracks each user’s behaviour and recommends them
items that are similar to items the user liked in the past.
Collaborative filtering systems invite users to rate objects or divulge
their preferences and interests and then return information that is predicted
to be of interest to them. This is based on the assumption that users with
similar behaviour (for example, users who rate similar objects) have
analogous interests.
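A minimal sketch of this idea is given below: the user whose ratings are most
similar to the current user's is found, and the items that user rated but the
current user has not yet seen are recommended. The ratings and the crude
similarity measure are invented for the illustration.

    # User -> {item: rating}; a real system would have far larger, sparser data.
    ratings = {
        "alice": {"page_a": 5, "page_b": 3, "page_c": 4},
        "bob":   {"page_a": 4, "page_b": 3, "page_d": 5},
        "carol": {"page_b": 1, "page_c": 2, "page_d": 4},
    }

    def similarity(u, v):
        """Crude similarity: inverse of the mean absolute rating difference on shared items."""
        shared = set(ratings[u]) & set(ratings[v])
        if not shared:
            return 0.0
        diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)
        return 1.0 / (1.0 + diff)

    def recommend(user):
        """Suggest items rated by the most similar other user but unseen by `user`."""
        others = [u for u in ratings if u != user]
        nearest = max(others, key=lambda v: similarity(user, v))
        return [item for item in ratings[nearest] if item not in ratings[user]]

    print(recommend("alice"))   # ['page_d'], since bob is the nearest neighbour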
In rule-based filtering the users are asked to answer a set of questions.
These questions are derived from a decision tree, so that as the user proceeds
in answering them, what he or she finally receives as a result (for example, a
list of products) is tailored to his or her needs. Content-based, rule-based and
collaborative filtering may also be used in combination to deduce more
accurate conclusions.
In this thesis the main focus is on Web usage mining. This process
relies on the application of statistical and data mining methods to Web log
data, resulting in a set of useful patterns that indicate users' navigational
behaviour. The data mining methods employed are association rule mining,
sequential pattern discovery, clustering and classification. This knowledge is
then used by the system in order to personalize the site according to each
user's behaviour and profile. The block diagram in Figure 3.8 represents the
functional architecture of a Web personalization system in terms of the
modules and data sources described earlier.
The content management module processes the Web site’s content and
classifies it in conceptual categories. The Web site’s content can be
enhanced with additional information acquired from other Web sources,
using advanced search techniques. Given the site map structure and the
usage logs, a Web usage miner provides results regarding usage patterns,
user behaviour, session and user clusters, click-stream information etc.
Additional information about the individual users can be obtained from the
user profiles.
Figure 3.8 Modules of a Web personalization system
Moreover, any information extracted from the Web usage mining
process concerning each user’s navigational behaviour can then be added to
his/her profile. All this information about nodes, links, Web content, typical
behaviours and patterns is conceptually abstracted and classified into
semantic categories. Any information extracted from the interrelation
between knowledge acquired using usage mining techniques and knowledge
acquired from content management will then provide the framework for
evaluating possible alternatives for restructuring the site. A publishing
mechanism will perform the site modification, ensuring that each user
navigates through the optimal site structure. The available content options
for each user will be ranked according to user's interests.
3.10 User Profiling
In order to personalize a Web site, the system should be able to
distinguish between different users or groups of users. This process is called
user profiling, and its objective is the creation of an information base that
contains the preferences, characteristics and activities of the users. In the
Web domain, and especially in e-commerce, user profiling has developed
significantly, since Internet technologies provide easier means of collecting
information about the users of a Web site, who in the case of e-business sites
are potential customers. A user profile can be either static, when the
information it contains is never or rarely altered (e.g. demographic
information), or dynamic, when the user profile's data change frequently.
Such information is obtained either explicitly, using on-line registration
forms and questionnaires that result in static user profiles, or implicitly, by
recording the navigational behaviour and/or the preferences of each user,
resulting in dynamic user profiles.
A way of uniquely identifying a visitor throughout a session is by using
cookies. A cookie is defined as "the data sent by a Web server to a Web
client, stored locally by the client and sent back to the server on subsequent
requests." In other words, a cookie is simply an HTTP header that consists of
a text-only string, which is inserted into the memory of a browser. It is used
to uniquely identify a user during Web interactions within a site and contains
data parameters that allow the remote HTML server to keep a record of the
user's identity and the actions he or she takes at the remote Web site. The
contents of a cookie file depend on the Web site that is being visited. In
general, information about the visitor's identification is stored, along with
password information. Additional information, such as credit card details if a
card is used during a transaction, as well as details concerning the visitor's
activities at the Web site, for example which pages were visited, which
purchases were made, or which advertisements were selected, can also be
included. Often, cookies point back to more detailed customer information
stored at the Web server.
Another way of uniquely identifying users through a Web transaction
is by using identd, an identification protocol specified in RFC 1413 that
provides a means to determine the identity of a user of a particular TCP
connection. Given a TCP port number pair, it returns a character string,
which identifies the owner of that connection (the client) on the Web
server's system. Finally, a user can be identified under the assumption that
each IP address corresponds to one user. In some cases, IP addresses are
resolved into domain names registered to a person or a company, so that
more specific information is gathered.
As already mentioned, user profiling information can be explicitly
obtained by using online registration forms requesting information about the
visitor, such as name, age, sex, likes, and dislikes. Such information is stored
in a database, and each time the user logs on the site, it is retrieved and
updated according to the visitor’s browsing and purchasing behavior. All of
the aforementioned techniques for profiling users have certain drawbacks.
First, when a system depends on cookies for gathering user information,
the user may have turned off cookie support in his or her browser. Other
problems that may occur when using cookie technology stem from the fact
that the cookie file is stored locally on the user's computer: the user might
delete it, and when he or she revisits the Web site, the user will be regarded
as a new visitor. Furthermore, if no additional information is provided (e.g.,
some logon id), an identification problem occurs when more than one person
browses the Web using the same computer. A similar problem occurs when
using identd, inasmuch as the client must be configured in a mode that
permits plaintext transfer of ids.
A potential problem in identifying users by resolving IP addresses is
that in most cases the address is that of the ISP, which does not suffice for
determining the user's location. On the other hand, when gathering user
information through registration forms or questionnaires, many users submit
false information about themselves and their interests, resulting in the
creation of misleading profiles. In the latter case, there are two further
options: either regarding each user as a member of a group and creating
aggregate user profiles, or addressing any changes to each user individually.
When addressing the users as a group, the method used is the creation of
aggregate user profiles based on rules and patterns extracted by applying
Web usage mining techniques to Web server logs. Using this knowledge, the
Web site can be appropriately customized.
3.11 Privacy Issues
The most important issue that must be addressed during the user
profiling process is privacy violation. Many users are reluctant to give away
personal information either implicitly or explicitly; they hesitate to visit Web
sites that use cookies (if they are aware of their existence) or avoid disclosing
personal data in registration forms. In both cases, the user loses anonymity
and is aware that all of his or her actions will be recorded and used, in many
cases without consent. Additionally, even if a user has agreed to supply
personal information to a site, such information can be exchanged between
sites through cookie technology, resulting in its disclosure without the user's
permission.
P3P (Platform for Privacy Preferences) is a W3C proposed
recommendation [P3P] that suggests an infrastructure for the privacy of data
interchange. This standard enables Web sites to express their privacy
practices in a standardized format that can be automatically retrieved and
interpreted by user agents. The process of reading privacy policies is thereby
simplified for users, since key information about what data a Web site
collects can be conveyed automatically to the user, and discrepancies
between a site's practices and the user's preferences concerning the
disclosure of personal data can be flagged automatically. P3P, however, does
not provide a mechanism for ensuring that sites actually act according to
their policies.
3.12 Tools & Applications
Some of the most popular Web sites that use methods such as decision-tree
guides, collaborative filtering and cookies in order to profile users and create
customized Web pages are listed below. Additionally, a brief description of
the most important tools available for user profiling is given. An overview,
together with product references, is provided in Table 3.2.
Table 3.2 User Profiling Tools
Vendor / Product Name:
BroadVision [BRO]: One-To-One
Macromedia [MAC]: LikeMinds
Microsoft Firefly [MSF]: Passport
NetPerceptions [NPE]: GroupLens
Neuromedia [NME]: NeuroStudio
OpenSesame [OSE]: Learn Sesame
For each product, the table also records which of four profiling techniques it
relies on: collaborative filtering, cookies, user registration and page
customization.
Popular Web sites such as Yahoo!, Excite or Microsoft Network
[MSN] allow users to customize their home pages based on their selections
of available content, using information supplied by the users and cookies
thereafter. In that way, each time the user logs in to the site, he or she sees a
page containing information addressed to his or her interests. Rule-based
filtering is used by online retailers such as Dell and Apple Computer, giving
users the ability to easily customize product configurations before ordering.
As far as recommendation systems are concerned, the most popular example
is amazon.com. The system analyses past purchases and posts suggestions on
the shopper's customized recommendations page. Users who have not made
a purchase before can rate books and see listings of books they might like.
The same approach, based on user ratings, is used in many similar online
shops, such as CDNOW.
Commercial Web sites, including many search engines such as
AltaVista or Lycos, have associations with commercial marketing
companies, such as DoubleClick Inc. These sites use cookies to monitor their
visitors' activities, and any information collected is stored as a profile in
DoubleClick's database. DoubleClick then uses this profile information to
decide which advertisements or services should be offered to each user when
he or she visits one of the affiliated DoubleClick sites. Of course, this
information is collected and stored without the users' knowledge and, more
importantly, without their consent.
There are several systems available for creating user profiles. They
vary according to the user profiling method that they use. These include:
a) BroadVision's One-To-One, a high-end marketing tool designed to let
sites recognize customers and display relevant products and services
(customers include Kodak Picture Network and US West); b) NetPerceptions'
GroupLens, a collaborative filtering solution requiring other users to actively
or passively rate content (clients include Amazon.com and Musicmaker);
c) Open Sesame's Learn Sesame, a cookie-based product (clients include
Ericsson and Toronto Dominion Bank); d) the early leader in collaborative
filtering, Firefly Passport, developed by the MIT Media Lab and now owned
by Microsoft (clients include Yahoo, Ziff-Davis and Barnes&Noble);
e) Macromedia's LikeMinds Preference Server, another collaborative
filtering system that examines users' behaviour and finds other users with
similar behaviours in order to create a prediction or product recommendation
(clients include Cinemax-HBO's Movie Matchmaker and Columbia House's
Total E entertainment site); f) Neuromedia's NeuroStudio, intelligent-agent
software that allows Webmasters to give users the option to create
customized page layouts, using either cookies or user log-in (customers
include Intel and the Y2K Links Database site); and g) Apple's WebObjects,
a set of development tools that allow customized data design (clients include
The Apple Store and Cybermeals)46.
3.13 Summary
In this chapter, a detailed KDD schema has been presented and an
explanation provided for each step. The relationship between data mining
and its branch, web mining, has been established, and the essential
characteristics of web mining have been investigated. A taxonomy has been
depicted, showing that web mining consists of three subareas: web structure
mining, web usage mining and web content mining. The peculiarities of each
web mining subarea, and the tasks that can be achieved using the various
kinds of data related to the web, have been briefly described. An analysis of
the data collection sources is provided, together with the different formats of
web data. The material collected and presented in this chapter is a
comprehensive guide to the web mining area.
Reference
1. J. Han, M. Kamber (2000): Data Mining: Concepts and Techniques
Morgan Kaufmann.
2. O. Etzioni (1996), “The World Wide Web: Quagmire or Gold Mine”,
in Communications of the ACM, 39(11):65-68.
3. Kosala, R. and H. Blockeel, Web Mining Research: A Survey.
SIGKDD Explorations, 2000. 2(1): p. 1-15.
4. Ghani, R. and A. Fano. Building Recommender Systems Using a
Knowledge Base of Product Semantics. in Proceedings of the
Workshop on Recommendation and Personalization in E-Commerce,
at the 2nd International Conference on Adaptive Hypermedia and
Adaptive Web Based Systems (AH2002). 2002, p. 11-19, Malaga,
Spain.
5. Chakrabarti, S., et al. The Structure of Broad Topics on the Web. in
Proceeding of 11th International World Wide Web Conference. 2002,
p. 251 - 262, Honolulu, Hawaii, USA.
6. Büchner, A.G. and M.D. Mulvenna, Discovering Internet Marketing
Intelligence through Online Analytical Web Usage Mining. SIGMOD
Record, 1998. 27(4): p.54-61.
7. Chang, G., et al., eds. Mining the World Wide Web: An Information
Search Approach. The Information Retrieval. Vol. 10. 2001, KAP.
8. Pierrakos, D., et al. Web Community Directories: A New Approach to
Web Personalization. in Proceeding of the 1st European Web Mining
Forum (EWMF'03). 2003, p. 113-129, Cavtat-Dubrovnik, Croatia.
9. Mobasher, B., Web Usage Mining and Personalization, in Practical
Handbook of Internet Computing, M.P. Singh, Editor. 2004, CRC
Press. p. 15.1-37.
10. J. Srivastava, P. Desikan and V. Kumar (2002), "Web Mining:
Accomplishments & Future Directions", National Science Foundation
Workshop on Next Generation Data Mining (NGDM'02).
11. O. Etzioni (1996), “The World Wide Web: Quagmire or Gold Mine”,
in Communications of the ACM, 39(11):65-68.
12. R. Cooley (2000) Web Usage Mining: Discovery and Application of
Interesting Patterns from Web Data. Ph.D. Thesis. University of
Minnesota. May 2000.
13. S. K. Madria, S. S. Bhowmick, W. K. Ng, E. P. Lim (1999): Research
Issues in Web Data Mining. DaWaK: 303-312.
14. J. Borges, M. Levene (1998), “Mining Association Rules in
Hypertext Databases”, in Proceedings of the Fourth International
Conference on Knowledge Discovery and Data Mining (KDD-98),
New York City.
15. B. Murray and A. Moore (2002). Sizing the Internet. White paper,
Cyveillance.
16. J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins
(1999). The Web as a Graph: Measurements, Models, and Methods.
Proceedings of the International Conference on Combinatorics and
Computing, pp. 1-18.
17. A. G. Buchner, & M. D. Mulvenna (1998). Discovering Internet
marketing intelligence through online analytical web usage mining.
SIGMOD Record, 27 (4), 54-61.
18. J. C. Bertot, C. R. McClure, W. E. Moen & J. Rubin (1997). Web
usage statistics: Measurement issues and analytical techniques.
Government Information Quarterly, 14 (4), 373-395.
19. G. Salton and M. McGill (1983), Introduction to Modern information
Retrieval. McGraw Hill.
20. S. Deerwester, S. Dumains, G. Furnas, T. Landauer and R. Harshman
(1990). Indexing by Latent Semantic Analysis. Journal of American
Society for Information Science. 41(6): 391-407.
21. H. Ahonen, O. Heionen, M. Klemettinen and A. Verkamo (1998).
Applying data mining techniques for descriptive phrase extraction in
digital document collections. In advances in Digital Libraries (ADL
98). Santa Barbara California, USA.
22. W.W. Cohen (1995). Learning to classify English text with ilp
methods. In Advances in Inductive Logic Programming (Ed. L. De
Raedt). IOS Press.
23. P. Desikan, J. Srivastava, V. Kumar, P.-N. Tan (2002), “Hyperlink
Analysis – Techniques & Applications”, Army High Performance
Computing Center Technical Report.
24. K. Wang and H. Lui (1998), “Discovering Typical Structures of
Documents: A Road Map Approach”, in Proceedings of the ACM
SIGIR Symposium on Information Retrieval.
25. C. H. Moh, E. P. Lim, W. K. Ng (2000), “DTD-Miner: A Tool for
Mining DTD from XML Documents”, WECWIS: 144-151.
26. M. Gandhi, K. Jeyebalan, J. Kallukalam, A. Rapkin, P. Reilly, N.
Widodo (2004), Web Research Infrastructure Project Final Report ,
Cornell University.
27. J. Srivastava, R. Cooley, M. Deshpande and P-N. Tan (2000). “Web
Usage Mining: Discovery and Applications of usage patterns from
Web Data”, SIGKDD Explorations, Vol 1, Issue 2.
28. Brin, S. and L. Page, The PageRank Citation Ranking: Bringing Order
to the Web (http://www-db.stanford.edu/~backrub/pageranksub.ps).
1998.
29. Ding, C., et al., PageRank, HITS and a Unified Framework for Link
Analysis, L.B.N.L.T. Report, Editor. 2002, University of California,
Berkeley, CA.
30. Borodin, A., et al. Finding Authorities and Hubs from Hyperlink
Structures on the World Wide Web. in Proceedings of the 10th
International World Wide Web Conference. 2001, p. 415-429, Hong
Kong, China.
31. Haveliwala, T. Topic-Sensitive PageRank. in Proceedings of the 11th
International World Wide Web Conference. 2002, p. 517-526,
Honolulu, Hawaii, USA.
32. Kamvar, S., Haveliwala, T., Manning, C. and Golub, G. Extrapolation
Methods for Accelerating PageRank Computations. in Proceedings of
WWW'03. 2003, p. 261-270, Budapest, Hungary.
33. Page, L., Brin, S., Motwani, R., Winograd, T., The PageRank Citation
Ranking: Bringing Order to the Web. Technical Report, Computer
Science Department, Stanford University, 1998.
34. Richardson, M. and Domingos, P. The Intelligent Surfer: Probabilistic
Combination of Link and Content Information in PageRank. in 2001
Neural Information Processing Systems Conference (NIPS 2001).
2001, p. 1441-1448, Vancouver, British Columbia, Canada: MIT
Press, Cambridge, MA.
35. Angoss. [accessed 2003.11.14]. Available from Internet:
<http://www.angoss.com/angoss.html/>.
36. Clementine. [accessed 2005.09.03]. Available from Internet:
<http://www.spss.com/clementine/>.
37. MINEit. [accessed 2001.11.21]. Available from Internet:
<http://www.mineit.com/products/easyminer/>.
38. NetGenesis. [accessed 2003.05.04]. Available from Internet:
<http://www.netgen.com/>.
39. Pitkow, J.; Bharat, K. 1994a. WEBVIZ: a tool for World-Wide Web
access log analysis, in Proc. of the First International World Wide
Web Conference. 35–41.
40. Cooley, R.; Mobasher, B.; Srivastava, J. 1997a. Grouping Web Page
References into Transactions for Mining World Wide Web Browsing
Patterns, in Proc. of the IEEE Knowledge and Data Engineering
Exchange Workshop (KDEX'97). 2–7.
41. Pirolli, P.; Pitkow, J.; Rao, R. 1996. Silk from a Sow's Ear:
Extracting Usable Structure from the Web, in Proc. of the Human
factors in computing systems: Common ground;CHI 96. 118–125.
42. Faulstich, L.; Spiliopoulou, M.; Winkler, K. 1999. A Data Miner
Analyzing the Navigational Behaviour of Web Users, in Proc. of the
Workshop on Machine Learning User Modeling of the ACAI'99
International Conf. 44–49.
43. Tauscher, L.; Greenberg, S. 1997. How people revisit web pages:
empirical findings and implications for the design of history systems,
in International Journal of Human Computer Studies. 47(1): 97–138.
44. Savola, T.; Brown, M.; Jung, J.; Brandon, B.; Meegan, R.; Murphy,
K.; O'Donnell, J.; Pietrowicz, S. R. 1996. Using HTML, 1043 p.
45. Ansari, S.; Kohavi, R.; Mason, L.; Zheng, Z. 2001. Integrating
E-Commerce and Data Mining: Architecture and Challenges, in Proc.
of the IEEE International Conference on Data Mining. 27–34.
46. Richard Dean, Personalizing Your Web Site,
http://Webbuilder.netscape.com/Webbuilding/pages/Business/Personal/index.html