Web Mining & Pattern Discovery
Paul George
CSE 8331
Submitted to: Dr. M. Dunham
Department of Computer Science
Southern Methodist University, Texas, USA
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
Web Content Mining
    Agent-Based Approach
        Intelligent Search Agents
        Information Filtering/Categorization
        Personalized Web Agents
    Database Approach
        Multilevel Databases
        Web Query Systems
    Overview of Crawlers
    Personalization
Web Usage Mining
PATTERN DISCOVERY
    Preprocessing Tasks
        Data Cleaning
        Transaction Identification
    Discovery Techniques
        Path Analysis
        Association Rules
        Clustering and Classification
        Sequential Patterns
WEB USAGE MINING ARCHITECTURE – WEBMINER
    Tools
BENEFITS
APPLICATIONS
RESEARCH AREAS
CONCLUSION
BIBLIOGRAPHY
ABSTRACT
Web mining has been, and remains, the focus of many research papers. It can be classified into three categories: Web content mining, the process of discovering information from the various resources available on the WWW; Web structure mining, the process of discovering knowledge from the interconnections of hypertext documents; and Web usage mining, the process of pattern discovery and analysis. In this paper, Web content mining approaches are briefly discussed, and Web usage mining is examined in depth in the area of pattern discovery. I also list some applications to which Web mining can be applied, and briefly cover WEBMINER, a system for Web usage mining. I conclude by listing the open issues and research directions for the topics covered.
INTRODUCTION
First, let us define data mining: data mining can be described as finding hidden information in a database, and is also referred to as exploratory data analysis, data-driven discovery, and deductive learning [1].
Web mining is data mining applied to the World Wide Web, i.e., the mining of data related to the World Wide Web.
The Web data can be any of the following:
- Web page content
- HTML/XML code
- Automatically generated data stored in server access logs, referrer logs, and cookies residing on the client
- E-commerce transaction data
When data mining is applied to the Web, it can perform several functions:
- Information extraction: acquiring and interpreting useful information from Web data, which may lead to business intelligence.
- Resource discovery: discovering the locations of unfamiliar files on the network, which may or may not be relevant.
- Generalization: discovering information patterns [4].
Fig. 1 shows the classification of Web mining. As the figure shows, Web mining is divided into Web content mining, Web structure mining, and Web usage mining. In the following sections I discuss the types of approaches used in Web content mining, cover Web structure mining very briefly, and concentrate on pattern discovery in Web usage mining.
[Fig. 1: Web mining classification [3]. Web mining divides into Web content mining (with agent-based and database approaches), Web structure mining, and Web usage mining.]
Web Content Mining
Web content mining can be thought of as the process of information or resource discovery from resources on the WWW [5].
There are two approaches to Web content mining:
- Agent-based approaches
- Database approaches
Agent-Based Approach
The agent-based approach involves artificial intelligence systems that can act
autonomously or semi-autonomously on behalf of a particular user, to discover and
organize Web-based information [7].
Agent-based Web mining systems can be classified into three categories:
Intelligent Search Agents
Many intelligent search agents, e.g., Harvest [10] and FAQ-Finder [11], search for relevant information using domain-specific characteristics to interpret and organize the discovered information [7]. The agents mentioned above rely on documents that carry domain-specific information, or on “hard coded models of the information sources to retrieve and interpret documents” [7].
Information Filtering/Categorization
Those agents that belong to this category use IR techniques and characteristics of open
hypertext Web documents to automatically retrieve, filter, and categorize them.
For example, HyPursuit [13] uses semantic information embedded in link structures as
well as document content to create cluster hierarchies of hypertext documents, and
structure an information space [7].
Personalized Web Agents
Another category of Web agents includes those that learn about user preferences or
obtain them (from WWW) and discover sources from web data that correspond to these
preferences, and possibly those of other individuals with similar interests.
Example of such an agent includes the WebWatcher [14] [7].
Database Approach
The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, higher-level collections of resources. Analysis can then be performed on these collections using standard querying mechanisms. Examples of such collections are relational or object-oriented databases [8].
Multilevel Databases
This idea has been proposed as a way to organize Web-based information. The database is organized into hierarchies: the lowest level contains primitive semi-structured information stored in various Web repositories, such as hypertext documents, while higher levels contain generalizations of the levels below them, organized into structured collections such as relational or object-oriented databases. For example, the ARANEUS system [13] extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which generalize the notion of database views [8].
Web Query Systems
These systems attempt to combine standard database query languages such as SQL, and even natural language processing, with the types of queries used in World Wide Web searches. For example, W3QL [14] combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques [8].
Basic content mining is a type of text mining that extends the work performed by basic search engines. It can also improve a traditional search engine through techniques such as analyzing the links between pages and exploiting user profiles.
Overview of Crawlers
A crawler (also called a spider or robot) is a program that traverses the hypertext structure of the Web [1]. Crawlers are used for Web content mining in search engines.
Before explaining how a crawler works, let me define the term Seed URLs: these are the page or pages with which the crawler starts.
How it works: starting from the seed URLs, the crawler records and saves all links in a queue. These new pages are searched in turn and their links saved. As these robots traverse the Web, they collect information about each page, such as keywords, and store it in indices for the users of the associated search engine [1].
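To make the queue-based traversal concrete, here is a minimal sketch in Python; the in-memory LINKS graph and page names are invented stand-ins for the HTTP fetching and link extraction a real crawler would perform:

    from collections import deque

    # Hypothetical link graph standing in for real HTTP fetches and HTML parsing.
    LINKS = {
        "seed.html": ["a.html", "b.html"],
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": [],
    }

    def crawl(seed):
        """Breadth-first traversal from the seed URL, visiting each page once."""
        queue, seen, index = deque([seed]), {seed}, []
        while queue:
            page = queue.popleft()
            index.append(page)            # a real crawler would index keywords here
            for link in LINKS.get(page, []):
                if link not in seen:      # record and save new links in the queue
                    seen.add(link)
                    queue.append(link)
        return index

    print(crawl("seed.html"))  # ['seed.html', 'a.html', 'b.html', 'c.html']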
Types of crawlers
Periodic crawler: activated periodically; it may visit a certain number of pages and then stop, build an index, and replace the existing index.
Incremental crawler: updates the index incrementally instead of replacing it.
Focused crawler: visits only pages related to topics of interest.
To show where crawlers fit within a search engine, Fig. 2 reproduces the general architecture of a search engine from [3].
[Fig. 2: Architecture of a search engine [3].]
Personalization
Personalization is another area of Web content mining, in which the contents of a Web page are modified to better fit the desires of the user. The goal is to entice a current customer to purchase something he or she may not have thought about purchasing. Techniques used include cookies, databases, and more complex methods [1].
Personalization can be viewed as a type of [1]:
- Classification: the user's behavior is determined based on the behavior associated with his or her class.
- Clustering: the behavior is determined based on that of the users to whom he or she is judged similar.
- Prediction: used to predict what the user would like to see.
There are three types of Web page personalization [1]:
- Manual techniques: classify individuals using the details of user registration preferences, or through rules based on profiles.
- Collaborative filtering: personalization is achieved by recommending information that similar users have previously rated highly.
- Content-based filtering: retrieves pages based on the similarity between the pages and the user's profile. A minimal sketch of this variant follows.
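Here, pages and the user profile are assumed to be represented as term-frequency vectors over a shared vocabulary; the vectors, vocabulary, and page names are invented for illustration, and real systems would build these vectors from page text:

    from math import sqrt

    def cosine(u, v):
        """Cosine similarity between two term vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def rank_pages(profile, pages):
        """Order candidate pages by similarity to the user's profile vector."""
        return sorted(pages, key=lambda p: cosine(profile, p[1]), reverse=True)

    profile = [3, 0, 1]                       # e.g. counts for (mining, sports, web)
    pages = [("data.html", [2, 0, 2]), ("golf.html", [0, 4, 0])]
    print(rank_pages(profile, pages)[0][0])   # data.html ranks first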
Web Usage Mining
This is the process of discovering user access patterns (or user habits) from data automatically collected in daily access logs [4].
Web usage mining performs mining on Web usage data, or Web logs. A Web log is a listing of page reference data [1]; it is sometimes referred to as clickstream data because each entry corresponds to a mouse click. Logs can be examined from two perspectives [1]:
- Server: the log can be used to understand how a Web site is used and to improve its design.
- Client: by evaluating a client's sequence of clicks, information about a user (or group of users) is detected. This information can be used to perform prefetching and caching of pages, and hence to make Web pages load quickly.
Analyzing logs can help organizations market their products effectively through promotional campaigns and advertisements. It can also provide important information on how to restructure a Web site for an effective organizational presence [7].
Web usage mining can be divided into two parts: pattern discovery, which I cover in the following section, and pattern analysis [7].
Web log/Web usage mining can be used for the following [3]:
- To enhance server performance
- To improve Web-site navigation
- To target customers for electronic commerce
- To identify potential prime advertisement locations
PATTERN DISCOVERY
Having discussed the various uses of Web usage mining, there are several issues in pre-processing data for mining that must be taken care of before the mining algorithms can be run. These include developing a model of access log data; developing techniques to clean and filter the raw data to eliminate outliers and/or irrelevant items; grouping individual page accesses into transactions; integrating various data sources such as user registration information; and specializing generic data mining algorithms to take advantage of the specific nature of access log data [7].
Preprocessing Tasks
Preprocessing can be divided into two activities:
- Data cleaning
- Transaction identification
Data Cleaning
This is one of the most important activities, as the presence of irrelevant data during the analysis phase may yield wrong results. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name: for instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed [7].
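A minimal sketch of this suffix check, assuming log entries have already been parsed into records with a url field (the field name and sample entries are invented):

    IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

    def clean_log(entries):
        """Drop entries whose requested file has an image or map suffix."""
        return [e for e in entries
                if not e["url"].lower().endswith(IRRELEVANT_SUFFIXES)]

    log = [{"url": "/company/products.html"}, {"url": "/images/logo.GIF"}]
    print(clean_log(log))  # only the .html request survives

Lower-casing the URL before the check covers the capitalized suffix variants (GIF, JPEG, JPG) mentioned above.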
Because of caching, a page may be accessed many times yet be listed in the log only once. Recent methods that make use of cookies, cache busting, and explicit user registration try to overcome this problem, but as detailed in [17], all of these methods have drawbacks: cookies can be deleted by the user; cache busting defeats the speed advantage that caching was created to provide, and can be disabled; and user registration is voluntary, with users often providing false information. User identification is another problem. Machine names cannot be used, as more than one user may be behind a single name (here, the name is the IP address of the machine); for example, a company network may have a single IP address through which several people access the same URL. [18] discusses an algorithm that checks whether each incoming request is reachable from the pages already visited: if a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine [7]. Other methods of identifying users, based on IP address, browser agent, and temporal information, are discussed in [17].
Transaction Identification
The raw server log can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions, each of a single page reference. Sequences of page references must be grouped into logical units representing Web transactions or user sessions. A user session is all of the page references made by a user during a single visit to a site. A transaction differs from a user session in that the size of a transaction can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions [7].
The goal of transaction identification is to create meaningful clusters of references for each user. The task can thus range from dividing a large transaction into multiple smaller ones to merging small transactions into larger ones [19].
Two types of transactions are defined. The first type is navigation-content, where each
transaction consists of a single content reference and all of the navigation references in
the traversal path leading to the content reference. These transactions can be used to mine
for path traversal patterns. The second type of transaction is content-only, which consists
of all of the content references for a given user session. These transactions can be used to
discover associations between the content pages of a site [19].
A given page reference is classified as either navigational or content based on the time spent on the page. This kind of page typing is further delineated in [18], where various page types, such as index pages and personal home pages, are used in the discovery of user patterns [19].
A general transaction definition is given in [19]; it assumes that each transaction is made up of references from only one user, so forming general transactions is a merge activity. Once the log has been arranged according to the general transaction definition, one of three divide modules can be applied to it: reference length, maximal forward reference, or time window.
The first two methods identify transactions based on a model of user browsing behavior, which uses the concepts of navigation and content pages. According to this model there are two ways of defining a transaction: one is to define a transaction as all navigation references up to and including each content reference for a given user; the second is to define a transaction as all of the content references. The time window method is used as a benchmark against which the other two algorithms are compared [19].
The reference length module is based on the assumption that the amount of time a user spends on a page correlates with whether the page should be classified as a navigation or content page for that user [19]. A toy illustration of this assumption follows.
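The sketch below assumes sessions are already lists of (url, seconds-viewed) pairs; the 30-second cutoff is a placeholder, since [19] estimates the cutoff from the observed distribution of reference lengths rather than fixing it by hand:

    def classify_references(session, cutoff_seconds=30.0):
        """Label each page reference as navigation or content by viewing time."""
        return [(url, "content" if seconds > cutoff_seconds else "navigation")
                for url, seconds in session]

    session = [("/", 4.0), ("/products/", 9.5), ("/products/p2.html", 95.0)]
    print(classify_references(session))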
In the maximal forward reference module, each transaction is defined as the set of pages in the path from the first page in the log up to the page where a backward reference is made. A new transaction is started when the next forward reference is made. “A forward reference is defined to be a page not already in the set of pages for the current transaction”; similarly, a backward reference is a page that is already contained in the set of pages for the current transaction [19].
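The following sketch is one plausible reading of that definition; the page names are invented, and consecutive backward references are handled per the description above rather than per the exact code of [20]:

    def maximal_forward_references(clicks):
        """Split a click sequence into transactions at backward references."""
        transactions, path, extending = [], [], False
        for page in clicks:
            if page in path:
                if extending:                       # path just reached its maximal extent
                    transactions.append(list(path))
                path = path[:path.index(page) + 1]  # back up to the revisited page
                extending = False
            else:
                path.append(page)                   # forward reference extends the path
                extending = True
        if extending and path:
            transactions.append(list(path))         # flush the final forward run
        return transactions

    print(maximal_forward_references(["A", "B", "C", "D", "C", "B", "E"]))
    # [['A', 'B', 'C', 'D'], ['A', 'B', 'E']]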
“Time window transaction identification module divides the log for a user into time intervals no larger than a specified parameter”. The three modules are discussed in detail in [19].
Discovery Techniques
Once the log has been divided into transactions, techniques can be applied to perform
pattern mining. Some of them are listed below:
 Path Analysis
 Association Rules
 Clustering and Classification
 Sequential Patterns
Path Analysis
Path analysis performs pattern discovery using graphs. A graph represents some relation defined on Web pages; thus, for a Web site, a graph may be drawn with Web pages as nodes and the hypertext links between pages as directed edges [7].
The navigation-content transactions of [19], the maximal forward reference transactions of [20], or the user sessions of [18], some of which were briefly introduced above, can be used for path analysis. Path analysis can be used to determine the most frequently visited paths in a Web site [7]; a small sketch follows.
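As a small illustration, frequently followed paths can be read off the transactions produced earlier, for example by counting complete traversal paths; the paths shown are invented:

    from collections import Counter

    def frequent_paths(transactions, top=2):
        """Count complete traversal paths and return the most common ones."""
        return Counter(tuple(t) for t in transactions).most_common(top)

    paths = [["/", "/company", "/company/products"],
             ["/", "/company", "/company/products"],
             ["/", "/about"]]
    print(frequent_paths(paths))
    # [(('/', '/company', '/company/products'), 2), (('/', '/about'), 1)]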
[18] discusses approaches to extracting structure from the Web that can be used to form higher-level abstractions, reducing the complexity and increasing the richness of an overview. The approach is based on graph-structure representations of similarity, or strength of association, and relations. Three types of graph structures are discussed. The first represents the link topology of a Web locality by using arcs labeled with unit strengths to connect nodes, where a node represents a page and an arc represents the hyperlink connecting two pages. (A Web locality may be thought of as a “complex abstract space in which we have arranged Web pages of different functional categories or types”.) The second type represents inter-page text content similarity by labeling the arcs connecting nodes with the computed text similarities between the corresponding Web pages. The third type represents the flow of users through the locality by labeling the arcs between two nodes with the number of users who go from one page to the other [18].
Examples of information that can be discovered through path analysis are [7]:
- 75% of clients who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products, and /company/product1. This suggests that there is useful information on the product2 page, but that the path to the page is not clear.
- 65% of clients left the site after four or fewer page references. This suggests that more information should be provided in the first four pages from the points of entry into the site.
Association Rules
The problem of mining association rules is to find all rules that satisfy a user-specified
minimum support and minimum confidence. An association rule is of the form X => Y
13
where X,Y are set of items and X ∩ Y is null. The support of an item (or set of items) is
the percentage of transactions in which that item (or items) occurs. It is denoted by s.
Confidence for an association rule X => Y is the ratio of the number of transactions that
contain X U Y to the number of transactions that contain X. Confidence measures the
strength of the rule while support measures how often it should occur in the database. It is
denoted by α. Applications include cross-marketing, catalog design, store layout and
customer segmentation based on buying patterns. Also, when we are dealing with Web
mining, we would mostly be working with transaction logs rather than databases. [1][23]
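In symbols, with D the set of all transactions, these two measures are:

    s(X) = \frac{|\{T \in D : X \subseteq T\}|}{|D|}, \qquad
    \alpha(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X)}

A rule is reported only when both quantities clear the user-specified minima.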
Association rule discovery techniques are generally applied to databases of transactions
where each transaction consists of a set of items. The difficulty here is to discover all
associations and correlations among data items where the presence of one set of items in
a transaction implies (with a certain degree of confidence) the presence of other items. In
the context of Web mining, this problem amounts to discovering the correlations among
references to various files available on the server by a given client. Each transaction comprises a set of URLs accessed by a client in one visit to the server [7].
Since such transaction databases/logs usually contain extremely large amounts of data, current association rule discovery techniques try to prune the search space according to the ‘support’ of the items under consideration [7].
The problem of discovering generalized association rules can be divided into three parts
[23]:
1. Find all sets of items (itemsets) whose support is greater than user-specified
minimum support. Itemsets with minimum support are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules.
3. Prune all uninteresting rules from this set.
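Step 1 dominates the cost of the process. Below is a minimal level-wise sketch of that step, in the spirit of Apriori [21] but without its candidate-pruning refinements; the example transactions are invented:

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        """Level-wise search: count k-itemsets, keep the frequent ones,
        then join survivors to form (k+1)-item candidates."""
        n = len(transactions)
        candidates = {frozenset([i]) for t in transactions for i in t}
        frequent = {}
        while candidates:
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            level = {c: k / n for c, k in counts.items() if k / n >= min_support}
            frequent.update(level)
            candidates = {a | b for a, b in combinations(level, 2)
                          if len(a | b) == len(a) + 1}
        return frequent

    logs = [frozenset(t) for t in (["p1", "p2"], ["p1", "p2", "p3"], ["p1", "p3"])]
    print(frequent_itemsets(logs, min_support=0.6))

Steps 2 and 3 would then derive rules from these itemsets and prune those below the minimum confidence or otherwise uninteresting.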
There are several algorithms for finding association rules. I review some of the algorithms that have been found to improve on the “Basic” algorithm (e.g., the Apriori algorithm).
In most cases, taxonomies (is-a hierarchies) over the items are available. Earlier work on association rules took into account only leaf-level items in the taxonomy, ignoring the taxonomies themselves. Finding rules at different levels can be valuable because [23]:
- Rules at lower levels may not have minimum support.
- Taxonomies can be used to prune uninteresting or redundant rules.
[23] presents several algorithms that run 2 to 5 times faster than the “Basic” algorithm (the Apriori algorithm [21]); on one real-life dataset, the speedup approached 100 times.
The “Cumulate” algorithm [23] was developed by adding optimizations to the Basic algorithm; the name indicates that all itemsets of a certain size are counted in one pass. The “Stratify” algorithm [23] was developed assuming that the optimizations of Cumulate are applicable, and two further variants of Stratify, “Estimate” and “EstMerge”, are also described. “Basic”, “Cumulate”, and “EstMerge” are compared in [23] on the basis of minimum support, number of transactions, fan-out, number of roots, number of items/levels, and depth ratio; details of these algorithms can be studied in [23].
[26] discusses the FDM (Fast Distributed Mining) algorithm for generating association rules in a distributed environment. “FDM generates a small number of candidate sets and substantially reduces the number of messages to be passed” when mining association rules. [26] also discusses three versions of FDM as well as the “Count Distribution” algorithm for parallel mining, with further information and comparisons.
[21] presents the algorithms “Apriori” and “AprioriTid”, together with “AprioriHybrid”, a hybrid that combines the best features of both.
Shown below are examples of correlations that can be found using association rule discovery techniques [7]:
- 55% of clients who accessed the Web page with URL page1.html also accessed page2.html.
- 25% of clients who accessed /company/offer.html placed an online order in /company/offer/product1.
The discovery of such rules can help organizations engaged in electronic commerce develop effective marketing strategies. In addition, association rules discovered from WWW access logs can give an indication of how best to organize the organization's Web space [7].
Clustering and Classification
Discovering classification rules [29] allows one to “develop a profile of items belonging
to a particular group according to their common attributes”. This “profile can then be
used to classify new data items that are added to the database”. In Web mining,
classification techniques allow one to develop a profile for clients who access particular
server files based on demographic information available on those clients, or based on
their access patterns. For example, classification on WWW access logs may lead to the discovery of relationships such as the following [7]:
- Clients from state or government agencies who visit the site tend to be interested in the page /company/products/product1.html.
- 40% of clients who placed an online order for product 2 from the Web page /company/products/product2.html were in the 20-25 age group and lived on the West Coast.
In some cases, information about clients can be captured by servers from the client browsers; this includes information available on the client side in history files, cookie files, etc. Other methods of obtaining profile and demographic information on clients include user registration, online survey forms, etc. [7]
[29] discusses a fast, scalable classifier for data mining, observing that most classification algorithms are designed only for memory-resident data, which limits their suitability for mining large data sets. The classifier presented in [29] is SLIQ, which stands for “Supervised Learning in Quest” (Quest is a data mining project at the IBM Almaden research center). “SLIQ is a decision tree classifier that can handle both numeric and categorical data”; it is studied in detail in [29].
“Clustering analysis allows one to group together clients or data items that have similar characteristics”. Clustering of client information or data items from Web transaction logs can assist marketing strategies, both online and offline, such as automated return mail to clients falling within a certain cluster, or dynamically changing a particular site for a client on a return visit, based on the past classification of that client [7]. A tiny sketch of such grouping follows.
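The sketch clusters clients by binary page-visit vectors with a bare-bones k-means; the vectors, the choice of k, and the iteration count are illustrative assumptions, not anything prescribed by [7]:

    import random

    def kmeans(vectors, k, iters=10, seed=0):
        """Tiny k-means: assign each client vector to its nearest centroid,
        then recompute centroids as component-wise means."""
        random.seed(seed)
        centroids = [list(v) for v in random.sample(vectors, k)]
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for v in vectors:
                j = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(v, centroids[c])))
                clusters[j].append(v)
            for j, members in enumerate(clusters):
                if members:
                    centroids[j] = [sum(col) / len(members)
                                    for col in zip(*members)]
        return clusters

    # One 0/1 vector per client over pages (home, products, support, careers).
    clients = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
    print(kmeans(clients, k=2))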
Sequential Patterns
Sequential patterns are used to find patterns in transaction logs/databases such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set. In Web server transaction logs, a visit by a client is recorded over a period of time; the “time stamp associated with a transaction in this case will be a time interval which is determined and attached to the transaction during the data cleaning or transaction identification processes”. The discovery of sequential patterns in Web server access logs allows Web-based organizations to predict user visit patterns and helps them target advertising at specific users [7].
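A minimal sketch of measuring support for one such pattern, treating each client's visit history as a time-ordered sequence and requiring the pattern's pages to appear in order; the page names and histories are invented:

    def contains(pattern, sequence):
        """True if pattern occurs in order (not necessarily contiguously)."""
        it = iter(sequence)
        return all(page in it for page in pattern)

    def sequential_support(pattern, client_sequences):
        """Fraction of clients whose ordered visits contain the pattern."""
        hits = sum(contains(pattern, seq) for seq in client_sequences)
        return hits / len(client_sequences)

    histories = [["/", "/search", "/search/topics"],
                 ["/", "/about", "/search"],
                 ["/search", "/", "/search/topics"]]
    print(sequential_support(["/search", "/search/topics"], histories))  # 2/3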
[27] discusses the problem of recognizing frequent episodes in sequences of events occurring in transaction logs. An episode is defined as “a collection of events that occur within a given size in a partial order”; moreover, the individual page accesses must occur within a given time frame. Episodes can be classified as serial, parallel, or general: in a serial episode the events are totally ordered; in a parallel episode there need not be any particular order, though the events still satisfy the time constraint; and a general episode satisfies some partial order [27][1].
The time frame within which an episode must occur is specified by the user. [27] discusses an algorithm for finding all frequent episodes, and also studies how to recognize serial and parallel episodes.
[28] discusses three algorithms, “AprioriSome”, “AprioriAll”, and “DynamicSome”, all of which are used for mining sequential data. [28] concentrates mostly on the problem of finding text sequences that match a given regular expression. Each algorithm can be divided into five basic phases: sort, itemset, transformation, sequence, and maximal. These algorithms are closely studied and compared in [28].
By analyzing this information, the Web mining system can determine temporal relationships among data items, such as the following [7]:
- 50% of clients who visited /search/topics/ had done a search on Yahoo within the past week on the keyword mining.
An example of determining similar time sequences using sequential patterns is the need to find common characteristics of all clients who visited a particular file within the time period [t1,t2]; or we may want to know the most accessed page of the week [7].
Two other techniques can also be applied during the pattern discovery phase: statistical analysis [30] and dependency modeling [30]; both are studied in [30].
Pattern discovery is followed by pattern analysis, which I do not cover in this paper; it is studied in detail in [7].
WEB USAGE MINING ARCHITECTURE – WEBMINER
A general architecture for Web usage mining is presented in [7][9]; WEBMINER is a system that implements parts of this general architecture. The first part of the architecture is domain dependent, while the second part is domain independent and includes pattern discovery and analysis as part of the system's data mining engine. The overall architecture for the Web mining process is depicted below [7]:
[Figure 3: A General Architecture for Web Usage Mining [7].]
To briefly explain the figure: data cleaning is the first step performed in the Web usage mining process, and we have discussed some of the techniques for cleaning the log data. “Currently, the WEBMINER system uses the simplistic method of checking filename suffixes. Some low level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc”. [7][9]
After data cleaning, the log entries are grouped into clusters using one or a series of transaction identification modules; we have discussed a few techniques for separating data into transactions. The “WEBMINER system currently has reference length, maximal forward reference, and time window divide modules, and a time window merge module”. [7]
“Access log data may not be the only source of data for the Web mining process. User registration data, for example, is playing an increasingly important role, particularly as more security and privacy conscious client-side applications restrict server access to a variety of information, such as the client user IDs”. The data collected through user registration is then integrated with the access log data. There are also known or discovered attributes of referenced pages, such as page types, usage frequency, and link structures, that could be integrated into a higher-level database schema. “While WEBMINER currently does not incorporate user registration data, various data integration issues are being explored in the context of Web usage mining”. [7][9]
“In WEBMINER, a simple Query mechanism has been implemented by adding some primitives to an SQL-like language”. This allows the user to specify his or her patterns of interest to the mining engine. [7]
The information from the query is used to reduce the scope, and thus the cost, of the mining process. The development of a more general query mechanism, along with appropriate Web-based user interfaces and visualization techniques, is still under research. [7]
Tools
Some of the newest Web data mining tools from Megaputer are WebAnalyst and XSellAnalyst; more details can be found in [31].
BENEFITS
Let’s have a look at some of the benefits you get from Web mining [31]:
Match your available resources to visitor interests:
Resources can be products you sell, information fragments you distribute online, banner ads from your client advertisers, e-mail fragments from a mailing list, “or anything else” distributed online. “Metadata of these resources are then stored in a database. WebAnalyst helps learn visitor interests by collecting and analyzing information generated by interactions with your website, such as clickstream data, search requests, and cookies. WebAnalyst can use the gleaned knowledge to rank your resources by their relevance to the user's interests. Servicing a user request for information with the best matching resources results in a higher visitor-to-customer conversion rate for your e-business”. [31]
Increase the value of each visitor
Using collaborative filtering, we can predict what kind of information a visitor may be interested in, and which products she might consider purchasing. “These predictions are used to present the visitor with related products and resources”, increasing the chances of a purchase. “This knowledge significantly increases the value of a customer for an e-business when used in individualized cross-selling and up-selling promotions, and thus increasing revenue.” [31]
Improve the visitor's experience at the website
“A sound combination of data and text mining techniques can help determine user interests - early in the process of the visitor's interaction with the website. This allows the website to act interactively and proactively and deliver the most relevant customized resources to the visitor”. In the world of the Internet, easy access to relevant information can make the difference between a profitable customer and a lost opportunity. “By increasing the customer's satisfaction, you reduce attrition and build brand loyalty”. [31]
Perform targeted resource management
Since visitors differ in their buying behavior, you may notice that some of them are your best potential customers, ready to click and buy, while others are prospecting for information and familiarizing themselves with your brand. These prospecting customers may become “very important and profitable customers” in the future. There is also a group of visitors who enjoy only free rides: “These folks will use promotional resources that you offer to the fullest extent, but will never purchase anything. All these visitors come through a single pipe to your website and are in a common queue for your website resources”. It is to your advantage to be able to tell each type of visitor apart. “Your website performance is limited and you might want to prioritize requests coming from your best prospects. If you are distributing promotional resources of high value, you might want to spend your promotional budget wisely by offering and delivering your promotional materials only to your best prospects - not to every Web surfer on the planet. WebAnalyst can work with load-balancing products to provide the best quality of service to your best customers”. [31]
Collect information in new ways
“While for the majority of e-vendors the task of collecting data is just an intermediate step necessary for better targeting their marketing, for others this task might be the main motivation for creating a website itself”. Traditional data collection methods such as promotions, surveys, and focus groups have many well-known problems, including high cost, poor response rates, and low accuracy. “Now imagine that you can offer your promotional items online through a content-rich website, where visitors can find useful information in addition to submitting their contact information and requesting the promotion. WebAnalyst can learn the visitor's preferences (at virtually no cost) based on the content that the user was browsing. Of course, WebAnalyst is designed to work hand-in-hand with your privacy management system, allowing you to collect valuable data while respecting the privacy of your visitors”. [31]
Test the relevance of content and web site architecture
Perhaps you would like to increase usability, or optimize your website for the eyes of your best prospects, by taking a close look at the website's content and architecture. “Log analyzers can help you visualize the most navigated paths through your website, averaged over all visitors. When optimizing your website structure, your main concern should be to improve the experience of your most promising prospects, and not just everybody. Roughly 15% of your website visitors comprise really valuable prospects. The remaining 85% have little value to you other than sustaining the brand recognition traffic. Thus you have to segregate your least important prospects and subtract their contribution from the overall picture of the site navigation. What is left represents the real quality of your website. This is the picture that can help you really improve your bottom line”. [31]
APPLICATIONS
Some of the applications of Web mining are described below [31]:
E-tailers: B2B and B2C Ventures
WebAnalyst is easily applied to any B2C or B2B e-tailing scheme. Any company that
profits by selling goods or services via the Web may benefit from WebAnalyst's ability to
find new cross-sell opportunities, enable comprehensive prospect profiling, and improve
customer satisfaction. [31]
Advertising-Based Sites: Entertainment sites, Media Portals, Advertising Providers
When your revenue is advertising-based, you know that blindly serving ads to visitors
will not result in a large click-thru rate. Instead, ads must be intelligently targeted to the
user, providing the visitor with products and services that they are interested in.
WebAnalyst's data collection services give you customizable access to every bit of
information that passes between the visitor and the server. Next, its cutting-edge data
mining modules analyze the data and match the visitor's profile with the ads that they are
most interested in. Finally, WebAnalyst's long-term data mining modules find new
patterns and refine existing models to improve response rates even further. [31]
Information Repositories: Libraries, Technical Support Sites, Media Sites, Content Providers
Information overload is a problem that grows larger every day. You would like to use
your staff for content creation, yet you find yourself spending exponentially more time on
indexing, summarization, and other metadata tasks. WebAnalyst's semantic text analysis
capabilities can automate these tasks, and create user navigation systems on the fly. [31]
Web Integrators: Web Development, Consulting, and ASPs
If you are a web integrator looking to provide web intelligence solutions, WebAnalyst
may be exactly what you are looking for. WebAnalyst is a flexible, modular system that
is designed for customization.[31]
Web content mining can be used for “discovering unexpected information from your competitors' web sites” [32]:
- Finding unexpected information is “useful in many applications. For example, it is useful for a company to find unexpected information about its competitors, e.g., unexpected services and products that its competitors offer”.
- It is very difficult for a human user to view each page to discover unexpected information, so automated assistance is needed. [32] proposes a number of methods to help the user find various types of unexpected information from his/her competitors' Web sites; experimental results show that these techniques are very useful in practice, and also efficient.
We can also use neural networks for Web content filtering [33]:
- Pornographic web sites can be identified, and access to them restricted.
- Methods such as “PICS, URL blocking, term filtering, and intelligent content analysis” are discussed in [33].
There are many applications, but only a few of them have been discussed above.
Web-based applications that use Web mining include the following [34]:
- Business Intelligence
- Computational Societies and Markets
- Conversational Systems
- Customer Relationship Management (CRM)
- Direct Marketing
- Electronic Commerce and Electronic Business
- Electronic Library
- Information Markets
- Price Dynamics and Pricing Algorithms
- Measuring and Analyzing Web Merchandising
- Web-Based Decision Support Systems
- Web-Based Distributed Information Systems
- Web-Based EDI
- Web-Based Learning Systems
- Web Marketing
- Web Publishing
RESEARCH AREAS
The area of Web mining is still new, so there is still much room for improvement in the various techniques it uses. The techniques being applied to Web content mining draw heavily from the work on information retrieval, databases, intelligent agents, etc. [7]
An important area of ongoing research is to continue developing methods for clustering log entries into user transactions, including the use of criteria such as the time differential between entries, the time spent on a page relative to the page size, and user profile information collected during user registration. [19]
Association rules do not consider the quantities of the items bought in a transaction, which are useful for some applications. [21]
Techniques also have to be developed to remove the uninteresting rules that are generated alongside interesting ones. [23]
Open issues include the creation of intelligent tools that can assist in the interpretation of mined knowledge. There also needs to be an approach by which various logs can be integrated into a more comprehensive model. [7]
Another issue is privacy. “The issue revolves around the fact that most users want to maintain strict anonymity on the Web. On the other hand, site administrators are interested in finding out the demographics of users as well as the usage statistics of different sections of their website. W3C has an ongoing initiative for the Platform for Privacy Preferences (P3P). P3P provides a protocol which allows site administrators to publish the privacy policies followed by a site in a machine readable format. When the user visits the site for the first time, the browser reads the privacy policies followed by the site and then compares them with the security settings configured by the user. If the policies are satisfactory the browser continues requesting pages from the site; otherwise a negotiation protocol is used to arrive at a setting which is acceptable to the user”. [30]
Other research areas include Web semantic mining and Web farming.
CONCLUSION
In this paper I have discussed Web mining and its classifications, i.e., content mining and usage mining. The two approaches to content mining have been looked into, and I have given a small overview of crawlers and how they work. Web usage mining has been covered in greater depth than content mining: the four techniques, i.e., path analysis, association rules, clustering and classification, and sequential patterns, have been covered, and I have also briefly described the WEBMINER architecture for Web usage mining. Finally, I have listed the benefits of Web mining and its applications, followed by the open issues of the topics covered and interesting research areas in this field.
BIBLIOGRAPHY
1 : Data Mining: Introductory and Advanced Topics - Margaret Dunham (textbook).
2 : Yan Wang. Web mining and knowledge discovery of usage patterns - A survey. http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang-slides.pdf
3 : Oleksandr Romanko, McMaster University. Web mining. http://www.cas.mcmaster.ca/~cs4tf3/romanko_slides.pdf
4 : May Chau. Web mining technology and academic librarianship: Human-machine connections for the twenty-first century. http://www.firstmonday.dk/issues/issue4_6/chau/
5 : Cooley, Bamshad and Jaideep, 1997. op. cit. at http://www-users.cs.umn.edu/~mobasher/webminer/survey/survey.html
6 : Oren Etzioni, 1996. "The World Wide Web: Quagmire or Gold Mine," Communications of the ACM, volume 39, number 11 (November), pp. 65-68.
7 : R. Cooley, Bamshad Mobasher, and J. Srivastava, 1997. Web Mining: Information and Pattern Discovery on the World Wide Web.
8 : A. Joshi. Web mining. http://www.cs.umbc.edu/~ajoshi/web-mine/
9 : Bamshad Mobasher. WEBMINER: A System for Pattern Discovery from World Wide Web Transactions. http://maya.cs.depaul.edu/~mobasher/Research-01.html
10 : BDH94 C. M. Brown, B. B. Danzig, D. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. In Proc. 2nd International World Wide Web Conference, 1994.
11 : HBML95 K. Hammond, R. Burke, C. Martin, and S. Lytinen. FAQ-Finder: A case-based approach to knowledge navigation. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press, 1995.
12 : WVS96 R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Hypertext'96: The Seventh ACM Conference on Hypertext, 1996.
13 : PA97 P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In Proceedings of the Workshop on the Management of Semistructured Data (in conjunction with ACM SIGMOD), 1997.
14 : KS95 D. Konopnicki and O. Shmueli. W3QS: A query system for the World Wide Web. In Proc. of the 21st VLDB Conference, pages 54-65, Zurich, 1995.
15 : Hypertext and information retrieval and Web mining. http://www.cyberartsweb.org/cpace/ht/lanman/wm1.htm
16 : [Madria 1999] S.K. Madria, S.S. Bhowmick, W.K. Ng, and E.P. Lim. Research issues in Web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK'99, pages 303-312, 1999.
17 : Pit97 J. Pitkow. In search of reliable usage data on the WWW. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, 1997.
18 : PPR96 P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. In Proc. of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.
19 : CMS97 R. Cooley, B. Mobasher, and J. Srivastava. Grouping Web page references into transactions for mining World Wide Web browsing patterns. Technical Report TR 97-021, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997.
20 : CPY96 M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a Web environment. In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.
21 : AS94 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.
22 : HS95 M. A. W. Houtsma and A. N. Swami. Set-oriented mining for association rules in relational databases. In Proc. of the 11th Int'l Conf. on Data Eng., pages 25-33, Taipei, Taiwan, 1995.
23 : SA95 R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st VLDB Conference, pages 407-419, Zurich, Switzerland, 1995.
25 : Benefits of Web mining. http://www.dmreview.com/editorial/dmdirect/012800_doherty.htm
26 : David W. Cheung, Jiawei Han, Vincent T. Ng, Ada W. Fu, and Yongjian Fu. A Fast Distributed Algorithm for Mining Association Rules.
27 : MTV95 H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the First Int'l Conference on Knowledge Discovery and Data Mining, pages 210-215, Montreal, Quebec, 1995.
28 : SA96 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.
29 : MAR96 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.
30 : R. Cooley, Mukund Deshpande, Pang-Ning Tan, and J. Srivastava. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data.
31 : Benefits of Web data mining. http://www.megaputer.com/products/wa/benefits.php3
32 : Bing Liu, Yiming Ma, and Philip S. Yu. Discovering Unexpected Information from Your Competitors' Web Sites (2001).
33 : Applications. http://neuron.et.ntust.edu.tw/homework/91/NN/91NNHomework2/Web%20mining/web_app.htm
34 : Web applications. http://mail.cs.uiuc.edu/pipermail/colt/2000-October/000164.html