1. INTRODUCTION
You, without doubt, already know what the World Wide Web is and have used it
extensively. The World Wide Web (or the Web for short) has affected almost every
aspect of our lives. It is the biggest and most widely known information source that is
easily accessible and searchable. It consists of billions of interconnected documents
(called Web pages) which are authored by millions of people. Since its inception, the Web
has dramatically changed our information seeking behaviour. Before the Web, finding
information meant asking a friend or an expert, or buying/borrowing a book to read.
However, with the Web, everything is only a few clicks away from the comfort of our
homes or offices. Not only can we find needed information on the Web, but we can also
easily share our information and knowledge with others.
The Web has also become an important channel for conducting businesses. We can
buy almost anything from online stores without needing to go to a physical shop. The Web
also provides convenient means for us to communicate with each other, to express our
views and opinions on anything, and to discuss with people from anywhere in the world.
The Web is truly a virtual society. In this chapter, we introduce the Web, its history, and
the topics that we will discuss in the seminar.
1.1 WHAT IS THE WORLD WIDE WEB?
The World Wide Web is officially defined as a “wide-area hypermedia information
retrieval initiative aiming to give universal access to a large universe of documents.” In
simpler terms, the Web is an Internet-based computer network that allows users of one
computer to access information stored on another through the world-wide network called
the Internet.
The Web's implementation follows a standard client-server model. In this model, a
user relies on a program (called the client) to connect to a remote machine (called the
server) where the data is stored. Navigating through the Web is done by means of a client
program called the browser, e.g., Netscape, Internet Explorer, Firefox, etc. Web browsers
work by sending requests to remote servers for information and then interpreting the
returned documents written in HTML and laying out the text and graphics on the user’s
computer screen on the client side.
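The request/response exchange described above can be sketched with Python's standard library. This is only an illustration under stated assumptions: the page content `PAGE`, the handler class `PageHandler` and the helper `fetch_from_local_server` are hypothetical names, and the "server" is a tiny local process rather than a real Web server.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Hello, Web</h1></body></html>"  # hypothetical page content


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET request with the same HTML document."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # suppress request logging to keep the example quiet


def fetch_from_local_server():
    """Start a local server, request one page as a client, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: the OS picks a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: send a request and read the returned HTML document.
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/index.html")
        response = connection.getresponse()
        result = (response.status, response.read())
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status, body.decode())
```

A real browser would go on to parse the returned HTML and lay out the text and graphics; the sketch stops at receiving the document.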
WEB MINING
Page 1
The operation of the Web relies on the structure of its hypertext documents.
Hypertext allows Web page authors to link their documents to other related documents
residing on computers anywhere in the world. To view these documents, one simply
follows the links (called hyperlinks).
Ted Nelson invented the idea of hypertext in 1965 and also created the
well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows
other media (e.g., image, audio and video files) is called hypermedia.
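To make the notion of a hyperlink concrete, the following minimal sketch pulls the link targets out of an HTML fragment with Python's standard `html.parser` module. The class `LinkExtractor`, the helper `extract_links` and the sample markup are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the hyperlink targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


sample = '<p>See <a href="http://example.org/a.html">A</a> and <a href="/b.html">B</a>.</p>'
print(extract_links(sample))  # ['http://example.org/a.html', '/b.html']
```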
1.2 A BRIEF HISTORY OF THE WEB AND THE INTERNET
CREATION OF THE WEB:
The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at
CERN (Conseil Européen pour la Recherche Nucléaire, the European Laboratory for
Particle Physics) in Switzerland. He coined the term “World Wide Web,” wrote the first
World Wide Web server, httpd, and the first client program (a browser and editor),
“WorldWideWeb.”
It began in March 1989 when Tim Berners-Lee submitted a proposal titled
“Information Management: A Proposal” to his superiors at CERN. In the proposal, he
discussed the disadvantages of hierarchical information organization and outlined the
advantages of a hypertext-based system. The proposal called for a simple protocol that
could request information stored in remote systems through networks, and for a scheme by
which information could be exchanged in a common format and documents of individuals
could be linked by hyperlinks to other documents. It also proposed methods for reading
text and graphics using the display technology at CERN at that time. The proposal
essentially outlined a distributed hypertext system, which is the basic architecture of the
Web.
Initially, the proposal did not receive the needed support. However, in 1990,
Berners-Lee re-circulated the proposal and received the support to begin the work. With
this project, Berners-Lee and his team at CERN laid the foundation for the future
development of the Web as a distributed hypertext system. They introduced their server
and browser, the protocol used for communication between clients and the server, the
Hyper Text Transfer Protocol (HTTP), the Hyper Text Markup Language (HTML) used
for authoring Web documents, and the Uniform Resource Locator (URL). And so it
began.
MOSAIC AND NETSCAPE BROWSERS:
The next significant event in the development of the Web was the arrival of
Mosaic. In February 1993, Marc Andreessen from the University of Illinois’ NCSA
(National Center for Supercomputing Applications) and his team released the first
"Mosaic for X" graphical Web browser for UNIX. A few months later, different versions
of Mosaic were released for Macintosh and Windows operating systems. This was an
important event. For the first time, a Web client, with a consistent and simple
point-and-click graphical user interface, was implemented for the three most popular operating
systems available at the time. It soon made big splashes outside the academic circle where
it had begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc
Andreessen, and they founded the company Mosaic Communications (later renamed as
Netscape Communications). Within a few months, the Netscape browser was released to
the public, which started the explosive growth of the Web. Internet Explorer from
Microsoft entered the market in August 1995 and began to challenge Netscape.
The creation of the World Wide Web by Tim Berners-Lee and the subsequent release
of the Mosaic browser are often regarded as the two most significant contributing factors
to the success and popularity of the Web.
INTERNET:
The Web would not be possible without the Internet, which provides the
communication network for the Web to function. The Internet started with the computer
network ARPANET in the Cold War era. It was produced as the result of a project in the
United States aiming at maintaining control over its missiles and bombers after a nuclear
attack. It was supported by the Advanced Research Projects Agency (ARPA), which was part
of the Department of Defense in the United States. The first ARPANET connections were
made in 1969, and in 1972, it was demonstrated at the First International Conference on
Computers and Communication, held in Washington D.C. At the conference, ARPA
scientists linked computers together from 40 different locations.
In 1973, Vinton Cerf and Bob Kahn started to develop the protocol later to be
called TCP/IP (Transmission Control Protocol/Internet Protocol). In the next year, they
published the paper “Transmission Control Protocol”, which marked the beginning of
TCP/IP. This new protocol allowed diverse computer networks to interconnect and
communicate with each other. In subsequent years, many networks were built, and many
competing techniques and protocols were proposed and developed. However, ARPANET
was still the backbone of the entire system. During this period, the network scene was
chaotic. In 1982, TCP/IP was finally adopted, and the Internet, which is a connected
set of networks using the TCP/IP protocol, was born.
SEARCH ENGINES:
With information being shared worldwide, there was a need for individuals to find
information in an orderly and efficient manner. Thus began the development of search
engines. The search system Excite was introduced in 1993 by six Stanford University
students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium
at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994, which
started out as a listing of their favourite Web sites, and offered directory search. In
subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista,
Inktomi, Ask Jeeves, Northern Light, etc.
Google was launched in 1998 by Sergey Brin and Larry Page based on their
research project at Stanford University. Microsoft started to commit to search in 2003,
and launched the MSN search engine in spring 2005; before that, it used search engines
from other companies. Yahoo! provided a general search capability in 2004 after it
purchased Inktomi in 2003.
W3C (THE WORLD WIDE WEB CONSORTIUM):
W3C was formed in December 1994 by MIT and CERN as an international
organization to lead the development of the Web. W3C's main objective was “to promote
standards for the evolution of the Web and interoperability between WWW products by
producing specifications and reference software.” The first International Conference on
World Wide Web (WWW) was also held in 1994 and has been a yearly event ever
since. From 1995 to 2001, the growth of the Web boomed. Investors saw commercial
opportunities and became involved. Numerous businesses started on the Web, which led to
irrational developments. Finally, the bubble burst in 2001. However, the development of
the Web did not stop; it has only become more rational since.
1.3 WEB DATA MINING
The rapid growth of the Web in the last decade has made it the largest publicly
accessible data source in the world. The Web has many unique characteristics, which
make mining useful information and knowledge a fascinating and challenging task. Let us
review some of these characteristics.
1. The amount of data/information on the Web is huge and still growing. The coverage of
the information is also very wide and diverse. One can find information on almost
anything on the Web.
2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages,
unstructured texts, and multimedia files (images, audio, and video).
3. Information on the Web is heterogeneous. Due to the diverse authorship of Web pages,
multiple pages may present the same or similar information using completely different
words and/or formats. This makes integration of information from multiple pages a
challenging problem.
4. A significant amount of information on the Web is linked. Hyperlinks exist among Web
pages within a site and across different sites. Within a site, hyperlinks serve as information
organization mechanisms. Across different sites, hyperlinks represent implicit conveyance
of authority to the target pages. That is, those pages that are linked (or pointed) to by many
other pages are usually high quality pages or authoritative pages simply because many
people trust them.
5. The information on the Web is noisy. The noise comes from two main sources. First, a
typical Web page contains many pieces of information, e.g., the main content of the page,
navigation links, advertisements, copyright notices, privacy policies, etc. For a particular
application, only part of the information is useful. The rest is considered noise. To perform
fine-grain Web information analysis and data mining, the noise should be removed.
Second, due to the fact that the Web does not have quality control of information, i.e., one
can write almost anything that one likes, a large amount of information on the Web is of
low quality, erroneous, or even misleading.
6. The Web is also about services. Most commercial Web sites allow people to perform
useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.
7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the
change and monitoring the change are important issues for many applications.
8. The Web is a virtual society. The Web is not only about data, information and services,
but also about interactions among people, organizations and automated systems. One can
communicate with people anywhere in the world easily and instantly, and also express
one’s views on anything in Internet forums, blogs and review sites.
All these characteristics present both challenges and opportunities for mining and
discovery of information and knowledge from the Web.
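Characteristic 4 above (hyperlinks as an implicit conveyance of authority) can be illustrated by counting how many distinct pages link to each page. This is a deliberately crude indicator and not one of the link-analysis algorithms used by real search engines; the function `count_in_links` and the toy graph `web` are hypothetical.

```python
from collections import Counter


def count_in_links(link_graph):
    """Map each page to the number of distinct other pages that link to it."""
    in_links = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):  # count each linking page only once
            if target != source:     # ignore links from a page to itself
                in_links[target] += 1
    return in_links


# Toy link graph: page -> pages it links to (hypothetical data).
web = {
    "a.html": ["c.html", "b.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html"],
}

in_link_counts = count_in_links(web)
print(in_link_counts.most_common(1))  # [('c.html', 3)]
```

Here `c.html` is pointed to by three different pages, so under this simple measure it would be treated as the most authoritative page of the four.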
To explore information mining on the Web, it is necessary to know data mining,
which has been applied in many Web mining tasks. However, Web mining is not entirely
an application of data mining. Due to the richness and diversity of information and other
Web specific characteristics discussed above, Web mining has developed many of its own
algorithms.
1.3.1 WHAT IS DATA MINING?
Data mining is also called knowledge discovery in databases (KDD). It is
commonly defined as the process of discovering useful patterns or knowledge from data
sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially
useful, and understandable. Data mining is a multi-disciplinary field involving machine
learning, statistics, databases, artificial intelligence, information retrieval, and
visualization. There are many data mining tasks. Some of the common ones are supervised
learning (or classification), unsupervised learning (or clustering), association rule mining,
and sequential pattern mining. We will discuss all of them in this seminar.
A data mining application usually starts with an understanding of the application
domain by data analysts (data miners), who then identify suitable data sources and the
target data. With the data, data mining can be performed, which is usually carried out in
three main steps:
• Pre-processing: The raw data is usually not suitable for mining due to various reasons.
It may need to be cleaned in order to remove noise or abnormalities. The data may also
be too large and/or involve many irrelevant attributes, which calls for data reduction
through sampling and attribute selection.
• Data mining: The processed data is then fed to a data mining algorithm which will
produce patterns or knowledge.
• Post-processing: In many applications, not all discovered patterns are useful. This step
identifies the useful ones for applications. Various evaluation and visualization
techniques are used to make the decision.
The whole process (also called the data mining process) is almost always iterative.
It usually takes many rounds to achieve final satisfactory results, which are then
incorporated into real-world operational tasks. Traditional data mining uses structured data
stored in relational tables, spreadsheets, or flat files in the tabular form. With the growth
of the Web and text documents, Web mining and text mining are becoming increasingly
important and popular.
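The three steps described above (pre-processing, data mining, post-processing) can be sketched as a small pipeline. Everything here is an assumption for illustration: the record layout, the function names, and the use of simple frequency counting as a stand-in for a real mining algorithm.

```python
from collections import Counter


def preprocess(raw_records):
    """Cleaning and attribute selection: drop incomplete rows, keep the 'item' attribute."""
    return [record["item"].strip().lower() for record in raw_records if record.get("item")]


def mine(items):
    """Stand-in 'mining algorithm': count how often each item occurs."""
    return Counter(items)


def postprocess(patterns, min_count):
    """Keep only the patterns frequent enough to be useful for the application."""
    return {item: count for item, count in patterns.items() if count >= min_count}


# Hypothetical raw data with noise: inconsistent case, stray spaces, missing values.
raw = [
    {"item": "Milk ", "session": 1},
    {"item": "bread", "session": 2},
    {"item": None, "session": 3},
    {"item": "milk", "session": 4},
    {"item": "", "session": 5},
]

result = postprocess(mine(preprocess(raw)), min_count=2)
print(result)  # {'milk': 2}
```

In practice the process is iterative: the threshold or the cleaning rules would be adjusted and the pipeline rerun until the results are satisfactory.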
1.3.2 WHAT IS WEB MINING?
Web mining aims to discover useful information or knowledge from the Web
hyperlink structure, page content, and usage data. Although Web mining uses many data
mining techniques, as mentioned above it is not purely an application of traditional data
mining due to the heterogeneity and semi-structured or unstructured nature of the Web
data. Many new mining tasks and algorithms were invented in the past decade. Based on
the primary kinds of data used in the mining process, Web mining tasks can be categorized
into three types: Web structure mining, Web content mining and Web usage mining.
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services. Web mining can be
decomposed into the following subtasks:
1. Resource finding: The task of retrieving intended Web documents.
2. Information selection and pre-processing: Automatically selecting and pre-processing
specific information from retrieved Web resources.
3. Generalization: Automatically discovering general patterns at individual Web sites as well
as across multiple sites.
4. Analysis: Validation and/or interpretation of the mined patterns.
Resource finding is the process of retrieving data from text sources available on the
Web such as electronic magazines and newsletters or text contents of HTML documents.
The information selection and pre-processing step is any transformation of the original
data retrieved in the information retrieval (IR) process. These transformations cover
removing stop words, finding phrases in the training corpus, transforming the
representation to relational or first order logic form, etc. Data mining techniques and
machine learning are often used for generalization.
People play a very important role in the information and knowledge discovery process.
This is especially important for validation and/or interpretation in the last step.
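One of the pre-processing transformations mentioned above, stop word removal, might look like the following sketch. The stop word list is a tiny illustrative sample and `preprocess_text` is a hypothetical helper; real systems use much longer lists and more careful tokenisation.

```python
import re

# A tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess_text(text):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess_text("The Web is a source of information."))
# ['web', 'source', 'information']
```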
1.4 WEB MINING CATEGORIES
Web mining is categorized into three areas of interest based on which part of the Web
is mined:
1. Web content mining
• describes discovery of useful information from contents, data and
documents
• two different points of view: IR view and DB view
2. Web structure mining
• model of link structures, topology of hyperlinks
• categorizing of web pages
3. Web usage mining
• mines secondary data derived from user interactions
Web content mining is the process of extracting knowledge from the content of
documents or their descriptions. Web structure mining is the process of inferring
knowledge from the Web organization and links between references and referents in the
Web. Finally, Web usage mining, also known as Web Log Mining, is the process of
extracting interesting patterns in Web access logs.
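As a small illustration of Web usage mining, the sketch below counts successful page requests in a few access-log lines. It assumes the logs follow the widely used Common Log Format; the regular expression `LOG_PATTERN`, the function `page_hits` and the sample lines are all made up for this example.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)


def page_hits(log_lines):
    """Count successful (status 200) requests per requested path."""
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits


sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 1045',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.3 - - [10/Oct/2000:13:58:40 -0700] "GET /missing.html HTTP/1.0" 404 209',
]

hits = page_hits(sample_log)
print(hits.most_common())  # [('/index.html', 2), ('/products.html', 1)]
```

Counting hits is only the simplest usage pattern; the same parsed records could feed sessionisation or sequential pattern mining.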
In this seminar, we will discuss all these three types of mining. However, due to
the richness and diversity of information on the Web, there are a large number of Web
mining tasks. We will not be able to cover them all. We will only focus on some important
tasks and their algorithms.
The Web mining process is similar to the data mining process. The difference is
usually in the data collection. In traditional data mining, the data is often already collected
and stored in a data warehouse. For Web mining, data collection can be a substantial task,
especially for Web structure and content mining, which involves crawling a large number
of target Web pages. We will devote a whole chapter to crawling.
Once the data is collected, we go through the same three-step process: data
pre-processing, Web data mining and post-processing. However, the techniques used for each
step can be quite different from those used in traditional data mining.
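The data collection step for Web structure and content mining can be pictured as a breadth-first crawl. The sketch below is a simplification under stated assumptions: instead of fetching real pages it walks a hypothetical in-memory site (`SITE`), and the names `crawl` and `links_of` are invented for illustration. A real crawler would also need politeness delays, robots.txt handling and error handling.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from seed; returns the pages in the order they were visited."""
    visited = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical site: page -> outgoing links. "/orphan" is not reachable from "/".
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": [], "/orphan": ["/"]}


def links_of(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return SITE.get(url, [])


order = crawl("/", links_of)
print(order)  # ['/', '/a', '/b', '/c']
```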
2. DATA MINING
Data mining has attracted a great deal of attention in the information industry and
in society as a whole in recent years, due to the wide availability of huge amounts of data
and the imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from market
analysis, fraud detection, and customer retention, to production control and science
exploration.
Data mining can be viewed as a result of the natural evolution of information
technology. The database system industry has witnessed an evolutionary path in the
development of the following functionalities: data collection and database creation, data
management (including data storage and retrieval, and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). For instance,
the early development of data collection and database creation mechanisms served as a
prerequisite for later development of effective mechanisms for data storage and retrieval,
and query and transaction processing. With numerous database systems offering query and
transaction processing as common practice, advanced data analysis has naturally become
the next target.
Since the 1960s, database and information technology has been evolving
systematically from primitive file processing systems to sophisticated and powerful
database systems. The research and development in database systems since the 1970s has
progressed from early hierarchical and network database systems to the development of
relational database systems, data modelling tools, and indexing and accessing methods. In
addition, users gained convenient and flexible data access through query languages, user
interfaces, optimized query processing, and transaction management. Efficient methods
for on-line transaction processing (OLTP), where a query is viewed as a read-only
transaction, have contributed substantially to the evolution and wide acceptance of
relational technology as a major tool for efficient storage, retrieval, and management of
large amounts of data.
Database technology since the mid-1980s has been characterized by the popular
adoption of relational technology and an upsurge of research and development activities
on new and powerful database systems. These promote the development of advanced data
models such as extended-relational, object-oriented, object-relational, and deductive
models. Application-oriented database systems, including spatial, temporal, multimedia,
active, stream, and sensor, and scientific and engineering databases, knowledge bases, and
office information bases, have flourished. Issues related to the distribution, diversification,
and sharing of data have been studied extensively. Heterogeneous database systems and
Internet-based global information systems such as the World Wide Web (WWW) have
also emerged and play a vital role in the information industry.
The steady and amazing progress of computer hardware technology in the past
three decades has led to large supplies of powerful and affordable computers, data
collection equipment, and storage media. This technology provides a great boost to the
database and information industry, and makes a huge number of databases and information
repositories available for transaction management, information retrieval, and data analysis.
Data can now be stored in many different kinds of databases and information
repositories. One data repository architecture that has emerged is the data warehouse, a
repository of multiple heterogeneous data sources organized under a unified schema at a
single site in order to facilitate management decision making. Data warehouse technology
includes data cleaning, data integration, and on-line analytical processing (OLAP), that is,
analysis techniques with functionalities such as summarization, consolidation, and
aggregation as well as the ability to view information from different angles. Although
OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis, such as data classification, clustering, and
the characterization of data changes over time. In addition, huge volumes of data can be
accumulated beyond databases and data warehouses. Typical examples include the World
Wide Web and data streams, where data flow in and out like streams, as in applications
like video surveillance, telecommunication, and sensor networks. The effective and
efficient analysis of data in such different forms becomes a challenging task.
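The OLAP-style summarization and aggregation mentioned above can be sketched as a roll-up of a measure over chosen dimensions. The function `roll_up` and the `sales` records are hypothetical; a real data warehouse would perform this with its own query engine.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum a numeric measure for every combination of the given dimension values."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[dimension] for dimension in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact table: one row per sale.
sales = [
    {"region": "north", "quarter": "Q1", "amount": 100.0},
    {"region": "north", "quarter": "Q2", "amount": 150.0},
    {"region": "south", "quarter": "Q1", "amount": 80.0},
    {"region": "south", "quarter": "Q1", "amount": 20.0},
]

by_region = roll_up(sales, ["region"], "amount")
by_region_quarter = roll_up(sales, ["region", "quarter"], "amount")
print(by_region)  # {('north',): 250.0, ('south',): 100.0}
```

Choosing a different list of dimensions gives a view of the same data "from a different angle", which is the kind of analysis OLAP tools support.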
The abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data rich but information poor situation. The fast-growing, tremendous
amount of data, collected and stored in large and numerous data repositories, has far
exceeded our human ability for comprehension without powerful tools. As a result, data
collected in large data repositories become “data tombs”—data archives that are seldom
visited. Consequently, important decisions are often made based not on the
information-rich data stored in data repositories, but rather on a decision maker’s intuition, simply
because the decision maker does not have the tools to extract the valuable knowledge
embedded in the vast amounts of data. In addition, consider expert system technologies,
which typically rely on users or domain experts to manually input knowledge into
knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is
extremely time-consuming and costly. Data mining tools perform data analysis and may
uncover important data patterns, contributing greatly to business strategies, knowledge
bases, and scientific and medical research. The widening gap between data and
information calls for a systematic development of data mining tools that will turn data
tombs into “golden nuggets” of knowledge.
Figure 1: DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE
DISCOVERY
2.1 DATA MINING BRIEF OVERVIEW
Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data. The term is actually a misnomer. Remember that the mining of gold from
rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data
mining should have been more appropriately named “knowledge mining from data,”
which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not
reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid
term characterizing the process that finds a small set of precious nuggets from a great deal
of raw material. Thus, such a misnomer that carries both “data” and “mining” became a
popular choice. Many other terms carry a similar or slightly different meaning to data
mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as
simply an essential step in the process of knowledge discovery. Knowledge discovery as a
process is depicted in Figure 1 and consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data pre-processing, where the data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base. Note that according to this view, data mining is only one step in the
entire process, albeit an essential one because it uncovers hidden patterns for evaluation.
We agree that data mining is a step in the knowledge discovery process. However,
in industry, in media, and in the database research milieu, the term data mining is
becoming more popular than the longer term of knowledge discovery from data.
Therefore, in this seminar, we choose to use the term data mining. We adopt a broad view of
data mining functionality: data mining is the process of discovering interesting knowledge
from large amounts of data stored in databases, data warehouses, or other information
repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components (Figure 2):
Database, data warehouse, World Wide Web, or other information repository: This is
one or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and metadata
(e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution
analysis.
Figure 2: ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM
Pattern evaluation module: This component typically employs interestingness measures
(Section 1.5) and interacts with the data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep
as possible into the mining process so as to confine the search to only the interesting
patterns.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results. In addition, this component allows the user
to browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
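The advice given for the pattern evaluation module, pushing the interestingness threshold into the mining process, can be illustrated with a minimum-support threshold on item pairs. The sketch assumes support count as the interestingness measure; `frequent_pairs` and the `baskets` data are hypothetical. Items that are themselves infrequent are discarded before any pair is generated, so the search never considers pairs that cannot reach the threshold.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: count} for item pairs found in at least min_support transactions."""
    item_counts = Counter(item for transaction in transactions for item in set(transaction))
    frequent_items = {item for item, count in item_counts.items() if count >= min_support}

    pair_counts = Counter()
    for transaction in transactions:
        # Pruning step: a pair containing an infrequent item cannot be frequent.
        kept = sorted(set(transaction) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= min_support}


baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["jam", "milk"],
    ["bread", "eggs"],
]

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)  # {('bread', 'eggs'): 2, ('bread', 'milk'): 2}
```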
From a data warehouse perspective, data mining can be viewed as an advanced
stage of online analytical processing (OLAP). However, data mining goes far beyond the
narrow scope of summarization-style analytical processing of data warehouse systems by
incorporating more advanced techniques for data analysis.
Although there are many “data mining systems” on the market, not all of them can
perform true data mining. A data analysis system that does not handle large amounts of
data should be more appropriately categorized as a machine learning system, a statistical
data analysis tool, or an experimental system prototype. A system that can only perform
data or information retrieval, including finding aggregate values, or that performs
deductive query answering in large databases should be more appropriately categorized as
a database system, an information retrieval system, or a deductive database system.
2.2 DATA MINING TECHNIQUES
Neural Networks/Pattern Recognition - Neural networks are used in a black-box
fashion. One creates a training data set, lets the neural network learn patterns based on known
outcomes, then sets the neural network loose on huge amounts of data. For example, a
credit card company has 3,000 records, 100 of which are known fraud records. The data
set is used to train the neural network so that it learns the difference between the fraud
records and the legitimate ones. The network learns the patterns of the fraud records. Then
the network is run against the company’s million-record data set, and it flags the
records with patterns the same as or similar to the fraud records. Neural networks are known
for not being very helpful in teaching analysts about the data; they just find patterns that
match. Neural networks have been used for optical character recognition to help the Post
Office automate the delivery process without having to use humans to read addresses.
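The train-then-apply workflow described above can be sketched with a perceptron, which is a single artificial neuron and therefore the simplest possible stand-in for a neural network, not a realistic fraud model. The two features (a scaled transaction amount and a foreign-transaction flag), the labels and all function names are assumptions made for this illustration.

```python
def predict(weights, bias, features):
    """Return 1 (fraud) if the weighted sum is positive, otherwise 0 (legitimate)."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=200, learning_rate=0.1):
    """Learn weights from labelled records using the perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # -1, 0 or +1
            if error:
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every training record is classified correctly
            break
    return weights, bias


# Hypothetical training records: (scaled amount, foreign flag); label 1 means known fraud.
train_x = [(0.1, 0), (0.2, 1), (0.3, 0), (0.8, 1), (0.9, 0), (0.7, 1)]
train_y = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(train_x, train_y)

# "Set the network loose" on records it has not seen before.
print(predict(weights, bias, (0.75, 1)), predict(weights, bias, (0.2, 0)))
```

As the text notes, the learned weights say little about why a record looks fraudulent; the model only flags records that resemble the known fraud cases.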
Memory Based Reasoning - This technique produces results similar to those of a neural
network but goes about it differently. MBR looks for "neighbouring" data, rather than
patterns. If you look at insurance claims and want to know which claims the adjudicators
should look at and which they can just let go through the system, you would set up a set of
claims you want adjudicated and let the technique find similar claims.
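The neighbour idea can be illustrated with a minimal nearest-neighbour sketch. The claim features, the labels and the function name `knn_predict` are hypothetical and chosen only for illustration; a real MBR tool would use many more attributes and a tuned distance measure.

```python
import math
from collections import Counter

def knn_predict(labeled, query, k=3):
    """Label `query` by majority vote among its k nearest labeled neighbours."""
    # Sort the known cases by Euclidean distance to the query and keep the k closest.
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical claims: (claim amount in thousands, number of prior claims) -> decision.
CLAIMS = [
    ((1.0, 0), "auto-approve"),
    ((1.5, 1), "auto-approve"),
    ((2.0, 0), "auto-approve"),
    ((9.0, 4), "adjudicate"),
    ((8.5, 5), "adjudicate"),
    ((10.0, 3), "adjudicate"),
]

print(knn_predict(CLAIMS, (9.2, 4)))   # a large claim with several prior claims
print(knn_predict(CLAIMS, (1.2, 0)))   # a small claim with no prior claims
```

A new claim is routed the same way as the stored claims it most resembles, which is why no explicit pattern or rule has to be learned first.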
Cluster Detection/Market Basket Analysis - This is where the classic beer/diapers
bought-together analysis came from. It finds groupings. Basically, this technique finds
relationships among products, customers, or wherever else you want to find associations in data.
Link Analysis - This is another technique for associating like records. It is not used
much, but there are some tools created just for this. As the name suggests, the technique
tries to find links among customers, transactions, and so on, and to demonstrate those links.
Visualization - This technique helps users understand their data. Visualization bridges
text-based and graphical presentation. Such things as decision tree, rule, cluster
and pattern visualization help users see data relationships rather than read about them.
Many of the stronger data mining programs have made strides in improving their visual
content over the past few years. This is really the vision of the future of data mining and
analysis. Data volumes have grown to such huge levels that it will soon be impossible for
humans to process them effectively by any text-based method. We will probably see an
approach to data mining using visualization appear that will be something like Microsoft’s
Photosynth. The technology is there; it will just take an analyst with some vision to sit
down and put it together.
Decision Tree/Rule Induction - Decision trees use real data mining algorithms. Decision
trees help with classification and spit out information that is very descriptive, helping
users to understand their data. A decision tree process will generate the rules followed in a
process. For example, a lender at a bank goes through a set of rules when approving a
loan. Based on the loan data a bank has, the outcomes of the loans (default or paid), and
limits of acceptable levels of default, the decision tree can set up the guidelines for the
lending institution. These decision trees are very similar to the first decision support (or
expert) systems.
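As a rough illustration of how a decision tree derives a lending rule from loan outcomes, the sketch below finds the single best split on one attribute by minimizing Gini impurity. The attribute `debt_ratio`, the loan records and the function names are assumptions made for this example; a full decision tree would repeat the search recursively over several attributes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(rows, feature):
    """Return (impurity, threshold) for the split `feature <= threshold` with lowest impurity."""
    best = (float("inf"), None)
    for t in sorted({r[feature] for r in rows}):
        left = [r["outcome"] for r in rows if r[feature] <= t]
        right = [r["outcome"] for r in rows if r[feature] > t]
        if not left or not right:
            continue  # a split must put records on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[0]:
            best = (score, t)
    return best

# Hypothetical loan history: debt-to-income ratio and the known outcome of each loan.
LOANS = [
    {"debt_ratio": 0.10, "outcome": "paid"},
    {"debt_ratio": 0.20, "outcome": "paid"},
    {"debt_ratio": 0.30, "outcome": "paid"},
    {"debt_ratio": 0.50, "outcome": "default"},
    {"debt_ratio": 0.60, "outcome": "default"},
    {"debt_ratio": 0.70, "outcome": "default"},
]

score, threshold = best_threshold(LOANS, "debt_ratio")
print(f"split at debt_ratio <= {threshold} (weighted impurity {score:.2f})")
```

The chosen threshold reads directly as a guideline ("approve when the ratio is at or below the threshold"), which is the descriptive quality the paragraph attributes to decision trees.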
Genetic Algorithms - GAs are techniques that act like bacteria growing in a Petri dish.
You set up a data set, then give the GA the ability to try different things and to judge
whether a direction or outcome is favourable. The GA will move in a direction that will
hopefully optimize the final result. GAs are used mostly for process optimization, such as
scheduling, workflow, batching, and process re-engineering. Think of a GA as simulations
run over and over to find optimal results, together with the infrastructure needed both to
run the simulations and to define which results are optimal.
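A minimal sketch of that loop is given below, assuming a toy problem in which a candidate is a list of bits and the fitness to maximize is simply the number of 1 bits. The function name `evolve`, the population size, the mutation rate and the fixed random seed are all illustrative choices, not part of any particular GA tool.

```python
import random

def evolve(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Evolve bit lists towards higher `fitness` with selection, crossover and mutation."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)        # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                # occasional mutation: flip one bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve(sum)   # fitness = number of 1 bits in the candidate
print(best, sum(best))
```

Because the fitter half of each generation is carried over unchanged, the best fitness found never decreases from one generation to the next; a scheduling or batching problem would differ only in how a candidate is encoded and scored.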
OLAP – Online Analytical Processing. OLAP allows users to browse data following
logical questions about the data. OLAP generally includes the ability to drill down into
data, moving from highly summarized views of data into more detailed views. This is
generally achieved by moving along hierarchies of data. For example, if one were
analyzing populations, one could start with the most populous continent, then drill down
to the most populous country, then to the state level, then to the city level, then to the
neighbourhood level. OLAP also includes browsing up hierarchies (drill up), across
different dimensions of data (drill across), and many other advanced techniques for
browsing data, such as automatic time variation when drilling up or down time hierarchies.
OLAP is by far the most implemented and used technique. It is also generally the most
intuitive and easy to use.
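The drill-down described above amounts to re-aggregating the same facts at a finer level of a hierarchy. The sketch below assumes a tiny in-memory fact table with a continent, country, city hierarchy; every name and figure in it is made up, and the helper `aggregate` is hypothetical rather than part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (continent, country, city, population in thousands).
FACTS = [
    ("Northland", "Avaria", "Port Ava", 900),
    ("Northland", "Avaria", "Avaton", 400),
    ("Northland", "Borland", "Borville", 700),
    ("Southland", "Cestia", "Cesti City", 300),
    ("Southland", "Dorania", "Doran", 500),
]
LEVELS = ("continent", "country", "city")

def aggregate(rows, level, within=()):
    """Total population at `level`, restricted to the hierarchy prefix `within`."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for row in rows:
        if row[:len(within)] == tuple(within):
            totals[row[:depth]] += row[-1]
    return dict(totals)

by_continent = aggregate(FACTS, "continent")            # most summarized view
top = max(by_continent, key=by_continent.get)           # most populous continent
by_country = aggregate(FACTS, "country", within=top)    # drill down one level
print(by_continent)
print(by_country)
```

Drilling up is the reverse step: calling `aggregate` again at a coarser level with a shorter `within` prefix.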
2.3 KIND OF DATA
We examine a number of different data repositories on which mining can be
performed. In principle, data mining should be applicable to any kind of data repository, as
well as to transient data, such as data streams. Thus the scope of our examination of data
repositories will include relational databases, data warehouses, transactional databases,
advanced database systems, flat files, data streams, and the World Wide Web. Advanced
database systems include object-relational databases and specific application-oriented
databases, such as spatial databases, time-series databases, text databases, and multimedia
databases. The challenges and techniques of mining may differ for each of the repository
systems.
2.3.1 RELATIONAL DATABASES
A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs involve mechanisms for the definition
of database structures; for data storage; for concurrent, shared, or distributed data access;
and for ensuring the consistency and security of the information stored, despite system
crashes or attempts at unauthorized access.
2.3.2 DATA WAREHOUSES
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site. Data warehouses are
constructed via a process of data cleaning, data integration, data transformation, data
loading, and periodic data refreshing. To facilitate decision making, the data in a data
warehouse are organized around major subjects, such as customer, item, supplier, and
activity. The actual physical structure of a data warehouse may be a relational data store or
a multidimensional data cube. A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized data.
2.3.3 TRANSACTIONAL DATABASES
In general, a transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction. The transactional database may have
additional tables associated with it, which contain other information.
2.4 CLASSIFICATION OF DATA MINING SYSTEMS
Data mining is an interdisciplinary field, the confluence of a set of disciplines,
including database systems, statistics, machine learning, visualization, and information
science. Moreover, depending on the data mining approach used, techniques from other
disciplines may be applied, such as neural networks, fuzzy and/or rough set theory,
knowledge representation, inductive logic programming, or high-performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web
technology, economics, business, bioinformatics, or psychology. Because of the diversity
of disciplines contributing to data mining, data mining research is expected to generate a
large variety of data mining systems. Therefore, it is necessary to provide a clear
classification of data mining systems, which may help potential users distinguish between
such systems and identify those that best match their needs. Data mining systems can be
categorized according to various criteria, as follows:
Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications
involved), each of which may require its own data mining technique. Data mining systems
can therefore be classified accordingly. For instance, if classifying according to data
models, we may have a relational, transactional, object-relational, or data warehouse
mining system. If classifying according to the special types of data handled, we may have
a spatial, time-series, text, stream data, multimedia data mining system, or a World Wide
Web mining system.
Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A
comprehensive data mining system usually provides multiple and/or integrated data
mining functionalities.
Classification according to the kinds of techniques utilized: Data mining systems can
be categorized according to the underlying data mining techniques employed. These
techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems) or the
methods of data analysis employed (e.g., database-oriented or data warehouse-oriented
techniques, machine learning, statistics, visualization, pattern recognition, neural
networks, and so on). A sophisticated data mining system will often adopt multiple data
mining techniques or work out an effective, integrated technique that combines the merits
of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems
may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail,
and so on. Different applications often require the integration of application-specific
methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific
mining tasks.
2.5 DATA MINING TASK PRIMITIVES
Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed. A data mining task can be specified in the
form of a data mining query, which is input to the data mining system. A data mining
query is defined in terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery in order to direct
the mining process, or examine the findings from different angles or depths. The data
mining primitives specify the following.
The set of task-relevant data to be mined: This specifies the portions of the database or
the set of data in which the user is interested. This includes the database attributes or data
warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about
the domain to be mined is useful for guiding the knowledge discovery process and for
evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs
regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: They may be
used to guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
3. WEB MINING
The following figure briefly shows the architecture of web mining. It is divided
into two stages: stage 1 contains all the data preparation and stage 2 contains the analysis.
According to the analysis target, web mining can be divided into three different types,
which are web usage mining, web content mining and web structure mining.
[Figure placeholder: WEB MINING ARCHITECTURE. Stage 1: server data log, data cleaning,
clean log, transaction identification, data integration, transformation, transaction data,
registration data, documents and usage attributes. Stage 2: pattern discovery (path
analysis, association rules, sequential patterns, clusters and classification rules) and
pattern analysis (OLAP/visualisation tools, database query language, knowledge query
mechanism, intelligent agent).]
Figure 3: ARCHITECTURE OF WEB MINING
Data mining is the nontrivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data (Fayyad). The most commonly used
techniques in data mining are artificial neural networks, decision trees, genetic algorithms,
the nearest neighbour method, and rule induction. Data mining research has drawn on a
number of other fields such as inductive learning, machine learning and statistics.
Machine learning – is the automation of a learning process, where learning is based on
observations of environmental statistics and transitions. Machine learning examines
previous examples and their outcomes and learns how to reproduce these and make
generalizations about new cases.
Inductive learning – Induction means inference of information from data, and inductive
learning is a model building process where the database is analyzed to find patterns. The
main strategies are supervised learning and unsupervised learning.
Statistics – is used to detect unusual patterns and explain patterns using statistical
models such as linear models.
A data mining model can be either a discovery model, in which the system automatically
discovers important information hidden in the data, or a verification model, which takes a
hypothesis from the user and tests its validity against the data.
The web contains a collection of pages that includes countless hyperlinks and huge
volumes of access and usage information. Because of the ever-increasing amount of
information in cyberspace, knowledge discovery and web mining are becoming critical for
successfully conducting business in the cyber world. Web mining is the discovery and
analysis of useful information from the web. Web mining is the use of data mining
techniques to automatically discover and extract information from web documents and
services (content, structure, and usage).
3.1 APPROACHES OF WEB MINING
Two different approaches were taken in initially defining web mining.
i. Process-centric view – web mining as a sequence of tasks.
ii. Data-centric view – web mining defined in terms of the web data that was being
used in the mining process.
3.2 MINING TECHNIQUES
The important data mining techniques applied in the web domain include
Association Rule, Sequential pattern discovery, clustering, path analysis, classification and
outlier discovery.
1. Association Rule Mining: Predicts the association and correlation among sets of
items, “where the presence of one set of items in a transaction implies (with a
certain degree of confidence) the presence of other items.” That is,
1) Discovers the correlations between pages that are most often referenced together
in a single server session/user session.
2) Provides the information:
i. What are the set of pages frequently accessed together by web users?
ii. What page will be fetched next?
iii. What are paths frequently accessed by web users?
3) Associations and correlations:
i. Page association from usage data – user sessions, user transactions.
ii. Page associations from content data – similarity based on content analysis
iii. Page associations based on structure -- link connectivity between pages.
Advantages:
• Guide for web site restructuring – by adding links that interconnect pages often
viewed together.
• Improve the system performance by prefetching web data.
2. Sequential pattern discovery: Applied to web access server transaction logs. The
purpose is to discover sequential patterns that indicate user visit patterns over a
certain period. That is, the order in which URLs tend to be accessed.
Advantages:
• Useful user trends can be discovered.
• Predictions concerning visit patterns can be made.
• Website navigation can be improved.
• Advertisements can be personalized.
• The link structure can be dynamically reorganized and web site contents adapted to
individual client requirements, or clients can be provided with automatic
recommendations that best suit customer profiles.
3. Clustering: Groups together items (users, pages, etc.) that have similar
characteristics.
a) Page clusters: groups of pages that seem to be conceptually related according to
users’ perception.
b) User clusters: groups of users that seem to behave similarly when navigating
through a web site.
4. Classification: Maps a data item into one of several predetermined classes.
Example: describing each user’s category using profiles. Classification algorithms
include decision trees, naïve Bayesian classifiers, and neural networks.
5. Path Analysis: A technique that involves the generation of some form of graph
that represents relations defined on web pages. This can be the physical layout of
a web site in which the web pages are nodes and links between these pages are
directed edges. Most graphs are involved in determining frequent traversal
patterns, that is, the more frequently visited paths in a web site.
Example: What paths do users traverse before they go to a particular URL?
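The first two techniques in the list above can be sketched on a handful of sessions. The code below is a simplified illustration: the session data, the thresholds and the function names are assumptions, rules are limited to page pairs, and sequences are limited to pages requested directly one after another, whereas full algorithms handle larger itemsets and gaps between pages.

```python
from collections import Counter
from itertools import combinations

def association_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for page pairs seen together in sessions."""
    n = len(sessions)
    page_count, pair_count = Counter(), Counter()
    for session in sessions:
        pages = sorted(set(session))              # ignore order and repeats
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = []
    for (a, b), together in pair_count.items():
        support = together / n                    # share of sessions containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / page_count[lhs]   # share of lhs sessions that also have rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

def frequent_sequences(sessions, length=2, min_count=2):
    """Count contiguous page sequences, once per session, and keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        counts.update({tuple(session[i:i + length])
                       for i in range(len(session) - length + 1)})
    return {seq: c for seq, c in counts.items() if c >= min_count}

# Hypothetical user sessions, each an ordered list of requested pages.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]

print(association_rules(SESSIONS))
print(frequent_sequences(SESSIONS))
```

In this toy data every session that contains `/cart` also contains `/products`, so the only rule that clears both thresholds is `/cart` implies `/products`, the kind of finding that would justify prefetching or an extra link.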
To use data mining on our web site, we have to establish and record visitor and
item characteristics, and visitor interactions.
Visitor characteristics include:
i. Demographics – are tangible attributes such as home address, income,
property, etc.
ii. Psychographics – are personality types such as early technology interest,
buying tendencies.
iii. Technographics – are attributes of the visitor’s system, such as operating
system, browser, and modem speed.
Item characteristics include:
i. Web content information – media type, content category, URL.
ii. Product information - product category, colour, size, price
Visitor interactions include:
i. Visitor item interactions include purchase history, advertising history, and
preference information.
ii. Visitor site statistics are per session characteristics, such as total time,
pages viewed, and so on.
We have a lot of information about web visitors and content, but we probably are
not making the best use of it. The existing OLAP systems can report only on directly
observed and easily correlated information. They rely on users to discover patterns and
decide what to do with them. The information is often too complex for humans to discover
these patterns, even with an OLAP system. To solve these problems, data mining techniques
are utilized.
The scope of data mining is
i. Automated prediction of trends and behaviours.
ii. Automated discovery of previously unknown patterns.
Web mining searches for
i. Web access patterns,
ii. Web structure,
iii. Regularity and dynamics of web contents.
The web mining research is a converging research area from several research
communities, such as database, information retrieval, and AI research communities,
especially from machine learning and natural language processing. The World Wide Web is a
popular and interactive medium for gathering information today. The WWW provides every
Internet citizen with access to an abundance of information. Users encounter the following
problems when interacting with the web.
i. Finding relevant information (information overload – only a small portion of
the web pages contain truly relevant/useful information):
a) Low precision (the abundance problem – 99% of information is of no interest
to 99% of people), which is due to the irrelevance of many of the search
results. This makes the relevant information difficult to find.
b) Low recall (limited coverage of the web – Internet sources hidden behind
search interfaces), which is due to the inability to index all the information
available on the web. This makes relevant but unindexed information difficult
to find.
ii. Discovery of existing but “hidden” knowledge (retrieve 1/3rd of the “indexable
web”).
iii. Personalization of the information (type & presentation of information):
limited customization to individual users.
iv. Learning about customers/individual users.
v. Lack of feedback on human activities.
vi. Lack of multidimensional analysis and data mining support.
vii. The web constitutes a highly dynamic information source. Not only does the
web continue to grow rapidly, the information it holds also receives constant
updates. News, stock market, service centre, and corporate sites revise their
web pages regularly. Linkage information and access records also undergo
frequent updates.
viii. The web serves a broad spectrum of user communities. The Internet’s rapidly
expanding user community connects millions of workstations and has widely
varying usage purposes. Many users lack good knowledge of the information
network’s structure, are unaware of a particular search’s heavy cost, frequently
get lost within the web’s ocean of information, and face lengthy waits to
retrieve search results.
ix. Web page complexity far exceeds the complexity of any traditional text
document collection. Although the web functions as a huge digital library, the
pages themselves lack a uniform structure and contain far more authoring style
and content variations than any set of books or traditional text-based
documents. Moreover, searching it is extremely difficult.
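The low-precision and low-recall problems in item i of the list above can be made concrete with the usual definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that were retrieved. The page identifiers below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

RESULTS = ["p1", "p2", "p3", "p4"]            # pages a hypothetical search returned
RELEVANT = ["p2", "p4", "p7", "p8", "p9"]     # pages the user actually needed

print(precision_recall(RESULTS, RELEVANT))
```

Here half of the returned pages are irrelevant (the abundance problem), and three of the five relevant pages were never returned (limited coverage).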
Common problems web marketers want to solve are how to target advertisements
(targeting), personalize web pages (personalization), create web pages that show products
often bought together (associations), classify articles automatically (classification),
characterize groups of similar visitors (clustering), estimate missing data and predict
future behaviour.
In general web mining tasks are:
i. Mining web search engine data
ii. Analyzing the web’s link structures
iii. Classifying web documents automatically
iv. Mining web page semantic structure and page contents
v. Mining web dynamics
vi. Personalization.
Thus, web mining refers to the overall process of discovering potentially useful
and previously unknown information or knowledge from the web data. Web mining aims
at finding and extracting relevant information that is hidden in web-related data, in
particular in text documents that are published on the web. Like data mining, it is a
multidisciplinary effort that draws techniques from fields like information retrieval,
statistics, machine learning, natural language processing and others. Web mining can be a
promising tool to address ineffective search engines that produce incomplete indexing,
retrieval of irrelevant information, or retrieved information of unverified reliability. It
is essential to have a system that helps the user find relevant and reliable information
easily and quickly on the web. Web mining not only discovers information from mounds of data
on the WWW, but also monitors and predicts user visit patterns. This gives designers more
reliable information in structuring and designing a web site.
Given the rate of growth of the web, scalability of search engines is a key issue, as
the amount of hardware and network resources needed is large and expensive. In
addition, search engines are popular tools, so they have heavy constraints on query answer
time. So, the efficient use of resources can improve both scalability and answer time. One
tool to achieve this goal is web mining.
3.3 WEB MINING TAXONOMY
Web mining can be broadly divided into three distinct categories, according to the kinds
of data to be mined. Figure 4 shows the taxonomy.
Figure 4: WEB MINING TAXONOMY
3.3.1 Web Content Mining
Web content mining is the process of extracting useful information from the
contents of web documents. Content data is the collection of facts a web page is designed
to contain. It may consist of text, images, audio, video, or structured records such as lists
and tables. Application of text mining to web content has been the most widely
researched. Issues addressed in text mining include topic discovery and tracking,
extracting association patterns, clustering of web documents and classification of web
pages. Research activities on this topic have drawn heavily on techniques developed in
other disciplines such as Information Retrieval (IR) and Natural Language Processing
(NLP). While there exists a significant body of work in extracting knowledge from images
in the fields of image processing and computer vision, the application of these techniques
to web content mining has been limited.
Web content mining is an automatic process that goes beyond keyword extraction.
Since the content of a text document presents no machine-readable semantics, some
approaches have suggested restructuring the document content into a representation that
could be exploited by machines. The usual approach to exploiting known structure in
documents is to use wrappers to map documents to some data model. Techniques using
lexicons for content interpretation are yet to come. There are two groups of web content
mining strategies: those that directly mine the content of documents and those that improve
on the content search of other tools like search engines.
3.3.2 Web Structure Mining
The structure of a typical web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages. Web structure mining is the process of
discovering structure information from the web. This can be further divided into two kinds
based on the kind of structure information used.
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to a different
location, either within the same web page or on a different web page. A hyperlink that
connects to a different part of the same page is called an intra-document hyperlink, and a
hyperlink that connects two different pages is called an inter-document hyperlink.
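A short sketch of how the two kinds of hyperlink might be separated when mining a page, using the standard-library HTML parser. It assumes, as a simplification, that an intra-document link is one whose `href` starts with `#`; a link that spells out the page's own full URL plus a fragment would need URL comparison as well. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Collect href values of <a> tags, split into intra- and inter-document links."""

    def __init__(self):
        super().__init__()
        self.intra, self.inter = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Assumption: a fragment-only href points to another part of the same page.
        (self.intra if href.startswith("#") else self.inter).append(href)

PAGE = (
    '<p><a href="#summary">Summary</a> '
    '<a href="/papers/mining.html">Paper</a> '
    '<a href="http://example.com/">Site</a></p>'
)

parser = LinkClassifier()
parser.feed(PAGE)
print("intra-document:", parser.intra)
print("inter-document:", parser.inter)
```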
Document Structure
In addition, the content within a Web page can also be organized in a tree
structured format, based on the various HTML and XML tags within the page. Mining
efforts here have focused on automatically extracting document object model (DOM)
structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000).
The World Wide Web can reveal more information than just the information contained
in documents. For example, links pointing to a document indicate the popularity of the
document, while links coming out of a document indicate the richness or perhaps the
variety of topics covered in the document. This can be compared to bibliographic citations.
When a paper is cited often, it ought to be important. The PageRank and CLEVER
methods take advantage of this information conveyed by the links to find pertinent web
pages. By means of counters, higher levels cumulate the number of artefacts subsumed by
the concepts they hold.
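The citation analogy can be illustrated with a simplified PageRank computed by repeated propagation of rank along links. This is a sketch under stated assumptions: the four-page link graph is invented, the damping factor of 0.85 is a conventional choice rather than something given in the text, and pages without out-links spread their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share           # each out-link passes on an equal share
            else:
                for q in pages:               # page with no out-links: spread rank evenly
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
WEB = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(WEB)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C, which three pages point to, ends up with the highest score, and A benefits from being the only page C links to, which mirrors the idea that a link from an important page counts for more.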
3.3.3 Web Usage Mining
Web usage mining is the application of data mining techniques to discover
interesting usage patterns from web usage data, in order to understand and better serve the
needs of web-based applications. Usage data captures the identity or origin of web users
along with their browsing behaviour at a web site. Web usage mining itself can be
classified further depending on the kind of usage data considered:
Web Server Data
User logs are collected by the web server and typically include IP address, page
reference and access time.
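A sketch of extracting those three fields from one server log entry is shown below. It assumes the log uses the widely used Common Log Format; the sample line is invented, and the function name `parse_log_line` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "METHOD page PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_log_line(line):
    """Return the IP address, page reference, access time and status, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "page": m.group("page"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

LINE = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

record = parse_log_line(LINE)
print(record["ip"], record["page"], record["time"].isoformat())
```

Records parsed this way are the raw material for the later steps, such as grouping requests from the same IP address into sessions.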
Application Server Data
Commercial application servers such as WebLogic and StoryServer have significant
features to enable E-commerce applications to be built on top of them with little effort. A
key feature is the ability to track various kinds of business events and log them in
application server logs.
Application Level Data
New kinds of events can be defined in an application, and logging can be turned on
for them, generating histories of these events. It must be noted, however, that many end
applications require a combination of one or more of the techniques applied in the above
categories.
Web servers record and accumulate data about user interactions whenever
requests for resources are received. Analyzing the web access logs of different web sites
can help understand the user behaviour and the web structure, thereby improving the
design of this colossal collection of resources. There are two main tendencies in web usage
mining driven by the applications of discoveries: general access pattern tracking and
customized usage tracking. The general access pattern tracking analyzes the web logs to
understand access patterns and trends.
3.4 THE AXES OF WEB MINING
3.4.1 WWW Impact
The World Wide Web has grown in the past few years from a small research
community to the biggest and most popular way of communication and information
dissemination. Every day, the WWW grows by roughly a million electronic pages, adding
to the hundreds of millions already on-line. WWW serves as a platform for exchanging
various kinds of information, ranging from research papers, and educational content, to
multimedia content and software. The continuous growth in the size and the use of the
WWW calls for new methods for processing these huge amounts of data. Because of its
rapid and chaotic growth, the resulting network of information lacks organization and
structure. Moreover, the content is published in various diverse formats.
3.4.2 Web data
Web data are those that can be collected and used in the context of Web
Personalization. These data are classified in four categories:
• Content data are presented to the end-user appropriately structured. They can be
simple text, images, or structured data, such as information retrieved from databases.
• Structure data represent the way content is organized. They can be either data
entities used within a Web page, such as HTML or XML tags, or data entities used
to put a Web site together, such as hyperlinks connecting one page to another.
• Usage data represent a Web site’s usage, such as a visitor’s IP address, time and
date of access, complete path accessed, referrers’ address, and other attributes that
can be included in a Web access log.
• User profile data provide information about the users of a Web site. A user profile
contains demographic information for each user of a Web site, as well as
information about users’ interests and preferences. Such information is acquired
through registration forms or questionnaires, or can be inferred by analyzing Web
usage logs.
3.5 WEB MINING PROS AND CONS
PROS
Web mining has many advantages which make this technology attractive to
corporations and government agencies. This technology has enabled e-commerce to do
personalized marketing, which eventually results in higher trade volumes. Government
agencies are using this technology to classify threats and fight against terrorism. The
predictive capability of mining applications can benefit society by identifying criminal
activities. Companies can establish better customer relationships by giving customers
exactly what they need. Companies can understand the needs of the customer better and
react to customer needs faster. They can find, attract and retain customers, and they can
save on production costs by utilizing the acquired insight into customer requirements. They
can increase profitability by target pricing based on the profiles created. They can even
identify a customer who might defect to a competitor; the company can then try to retain
that customer with promotional offers, thus reducing the risk of losing customers.
CONS
Web mining itself doesn’t create issues, but this technology, when used on data of a
personal nature, might cause concerns. The most criticized ethical issue involving web
mining is the invasion of privacy. Privacy is considered lost when information concerning
an individual is obtained, used, or disseminated, especially if this occurs without their
knowledge or consent. The obtained data will be analyzed, and clustered to form profiles;
the data will be made anonymous before clustering so that there are no personal profiles.
Thus these applications de-individualize the users by judging them by their mouse clicks.
De-individualization, can be defined as a tendency of judging and treating people on the
basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that the companies collecting the data for a specific purpose
might use the data for a totally different purpose, and this essentially violates the user’s
interests. The growing trend of selling personal data as a commodity encourages website
owners to trade personal data obtained from their site. This trend has increased the amount
of data being captured and traded increasing the likeliness of one’s privacy being invaded.
The companies which buy the data are obliged make it anonymous and these companies
are considered authors of any specific release of mining patterns. They are legally
responsible for the contents of the release; any inaccuracies in the release can result in
serious lawsuits, but there is no law preventing them from trading the data. Some mining
algorithms might use controversial attributes like sex, race, religion, or sexual orientation
to categorize individuals. These practices might violate anti-discrimination
legislation. The applications make it hard to identify the use of such controversial
attributes, and there is no strong rule against the usage of such algorithms with such
attributes. This process could result in the denial of a service or a privilege to an individual
based on race, religion or sexual orientation. At present, this situation can be avoided only
through high ethical standards maintained by the data mining company. The collected data is
made anonymous so that the obtained data and the obtained patterns cannot be
traced back to an individual. It might look as if this poses no threat to one’s privacy, but
in fact much additional information can be inferred by the application by combining two
separate pieces of data from the user.
4. WEB CONTENT MINING
Web content mining is the process of extracting useful information from the
content of Web documents. Semi-structured Web page text contains logical structure,
semantic content and layout. Topic discovery, extraction of association patterns, clustering
of Web documents and classification of Web pages are some of the research issues in text
mining. These activities use techniques from other disciplines: IR (information retrieval),
IE (information extraction), NLP (natural language processing) and others. Automatic
extraction of semantic relations and structures from the Web is a growing application of
Web content mining. In this area, several algorithms are used: hierarchical clustering
algorithms on terms in order to create concept hierarchies, formal concept analysis and
association rule mining to learn generalized conceptual relations, and automatic extraction
of structured data records from semi-structured HTML pages. The primary goal of each
algorithm is to create a set of formally defined domain ontologies that represent Web site
content. Common representation approaches are vector-space models, description logics,
first-order logic, relational models and probabilistic relational models. Structured data
extraction is one of the most widely studied research topics of Web content mining.
Structured data on the Web are often very important as they represent their host pages’
essential information. Extracting such data allows one to provide value added services,
e.g. shopping and meta-search. In contrast to unstructured texts, structured data is also
easier to extract. This problem has been studied by researchers in AI, databases and data
mining.
Web content mining is the discovery of useful information from Web contents, data and
documents; equivalently, it is the application of data mining techniques to content
published on the Internet. The Web contains many kinds and types of data. Basically, Web
content consists of several types of data such as plain text (unstructured), image, audio,
video and metadata, as well as HTML (semi-structured) or XML (structured) documents,
dynamic documents and multimedia documents. Recent research on mining multiple types of
data is termed multimedia data mining. Thus we could consider multimedia data mining as an
instance of Web content mining. The research around applying data mining techniques to
unstructured text is termed knowledge discovery in texts, text data mining, or text mining.
Hence we could consider text mining as an instance of Web content mining. Research
issues addressed in text mining are: topic discovery, extracting association patterns,
clustering of Web documents and classification of Web pages.
4.1 ISSUES IN WEB CONTENT MINING:
• developing intelligent tools for information retrieval
• finding keywords and key phrases
• discovering grammatical rules and collocations
• hypertext classification/categorization
• extracting key phrases from text documents
• learning extraction rules
• hierarchical clustering
• predicting relationships
4.2 WEB CONTENT MINING APPROACHES:
CONTENT MINING
Agent Based Approach
1. Intelligent Search Agents
2. Information Filtering/Categorization
3. Personalized Web Agents
Data Base Approach
1. Multilevel Databases
2. Web Query Systems
4.2.1 AGENT BASED APPROACHES:
The agent based approach involves AI systems that can “act autonomously or semi
autonomously on behalf of a particular user, to discover and organize web based
information”. Agent based approaches focus on intelligent and autonomous web mining
tools based on agent technology.
i. Some intelligent web agents can use a user profile to search for relevant
information, then organize and interpret the discovered information.
Example: Harvest.
ii. Some use various information retrieval techniques and the characteristics of open
hypertext documents to organize and filter retrieved information.
Example: HyPursuit.
iii. Some learn user preferences and use those preferences to discover information sources
for those particular users.
Agent-based Web mining systems can be placed into the following three categories:
Intelligent Search Agents: Several intelligent Web agents have been developed that
search for relevant information using domain characteristics and user profiles to organize
and interpret the discovered information. Agents such as Harvest, FAQ-Finder,
Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain
information about particular types of documents, or on hard-coded models of the
information sources to retrieve and interpret documents. Agents such as ShopBot and
ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar
information sources. ShopBot retrieves product information from a variety of vendor sites
using only general information about the product domain. ILA learns models of various
information sources and translates these into its own concept hierarchy.
Information Filtering/Categorization: A number of Web agents use various information
retrieval techniques and characteristics of open hypertext Web documents to automatically
retrieve, filter, and categorize them. HyPursuit uses semantic information embedded in link
structures and document content to create cluster hierarchies of hypertext documents, and
structure an information space. BO (Bookmark Organizer) combines hierarchical clustering
techniques and user interaction to organize a collection of Web documents based on
conceptual information.
Personalized Web Agents: This category of Web agents learns user preferences and
discovers Web information sources based on these preferences, and those of other
individuals with similar interests.
4.2.2 DATA BASE APPROACHES:
The data base approach focuses on “integrating and organizing the heterogeneous and
semi-structured data on the web into more structured and high level collections of
resources”. Metadata or generalizations are extracted from the data and organized into
structured collections, which can then be accessed and analyzed.
Database approaches to Web mining have focused on techniques for organizing
the semi-structured data on the Web into more structured collections of resources, and
using standard database querying mechanisms and data mining techniques to analyze it.
Multilevel Databases: The main idea behind this approach is that the lowest level of the
database contains semi-structured information stored in various Web repositories, such as
hypertext documents. At the higher levels, metadata or generalizations are extracted from
lower levels and organized in structured collections, i.e. relational or object-oriented
databases. For example, Han et al. use a multilayered database where each layer is
obtained via generalization and transformation operations performed on the lower layers.
Khosla et al. propose the creation and maintenance of meta-databases at each information
providing domain and the use of a global schema for the meta-database. Another proposal
is the incremental integration of a portion of the schema from each information source,
rather than relying on a global heterogeneous database schema. The ARANEUS system
extracts relevant information from hypertext documents and integrates these into
higher-level derived Web hypertexts, which are generalizations of the notion of database
views.
Web Query Systems: Many Web-based query systems and languages utilize standard
database query languages such as SQL, structural information about Web documents, and
even natural language processing for the queries that are used in World Wide Web
searches. W3QL combines structure queries, based on the organization of hypertext
documents, and content queries, based on information retrieval techniques. WebLog, a
logic-based query language for restructuring, extracts information from Web information
sources. Lorel and UnQL query heterogeneous and semi-structured information on the
Web using a labelled graph data model. TSIMMIS extracts data from heterogeneous and
semi-structured information sources and correlates them to generate an integrated database
representation of the extracted information.
4.3 WEB CONTENT MINING TASK
4.3.1 Structured Data Extraction
This is perhaps the most widely studied research topic of Web content mining. One
of the reasons for its importance and popularity is that structured data on the Web are
often very important as they represent their host pages’ essential information, e.g., lists of
products and services. Extracting such data allows one to provide value added services,
e.g., comparative shopping, and meta-search. Structured data is also easier to extract
compared to unstructured texts. This problem has been studied by researchers in AI,
database and data mining, and Web communities. There are several approaches to
structured data extraction, which is also called wrapper generation. The first approach is to
manually write an extraction program for each Web site based on observed format patterns
of the site. This approach is very labour intensive and time consuming. It thus does not
scale to a large number of sites. The second approach is wrapper induction or wrapper
learning, which is currently the main technique. Wrapper learning works as follows: The
user first manually labels a set of training pages. A learning system then generates rules
from the training pages. The resulting rules are then applied to extract target items from
Web pages. The third approach is the automatic approach. Structured data objects on
the Web are normally database records retrieved from underlying databases and displayed
in Web pages with some fixed templates, so automatic methods aim to find
patterns/grammars from the Web pages and then use them to extract data.
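The wrapper learning steps described above (label training pages, generate rules, apply
the rules) can be illustrated with a very small sketch. It assumes the simplest possible
rule form, a pair of left and right delimiter strings shared by all training pages; the
function names and sample pages are hypothetical, and practical wrapper induction systems
learn much richer rules.

```python
import os.path


def common_suffix(strings):
    """Return the longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_wrapper(labeled_pages):
    """Learn a (left, right) delimiter pair from (page, target) training pairs.

    Assumption: the text right before and right after each labeled target
    is shared by all training pages (a fixed template).
    """
    lefts, rights = [], []
    for page, target in labeled_pages:
        start = page.index(target)  # position of the manually labeled item
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("training pages share no delimiters")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None if absent."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None


# Hypothetical training pages with the target item labeled by hand.
training = [
    ("<html><b>Phone A</b> <i>Price:</i> <span>$120</span></html>", "$120"),
    ("<html><b>Camera B</b> <i>Price:</i> <span>$85</span></html>", "$85"),
]
wrapper = learn_wrapper(training)
new_page = "<html><b>Laptop C</b> <i>Price:</i> <span>$999</span></html>"
extracted = apply_wrapper(wrapper, new_page)
```

The learned rule only works while the site keeps its template, which is one reason
wrappers need to be relearned when a site changes its layout.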
4.3.2 Unstructured Text Extraction
Most Web pages can be seen as text documents. Extracting information from Web
documents has also been studied by many researchers. The research is closely related to
text mining, information retrieval and natural language processing. Current techniques are
mainly based on machine learning and natural language processing to learn extraction
rules. Recently, a number of researchers also make use of common language patterns
(common sentence structures used to express certain facts or relations) and redundancy of
information on the Web to find concepts, relations among concepts and named entities.
The patterns can be automatically learnt or supplied by human users. Another direction of
research in this area is Web question-answering. Although question-answering was first
studied in the information retrieval literature, it has become very important on the Web,
as the Web offers the largest source of information and the objectives of many Web search
queries are to obtain answers to some simple questions. Question-answering has been
extended to the Web by query transformation, query expansion, and then selection.
4.3.3 Web Information Integration
Due to the sheer scale of the Web and diverse authorship, various Web sites may
use different syntaxes to express similar or related information. In order to make use of or
to extract information from multiple sites to provide value added services, e.g.,
metasearch and deep Web search, one needs to semantically integrate information from
multiple sources. Recently, several researchers have attempted this task. Two popular
problems related to the Web are (1) Web query interface integration, to enable querying
multiple Web databases, and (2) schema matching, e.g., integrating Yahoo’s and Google’s
directories to match concepts in the hierarchies. The ability to query multiple deep Web
databases is attractive and interesting because the deep Web contains a huge amount of
information or data that is not indexed by general search engines.
4.3.4 Building Concept Hierarchies
Because of the huge size of the Web, organization of information is obviously an
important issue. Although it is hard to organize the whole Web, it is feasible to organize
Web search results of a given query. A linear list of ranked pages produced by search
engines is insufficient for many applications. The standard method for information
organization is concept hierarchy and/or categorization. The popular technique for
hierarchy construction is text clustering, which groups similar search results together in a
hierarchical fashion. A different approach does not use clustering. Instead, it exploits
existing organizational structures in the original Web documents, emphasizing tags and
language patterns, to perform data mining to find important concepts, sub-concepts and
their hierarchical relationships. In other words, it makes use of the information redundancy
property and semi-structured nature of the Web to find what concepts are important and
what their relationships might be. This work aims to compile a survey article or a book on
the Web automatically.
4.3.5 Segmenting Web Pages & Detecting Noise
A typical Web page consists of many blocks or areas, e.g., main content areas,
navigation areas, advertisements, etc. It is useful to separate these areas automatically for
several practical applications. For example, in Web data mining, e.g., classification and
clustering, identifying main content areas or removing noisy blocks (e.g., advertisements,
navigation panels, etc.) enables one to produce much better results. It has been shown that
the information contained in noisy blocks can seriously harm Web data mining. Another
application is Web browsing using a small screen device, such as a PDA. Identifying
different content blocks allows one to re-arrange the layout of the page so that the main
contents can be seen easily without losing any other information from the page.
4.3.6 Mining Web Opinion Sources
Consumer opinions used to be very difficult to obtain before the Web was
available. Companies usually conducted consumer surveys or engaged external consultants to
find such opinions about their products and those of their competitors. Now much of the
information is publicly available on the Web. There are numerous Web sites and pages
containing consumer opinions, e.g., customer reviews of products, forums, discussion
groups, and blogs. This online word-of-mouth behaviour represents new and measurable
sources of information for marketing intelligence. Techniques are now being developed to
exploit these sources to help companies and individuals to gain such information
effectively and easily. For instance, one proposed method uses feature based summarization to
automatically analyze consumer opinions in customer reviews from online merchant sites
and dedicated review sites. The result of such a summary is useful to both potential
customers and product manufacturers.
5. WEB STRUCTURE MINING
Web Structure Mining operates on the web’s hyperlink structure. This graph
structure can provide information about page ranking or authoritativeness and enhance
search results through filtering; that is, structure mining tries to discover the model
underlying the link structures of the web. This model is used to analyze the similarity and
relationship between different web sites. It uses the hyperlink structure of the web as an
additional information source. This type of mining can be further divided into two kinds
based on the kind of structural data used.
a) HYPERLINKS:
A hyperlink is a structural unit that connects a web page to a different location,
either within the same web page (intra-document hyperlink) or on a different web
page (inter-document hyperlink).
b) DOCUMENT STRUCTURE:
In addition, the content within a web page can also be organized in a tree
structured format, based on various HTML and XML tags within the page. Mining
efforts here have focused on automatically extracting document object model
(DOM) structures out of documents.
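As a rough illustration of the tree-structured view of a page, the sketch below builds a
nested tag tree with Python's standard `html.parser` module and lists the tag path of every
element. The `TreeBuilder` class, the dictionary node layout and the sample page are
assumptions made for this example; this is not a full DOM implementation.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TreeBuilder(HTMLParser):
    """Build a nested {tag, children} tree from the tags of a page."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Yield the root-to-node tag path of every node in the tree."""
    path = f"{prefix}/{node['tag']}"
    yield path
    for child in node["children"]:
        yield from tag_paths(child, path)


page = "<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>"
paths = list(tag_paths(build_tree(page)))
```

Repeated tag paths, such as the two `li` elements here, are the kind of regularity that
structure-based extraction methods look for.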
Web link analysis is used for:
1. ordering documents matching a user query (ranking)
2. deciding what pages to add to a collection
3. page categorization
4. finding related pages
5. finding duplicated web sites
6. finding similarity between web sites
Web structure mining uses the hyperlink structure of the Web to yield useful
information, including definitive pages specification, hyperlinked communities
identification, Web pages categorization and Web site completeness evaluation. Web
structure mining can be divided into two categories based on the kind of structured data
used:
1. Web graph mining: The Web provides additional information about how different
documents are connected to each other via hyperlinks. The Web can be viewed as a
(directed) graph whose nodes are Web pages and whose edges are hyperlinks
between them.
2. Deep Web mining: The Web also contains a vast amount of non-crawlable content. This
hidden part of the Web is referred to as the deep Web or the hidden Web.
Compared to the static surface Web, the deep Web contains a much larger amount
of high-quality structured information.
Most mining algorithms that improve the performance of Web search are
based on two assumptions.
(a) Hyperlinks convey human endorsement. If there exists a link from page A to
page B, and these two pages are authored by different people, then the first author
found the second page valuable. Thus the importance of a page can be propagated
to those pages it links to.
(b) Pages that are co-cited by a certain page are likely related to the same topic.
The popularity or importance of a page is correlated to the number of incoming
links to some extent, and related pages tend to be clustered together through
dense linkages among them.
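The two assumptions can be made concrete with a small sketch. For assumption (a) it uses a
PageRank-style iteration as one well-known way of propagating importance along links; the
text does not name a specific algorithm, so this choice, the damping value 0.85 and the
four-page graph are assumptions for illustration. For assumption (b) it simply counts how
often two pages are linked from the same page.

```python
from itertools import combinations


def pagerank(links, damping=0.85, iterations=100):
    """Propagate importance along hyperlinks (PageRank-style power iteration)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def co_cited_pairs(links):
    """Count how many pages link to both members of each pair of pages."""
    counts = {}
    for targets in links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
rank = pagerank(links)
pairs = co_cited_pairs(links)
```

In this toy graph page C has the most incoming links and ends up with the highest rank,
while A and C form the only co-cited pair because page D links to both.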
Web information extraction has the goal of pulling out information from a
collection of Web pages and converting it to a homogeneous form that is more readily
digested and analyzed by both humans and machines. The result of IE could be used to
improve the indexing process, because IE removes irrelevant information in Web pages
and facilitates other advanced search functions due to the structured nature of data.
It is usually difficult or even impossible to directly obtain the structures of the Web
sites’ backend databases without cooperation from the sites. Instead, the sites present two
other distinguishing structures: Interface schema and result schema. The interface schema
is the schema of the query interface, which exposes attributes that can be queried in the
backend database. The result schema is the schema of the query results, which exposes
attributes that are shown to users.
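A minimal sketch of how the two schemas might be read off the pages a site exposes,
assuming the query interface is an HTML form and the results are shown in a table with
header cells. The `SchemaParser` class and both sample pages are hypothetical; real deep
Web sites vary widely and usually need more robust handling.

```python
from html.parser import HTMLParser

FIELD_TAGS = {"input", "select", "textarea"}


class SchemaParser(HTMLParser):
    """Collect form field names and table header labels from a page."""

    def __init__(self):
        super().__init__()
        self.interface_schema = []  # attributes that can be queried
        self.result_schema = []     # attributes that are shown to users
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in FIELD_TAGS:
            name = attr_map.get("name")
            if name and attr_map.get("type") != "submit":
                self.interface_schema.append(name)
        elif tag == "th":
            self._in_header = True

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_header = False

    def handle_data(self, data):
        if self._in_header and data.strip():
            self.result_schema.append(data.strip())


def extract_schemas(html):
    parser = SchemaParser()
    parser.feed(html)
    parser.close()
    return parser.interface_schema, parser.result_schema


# Hypothetical query page and result page of a book site.
query_page = (
    '<form action="/search"><input type="text" name="title">'
    '<select name="format"><option>Any</option></select>'
    '<input type="submit" value="Go"></form>'
)
result_page = (
    "<table><tr><th>Title</th><th>Author</th><th>Price</th></tr>"
    "<tr><td>X</td><td>Y</td><td>$5</td></tr></table>"
)
interface_schema, _ = extract_schemas(query_page)
_, result_schema = extract_schemas(result_page)
```

Here the interface schema exposes fewer attributes than the result schema, which is
common: a site may show fields such as price that it does not allow users to query on.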
6. WEB USAGE MINING
Web Usage Mining is a part of Web Mining, which, in turn, is a part of Data
Mining. Just as Data Mining involves extracting meaningful and valuable
information from large volumes of data, Web Usage Mining involves mining the usage
characteristics of the users of Web applications. The extracted information can then be
used in a variety of ways, such as improving the application or detecting fraudulent
elements.
Web Usage Mining is often regarded as a part of the Business Intelligence in an
organization rather than the technical aspect. It is used for deciding business strategies
through the efficient use of Web Applications. It is also crucial for the Customer
Relationship Management (CRM) as it can ensure customer satisfaction as far as the
interaction between the customer and the organization is concerned.
The major problem with Web Mining in general and Web Usage Mining in
particular is the nature of the data they deal with. With the upsurge of the Internet in this
millennium, Web data has become huge, and many transactions and usages
take place every second. Apart from its volume, the data is not
completely structured. It is in a semi-structured format, so it needs a lot of
preprocessing and parsing before the actual extraction of the required information. We
have taken up a small part of the Web Usage Mining process, which involves the
Preprocessing, User Identification, Bot removal and Analysis of the
6.1 WEB USAGE MINING ARCHITECTURE
The WEBMINER is a system that implements parts of this general architecture.
The architecture divides the Web usage mining process into two main parts. The first part
includes the domain dependent processes of transforming the Web data into suitable
transaction form. This includes preprocessing, transaction identification, and data
integration components. The second part includes the largely domain independent
application of generic data mining and pattern matching techniques (such as the discovery
of association rules and sequential patterns) as part of the system's data mining engine. The
overall architecture for the Web mining process. Data cleaning is the first step performed
in the Web usage mining process. Some low level data integration tasks may also be
performed at this stage, such as combining multiple logs, incorporating referrer logs, etc.
After the data cleaning, the log entries must be partitioned into logical clusters using one
or a series of transaction identification modules. The goal of transaction identification is
to create meaningful clusters of references for each user. The task of identifying
transactions is one of either dividing a large transaction into multiple smaller ones or merging
small transactions into fewer larger ones. The input and output transaction formats match
so that any number of modules can be combined in any order, as the data analyst sees fit.
Once the domain-dependent data transformation phase is completed, the resulting
transaction data must be formatted to conform to the data model of the appropriate data
mining task. For instance, the format of the data for the association rule discovery task
may be different than the format necessary for mining sequential patterns. Finally, a query
mechanism will allow the user (analyst) to provide more control over the discovery
process by specifying various constraints.
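The idea of chainable transaction identification modules can be sketched as follows. Each
module takes a list of transactions and returns a list of transactions in the same format,
so modules can be combined in any order. The two modules shown, the 30-minute gap and the
minimum length of two requests are illustrative assumptions, not the rules WEBMINER itself
uses.

```python
def divide_by_time_gap(transactions, max_gap=1800):
    """Split a transaction wherever two requests are more than max_gap seconds apart."""
    result = []
    for transaction in transactions:
        current = []
        for timestamp, url in transaction:
            if current and timestamp - current[-1][0] > max_gap:
                result.append(current)
                current = []
            current.append((timestamp, url))
        if current:
            result.append(current)
    return result


def merge_short(transactions, min_length=2):
    """Merge every transaction shorter than min_length into the preceding one."""
    result = []
    for transaction in transactions:
        if result and len(transaction) < min_length:
            result[-1] = result[-1] + list(transaction)
        else:
            result.append(list(transaction))
    return result


# One user's cleaned log entries as (seconds since first request, page) pairs.
user_log = [[(0, "/"), (60, "/a"), (4000, "/b"), (4030, "/c"), (9000, "/d")]]

divided = divide_by_time_gap(user_log)  # dividing module
merged = merge_short(divided)           # merging module applied to its output
```

Because both functions share the same input and output format, the analyst could just as
well apply them in the opposite order or insert further modules between them.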
6.2 WEB DATA
In Web Usage Mining, data can be collected in server logs, browser logs, proxy
logs, or obtained from an organization's database. These data collections differ in terms of
the location of the data source, the kinds of data available, the segment of population from
which the data was collected, and methods of implementation.
There are many kinds of data that can be used in Web Mining.
1. Content: The visible data in the Web pages or the information which was meant to be
imparted to the users. A major part of it includes text and graphics (images).
2. Structure: Data which describes the organization of the website. It is divided into two
types. Intra-page structure information includes the arrangement of various HTML or
XML tags within a given page. The principal kind of inter-page structure information
is the hyperlinks used for site navigation.
3. Usage: Data that describes the usage patterns of Web pages, such as IP addresses, page
references, and the date and time of accesses and various other information depending
on the log format.
6.3 DATA SOURCES
The data sources used in Web Usage Mining may include web data repositories like:
1. WEB SERVER LOGS: These are logs which maintain a history of page requests. The
W3C maintains a standard format for web server log files, but other proprietary formats
exist. More recent entries are typically appended to the end of the file. Information about
the request, including client IP address, request date/time, page requested, HTTP code,
bytes served, user agent, and referrer are typically added.
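As a sketch of how these fields can be pulled out of a log entry, the example below parses
one line with a regular expression. It assumes the widely used NCSA combined log format;
other servers and configurations use different layouts, so the pattern and the sample line
are illustrative only.

```python
import re
from datetime import datetime

# Assumed layout (NCSA combined log format):
# ip ident user [time] "method path protocol" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
entry = parse_log_line(sample)
```

Lines that do not match the assumed format return `None`, so malformed entries can be
counted or discarded during data cleaning.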
These data can be combined into a single file, or separated into distinct logs, such
as an access log, error log, or referrer log. However, server logs typically do not collect
user-specific information. These files are usually not accessible to general Internet users,
only to the webmaster or other administrative person. A statistical analysis of the server
log may be used to examine traffic patterns by time of day, day of week, referrer, or user
agent. Efficient web site administration, adequate hosting resources and the fine tuning of
sales efforts can be aided by analysis of the web server logs. Marketing departments of
any organization that owns a website should be trained to understand these powerful tools.
A Web server log is an important source for performing Web Usage Mining
because it explicitly records the browsing behaviour of site visitors. The data recorded in
server logs reflects the (possibly concurrent) access of a Web site by multiple users. These
log files can be stored in various formats such as Common log or Extended log formats.
However, the site usage data recorded by server logs may not be entirely reliable due to
the presence of various levels of caching within the Web environment. Cached page views
are not recorded in a server log. In addition, any important information passed through the
POST method will not be available in a server log. Packet sniffing technology is an
alternative to collecting usage data through server logs. Packet sniffers monitor
network traffic coming to a Web server and extract usage data directly from TCP/IP
packets. The Web server can also store other kinds of usage information such as cookies
and query data in separate logs. Cookies are tokens generated by the Web server for
individual client browsers in order to automatically track the site visitors. Tracking of
individual users is not an easy task due to the stateless connection model of the HTTP
protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns
regarding user privacy. Query data is also typically generated by online visitors while
searching for pages relevant to their information needs. Besides usage data, the server side
also provides content data, structure information and Web page meta-information.
The Web server also relies on other utilities such as CGI scripts to handle data sent
back from client browsers. Web servers implementing the CGI standard parse the URI of
the requested file to determine if it is an application program. The URI for CGI programs
may contain additional parameter values to be passed to the CGI application. Once the
CGI program has completed its execution, the Web server sends the output of the CGI
application back to the browser.
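For usage mining, the parameter values carried in such a URI are often worth recovering
from the log. The sketch below splits a requested URI into its path and parameters with
Python's standard `urllib.parse` module. Treating every path under `/cgi-bin/` as an
application program is an assumed convention for this example; servers decide this through
their own configuration.

```python
from urllib.parse import parse_qs, urlsplit


def split_cgi_request(uri, cgi_prefix="/cgi-bin/"):
    """Return (is_cgi, path, params) for a requested URI.

    The cgi_prefix convention is an assumption for illustration.
    """
    parts = urlsplit(uri)
    is_cgi = parts.path.startswith(cgi_prefix)
    params = parse_qs(parts.query)  # maps each parameter name to a list of values
    return is_cgi, parts.path, params


is_cgi, path, params = split_cgi_request("/cgi-bin/search.cgi?query=web+mining&page=2")
```

Note that only parameters sent in the URI are visible this way; values sent through the
POST method do not appear in the requested URI.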
2. PROXY SERVER LOGS: A Web proxy is a caching mechanism which lies between
client browsers and Web servers. It helps to reduce the load time of Web pages as well as
the network traffic load at the server and client side. Proxy server logs contain the HTTP
requests from multiple clients to multiple Web servers. This may serve as a data source to
discover the usage pattern of a group of anonymous users, sharing a common proxy
server.
A Web proxy acts as an intermediate level of caching between client browsers and
Web servers. Proxy caching can be used to reduce the loading time of a Web page
experienced by users as well as the network traffic load at the server and client sides. The
performance of proxy caches depends on their ability to predict future page requests
correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to
multiple Web servers. This may serve as a data source for characterizing the browsing
behaviour of a group of anonymous users sharing a common proxy server.
3. BROWSER LOGS: Various browsers such as Mozilla and Internet Explorer can be
modified, or JavaScript and Java applets can be used, to collect client-side data.
This implementation of client-side data collection requires user cooperation, either in
enabling the functionality of the JavaScript and Java applets, or in voluntarily using the
modified browser. Client-side collection scores over server-side collection because it
reduces both the bot and session identification problems.
Client-side data collection can be implemented by using a remote agent such as
JavaScript or Java applets, or by modifying the source code of an existing browser such as
Mosaic or Mozilla to enhance its data collection capabilities. The implementation of
client-side data collection methods requires user cooperation, either in enabling the
functionality of the JavaScript and Java applets, or in voluntarily using the modified
browser. Client-side collection has an advantage over server-side collection because it
ameliorates both the caching and session identification problems. However, Java applets
perform no better than server logs in terms of determining the actual view time of a page.
In fact, they may incur some additional overhead, especially when the Java applet is loaded
for the first time. JavaScript, on the other hand, consumes little interpretation time but
cannot capture all user clicks (such as reload or back buttons). These methods will collect
only single-user, single-site browsing behaviour. A modified browser is much more
versatile and will allow data collection about a single user over multiple Web sites. The
most difficult part of using this method is convincing the users to use the browser for their
daily browsing activities. This can be done by offering incentives to users who are willing
to use the browser, similar to the incentive programs offered by companies such as
NetZero and AllAdvantage that reward users for clicking on banner advertisements while
surfing the Web.
6.4 INFORMATION OBTAINED
1. Number of Hits: This number usually signifies the number of times any resource is
accessed in a Website. A hit is a request to a web server for a file (web page, image,
JavaScript, Cascading Style Sheet, etc.). When a web page is downloaded from a server,
the number of "hits" or "page hits" is equal to the number of files requested. Therefore,
one page load does not always equal one hit, because pages are often made up of
images and other files which add to the number of hits counted.
2. Number of Visitors: A "visitor" is exactly what it sounds like. It's a human who
navigates to your website and browses one or more pages on your site.
3. Visitor Referring Website: The referring website is the website (given by its URL)
from which the visitor arrived at the website under consideration.
4. Visitor Referral Website: The referral website is the website (given by its URL)
to which the website under consideration refers the visitor.
5. Time and Duration: This information in the server logs gives the time at which, and the
duration for which, the Website was accessed by a particular user.
6. Path Analysis: Path analysis gives the analysis of the path a particular user has
followed in accessing contents of a Website.
7. Visitor IP address: This information gives the Internet Protocol (IP) address of the
visitors who visited the Website in consideration.
8. Browser Type: This information gives the information of the type of browser that was
used for accessing the Website.
9. Cookies: A message given to a Web browser by a Web server. The browser stores the
message in a text file called cookie. The message is then sent back to the server each
time the browser requests a page from the server. The main purpose of cookies is to
identify users and possibly prepare customized Web pages for them. When you enter a
Web site using cookies, you may be asked to fill out a form providing such
information as your name and interests. This information is packaged into a cookie and
sent to your Web browser which stores it for later use. The next time you go to the
same Web site, your browser will send the cookie to the Web server. The server can
use this information to present you with custom Web pages. So, for example, instead
of seeing just a generic welcome page you might see a welcome page with your name
on it.
10. Platform: This information gives the type of Operating System etc. that was used to
access the Website.
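Most of the items listed above can be read directly from a server log entry. The following is a minimal sketch, assuming the log is written in the widely used combined log format (host, identity, user, time, request, status, size, referrer, agent). The sample entries, the names LOG_PATTERN and parse_line, and the example.com URLs are invented for illustration and are not taken from any log analyzed in this report.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample entries (illustration only).
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2023:13:55:37 +0000] "GET /logo.png HTTP/1.1" '
    '200 512 "http://example.com/index.html" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:14:02:10 +0000] "GET /products.html HTTP/1.1" '
    '200 4100 "http://example.com/index.html" "Opera/9.80"',
]

records = [r for r in map(parse_line, SAMPLE_LOG) if r is not None]
total_hits = len(records)                        # every requested file is one hit
unique_ips = {r["ip"] for r in records}          # rough proxy for visitors
hits_per_path = Counter(r["path"] for r in records)
agents = Counter(r["agent"] for r in records)    # browser type information
```

Note that one page load (/index.html plus /logo.png) already produces two hits here, which is why hits and page views must be counted separately.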
6.5 POSSIBLE ACTIONS
1. Shortening Paths of High Visit Pages: Pages that are frequently accessed by
users are often reached by following a particular path. Links to these pages can be
placed in an easily accessible part of the Website, thus decreasing the navigation
path length.
2. Eliminating or Combining Low Visit Pages: The pages which are not frequently
accessed by users can be either removed or their content can be merged with pages
with frequent access.
3. Redesigning Pages to help User Navigation: To help the user to navigate through the
website in the best possible manner, the information obtained can be used to redesign
the structure of the Website.
4. Redesigning Pages For Search Engine Optimization: The content as well as other
information in the website can be improved from analyzing user patterns and this
information can be used to redesign pages for Search Engine Optimization so that the
search engines index the website at a proper rank.
5. Help Evaluating Effectiveness of Advertising Campaigns: Important and business
critical advertisements can be put up on pages that are frequently accessed.
6.6 WEB USAGE MINING PROCESS
6.6.1 PREPROCESSING
Data preprocessing describes any type of processing performed on raw data to
prepare it for another processing procedure. Commonly used as a preliminary data mining
practice, data preprocessing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user. The different types of preprocessing in
Web Usage Mining are:
1. Usage Pre-Processing: Pre-Processing relating to Usage patterns of users.
Usage preprocessing is arguably the most difficult task in the Web Usage Mining process
due to the incompleteness of the available data. Unless a client-side tracking mechanism is
used, only the IP address, agent, and server-side click-stream are available to identify
users and server sessions. Some of the typically encountered problems are:

Single IP address/Multiple Server Sessions – Internet service providers (ISPs)
typically have a pool of proxy servers that users access the Web through. A single
proxy server may have several users accessing a Web site, potentially over the
same time period.

Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly
assign each request from a user to one of several IP addresses. In this case, a single
server session can have multiple IP addresses.

Multiple IP address/Single User - A user that accesses the Web from different
machines will have a different IP address from session to session. This makes
tracking repeat visits from the same user difficult.

Multiple Agent/Single User - Again, a user that uses more than one browser, even
on the same machine, will appear as multiple users.
Assuming each user has now been identified (through cookies, logins, or
IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since
page requests from other servers are not typically available, it is difficult to know when a
user has left a Web site. A thirty minute timeout is often used as the default method of
breaking a user's click-stream into sessions. When a session ID is embedded in each URI,
the definition of a session is set by the content server.
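The thirty minute timeout rule described above can be sketched as follows. This is a minimal illustration, assuming the clicks of a single, already identified user are available as time-ordered (timestamp, page) pairs; the function name split_sessions and the sample clicks are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # default timeout mentioned in the text

def split_sessions(clicks, timeout=SESSION_TIMEOUT):
    """Split one user's time-ordered (timestamp, page) clicks into sessions."""
    sessions = []
    previous_time = None
    for timestamp, page in clicks:
        # Start a new session on the first click or after a long pause.
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        previous_time = timestamp
    return sessions

# Hypothetical click-stream: a 45 minute gap separates the two visits.
clicks = [
    (datetime(2023, 10, 10, 9, 0), "A"),
    (datetime(2023, 10, 10, 9, 5), "B"),
    (datetime(2023, 10, 10, 9, 50), "L"),
    (datetime(2023, 10, 10, 9, 55), "R"),
]
sessions = split_sessions(clicks)  # [['A', 'B'], ['L', 'R']]
```

The timeout is a heuristic: a user who reads one page for more than thirty minutes is split into two sessions, which is one reason session IDs set by the content server are preferred when they are available.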
While the exact content served as a result of each user action is often available
from the request field in the server logs, it is sometimes necessary to have access to the
content server information as well. Since content servers can maintain state variables for
each active session, the information necessary to determine exactly what content is served
by a user request is not always available in the URI. The final problem encountered when
preprocessing usage data is that of inferring cached page references. The only verifiable
method of tracking cached page views is to monitor usage from the client side. The
referrer field for each request can be used to detect some of the instances when cached
pages have been viewed. The following example refers to a sample server log that is not
reproduced in this text. In that log, IP address 123.456.78.9 is responsible for three server
sessions, and IP addresses 209.456.78.2 and 209.45.78.3 are responsible for a fourth session. Using
a combination of referrer and agent information, lines 1 through 11 can be divided into
three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page
references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session,
giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection
method, there is no method for determining that lines 12 and 13 are actually a single server
session.
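Path completion can be illustrated with a short sketch. It assumes each request in a session is available as a (page, referrer) pair, and that a referrer which is not the previously requested page means the user went back through cached pages. Since the sample log is not reproduced here, the referrer values below are hypothetical; they were chosen so that the completed paths match the two results quoted in the text.

```python
def complete_path(requests):
    """Insert the cached page views implied by the referrer of each request.

    requests: list of (page, referrer) pairs in the order they were logged.
    """
    path = []
    for page, referrer in requests:
        if path and referrer is not None and referrer != path[-1] and referrer in path:
            # The user must have backed up to the referrer through cached pages.
            last = len(path) - 1 - path[::-1].index(referrer)
            path.extend(reversed(path[last:-1]))
        path.append(page)
    return path

# Hypothetical referrers for the sessions A-B-F-O-G and A-B-C-J.
first_session = [("A", None), ("B", "A"), ("F", "B"), ("O", "F"), ("G", "B")]
third_session = [("A", None), ("B", "A"), ("C", "A"), ("J", "C")]

completed_first = complete_path(first_session)   # A-B-F-O-F-B-G
completed_third = complete_path(third_session)   # A-B-A-C-J
```

The sketch only recovers back-tracking through pages already seen in the same session; a referrer that never appeared in the session is left untouched.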
2. Content Pre-Processing: Pre-Processing of content accessed.
Content preprocessing consists of converting the text, image, scripts, and other files such as
multimedia into forms that are useful for the Web Usage Mining process. Often, this
consists of performing content mining such as classification or clustering. While applying
data mining to the content of Web sites is an interesting area of research in its own right,
in the context of Web Usage Mining the content of a site can be used to filter the input to,
or output from the pattern discovery algorithms. For example, results of a classification
algorithm could be used to limit the discovered patterns to those containing page views
about a certain subject or class of products. In addition to classifying or clustering page
views based on topics, page views can also be classified according to their intended use.
Page views can be intended to convey information (through text, graphics, or other
multimedia), gather information from the user, allow navigation (through a list of
hypertext links), or some combination of these uses. The intended use of a page view can
also be used to filter the sessions before or after pattern discovery.
In order to run content mining algorithms on page views, the information must first
be converted into a quantifiable format. Some version of the vector space model is
typically used to accomplish this. Text files can be broken up into vectors of words.
Keywords or text descriptions can be substituted for graphics or multimedia. The content
of static page views can be easily preprocessed by parsing the HTML and reformatting the
information or running additional algorithms as desired. Dynamic page views present
more of a challenge. Content servers that employ personalization techniques and/or draw
upon databases to construct the page views may be capable of forming more page views
than can be practically preprocessed. A given set of server sessions may only access a
fraction of the page views possible for a large dynamic site. Also the content may be
revised on a regular basis. The content of each page view to be preprocessed must be
"assembled", either by an HTTP request from a crawler, or a combination of template,
script, and database accesses. If only the portion of page views that are accessed are
preprocessed, the output of any classification or clustering algorithms may be skewed.
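As a rough illustration of the vector space model mentioned above, the sketch below turns page text into word-count vectors and compares two pages by cosine similarity. The tokenization rule, the function names to_vector and cosine_similarity, and the sample strings are assumptions made for this example; practical systems usually add stop-word removal and term weighting.

```python
import math
import re
from collections import Counter

def to_vector(text):
    """Represent a text as a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical page texts.
page_a = to_vector("Cheap guitar, cheap amp")
page_b = to_vector("Guitar strings and amp cables")
page_c = to_vector("Garden plants")

similar = cosine_similarity(page_a, page_b)    # shares "guitar" and "amp"
unrelated = cosine_similarity(page_a, page_c)  # no shared words
```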
3. Structure Pre-Processing: Pre-Processing related to structure of the website.
The structure of a site is created by the hypertext links between page views. The structure
can be obtained and preprocessed in the same manner as the content of a site. Again,
dynamic content (and therefore links) poses more problems than static page views. A
different site structure may have to be constructed for each server session.
6.6.2 PATTERN DISCOVERY
Web Usage mining can be used to uncover patterns in server logs but is often carried out
only on samples of data. The mining process will be ineffective if the samples are not a
good representation of the larger body of data.
Pattern discovery draws upon methods and algorithms developed from several
fields such as statistics, data mining, machine learning and pattern recognition. However,
it is not the intent of this paper to describe all the available algorithms and techniques
derived from these fields. Interested readers should consult the literature of those fields. This
section describes the kinds of mining activities that have been applied to the Web domain.
Methods developed from other fields must take into consideration the different kinds of
data abstractions and prior knowledge available for Web Mining.
For example, in association rule discovery, the notion of a transaction for market-basket
analysis does not take into consideration the order in which items are selected. However,
in Web Usage Mining, a server session is an ordered sequence of pages requested by a
user. Furthermore, due to the difficulty in identifying unique sessions, additional prior
knowledge is required (such as imposing a default timeout period, as was pointed out in
the previous section).
1. Statistical Analysis
Statistical techniques are the most common method to extract knowledge about visitors to
a Web site. By analyzing the session file, one can perform different kinds of descriptive
statistical analyses (frequency, mean, median, etc.) on variables such as page views,
viewing time and length of a navigational path. Many Web traffic analysis tools produce a
periodic report containing statistical information such as the most frequently accessed
pages, average view time of a page or average length of a path through a site. This report
may include limited low-level error analysis such as detecting unauthorized entry points or
finding the most common invalid URI. Despite lacking in the depth of its analysis, this
type of knowledge can be potentially useful for improving the system performance,
enhancing the security of the system, facilitating the site modification task, and providing
support for marketing decisions.
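The kind of periodic report described above can be approximated with a few lines of descriptive statistics. This is a minimal sketch that assumes the session file has already been reduced to a list of page sequences; the sample sessions reuse the page labels from the preprocessing example and are for illustration only.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical session file: one list of page views per server session.
sessions = [
    ["A", "B", "F", "O", "G"],
    ["L", "R"],
    ["A", "B", "C", "J"],
]

page_counts = Counter(page for session in sessions for page in session)
path_lengths = [len(session) for session in sessions]

most_accessed = page_counts.most_common(2)   # most frequently accessed pages
average_path_length = mean(path_lengths)     # average length of a path
median_path_length = median(path_lengths)
```

View times could be summarized in the same way if timestamps are kept with each page view.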
2. Association Rules
Association rule generation can be used to relate pages that are most often referenced
together in a single server session. In the context of Web Usage Mining, association rules
refer to sets of pages that are accessed together with a support value exceeding some
specified threshold. These pages may not be directly connected to one another via
hyperlinks. For example, association rule discovery using the Apriori algorithm may
reveal a correlation between users who visited a page containing electronic products and
those who accessed a page about sporting equipment. Aside from being applicable for
business and marketing applications, the presence or absence of such rules can help Web
designers to restructure their Web site. The association rules may also serve as a heuristic
for prefetching documents in order to reduce user-perceived latency when loading a page
from a remote site.
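Support and confidence for rules between pairs of pages can be computed as in the sketch below. It is not the full Apriori algorithm, which extends frequent itemsets level by level; it only counts pairs, which is enough to show how the support threshold works. The function name pair_rules and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support):
    """Return {(a, b): (support, confidence)} for rules a -> b over page pairs."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]  # order within a session is ignored
    page_count = Counter(p for s in page_sets for p in s)
    pair_count = Counter(
        pair for s in page_sets for pair in combinations(sorted(s), 2)
    )
    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / page_count[a])
            rules[(b, a)] = (support, count / page_count[b])
    return rules

# Hypothetical server sessions.
sessions = [
    ["/electronics", "/sports", "/home"],
    ["/electronics", "/sports"],
    ["/electronics", "/books"],
    ["/sports", "/home"],
]
rules = pair_rules(sessions, min_support=0.5)
```

With these sessions the rule /home -> /sports has support 0.5 and confidence 1.0, even though nothing in the data says the two pages are linked to each other.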
3. Clustering
Clustering is a technique to group together a set of items having similar characteristics. In
the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage
clusters and page clusters. Clustering of users tends to establish groups of users exhibiting
similar browsing patterns. Such knowledge is especially useful for inferring user
demographics in order to perform market segmentation in E-commerce applications or
provide personalized Web content to the users. On the other hand, clustering of pages
will discover groups of pages having related content. This information is useful for
Internet search engines and Web assistance providers. In both applications, permanent or
dynamic HTML pages can be created that suggest related hyperlinks to the user according
to the user's query or past history of information needs.
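A very small example of usage clustering is sketched below. It assumes each user is described by the set of pages visited and groups users whose page sets overlap strongly, using Jaccard similarity and a simple leader-style pass. This is one of many possible clustering methods and is not one prescribed by the text; the function names, threshold, and sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two page sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_clusters(user_pages, threshold):
    """Assign each user to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader page set, member user ids)
    for user, pages in user_pages.items():
        for leader_pages, members in clusters:
            if jaccard(pages, leader_pages) >= threshold:
                members.append(user)
                break
        else:
            # No existing cluster is close enough: this user starts a new one.
            clusters.append((set(pages), [user]))
    return [members for _, members in clusters]

# Hypothetical users and the pages they visited.
user_pages = {
    "u1": {"/tv", "/audio", "/cart"},
    "u2": {"/tv", "/audio"},
    "u3": {"/tents", "/boots"},
    "u4": {"/tents", "/boots", "/cart"},
}
groups = leader_clusters(user_pages, threshold=0.5)
```

The result depends on the order in which users are processed, which is a known weakness of leader-style clustering.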
4. Classification
Classification is the task of mapping a data item into one of several predefined classes. In
the Web domain, one is interested in developing a profile of users belonging to a particular
class or category. This requires extraction and selection of features that best describe the
properties of a given class or category. Classification can be done by using supervised
inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers,
k-nearest neighbour classifiers, Support Vector Machines etc. For example, classification
on server logs may lead to the discovery of interesting rules such as : 30% of users who
placed an online order in /Product/Music are in the 18-25 age group and live on the West
Coast.
5. Sequential Patterns
The technique of sequential pattern discovery attempts to find inter-session patterns such
that the presence of a set of items is followed by another item in a time-ordered set of
sessions or episodes. By using this approach, Web marketers can predict future visit
patterns which will be helpful in placing advertisements aimed at certain user groups.
Other types of temporal analysis that can be performed on sequential patterns include
trend analysis, change point detection, or similarity analysis.
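The idea of "one item followed by another" can be shown with a simplified sketch that counts ordered page pairs. It assumes each input sequence is the time-ordered list of pages for one user or episode, and it reports pairs (first, later) that occur in at least a given fraction of sequences. Full sequential pattern mining handles longer patterns and item sets; the function name and sample data here are hypothetical.

```python
from collections import Counter

def frequent_ordered_pairs(sequences, min_support):
    """Return {(first, later): support} for pairs seen in that order."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)  # each sequence supports a pair at most once
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical time-ordered visit sequences.
sequences = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "home", "product"],
]
patterns = frequent_ordered_pairs(sequences, min_support=0.6)
```

Unlike the association rules discussed earlier, the pair (search, home) and the pair (home, search) are counted separately, because order matters.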
6. Dependency Modeling
Dependency modeling is another useful pattern discovery task in Web Mining. The goal
here is to develop a model capable of representing significant dependencies among the
various variables in the Web domain. As an example, one may be interested to build a
model representing the different stages a visitor undergoes while shopping in an online
store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer).
There are several probabilistic learning techniques that can be employed to model the
browsing behaviour of users. Such techniques include Hidden Markov Models and
Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a
theoretical framework for analyzing the behaviour of users but is potentially useful for
predicting future Web resource consumption. Such information may help develop
strategies to increase the sales of products offered by the Web site or improve the
navigational convenience of users.
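As a simple stand-in for the probabilistic models named above, the sketch below estimates a first-order Markov chain from sessions: the probability of the next page given the current page. This is much simpler than a Hidden Markov Model or a Bayesian Belief Network and is offered only as an illustration; the function name and sample sessions are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        current: {page: c / sum(followers.values()) for page, c in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical shopping sessions.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "search", "product", "cart"],
]
probabilities = transition_probabilities(sessions)
```

Here two of the three transitions out of "product" lead to "cart", so the model estimates that a visitor viewing a product page moves toward a purchase with probability 2/3.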
6.6.3 PATTERN ANALYSIS
This is the final step in the Web Usage Mining process. After preprocessing and
pattern discovery, the obtained usage patterns are analyzed to filter out uninteresting
information and extract the useful information. Methods such as SQL (Structured Query
Language) processing and OLAP (Online Analytical Processing) can be used.
The motivation behind pattern analysis is to filter out uninteresting rules or
patterns from the set found in the pattern discovery phase. The exact analysis methodology
is usually governed by the application for which Web mining is done. The most common
form of pattern analysis consists of a knowledge query mechanism such as SQL. Another
method is to load usage data into a data cube in order to perform OLAP operations.
Visualization techniques, such as graphing patterns or assigning colours to different
values, can often highlight overall patterns or trends in the data. Content and structure
information can be used to filter out patterns containing pages of a certain usage type,
content type, or pages that match a certain hyperlink structure.
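A knowledge query mechanism of this kind can be sketched with an in-memory SQLite database. The table layout, column names, thresholds, and rule values below are assumptions made for the example; the point is only that discovered patterns can be stored in a table and filtered with an ordinary SQL query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules ("
    "antecedent TEXT, consequent TEXT, support REAL, confidence REAL)"
)
# Hypothetical output of the pattern discovery phase.
conn.executemany(
    "INSERT INTO rules VALUES (?, ?, ?, ?)",
    [
        ("/electronics", "/sports", 0.50, 0.67),
        ("/home", "/sports", 0.50, 1.00),
        ("/books", "/electronics", 0.05, 0.90),
    ],
)

# Keep only rules that are both frequent and reliable.
interesting = conn.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= ? AND confidence >= ? "
    "ORDER BY confidence DESC",
    (0.2, 0.6),
).fetchall()
conn.close()
```

The third rule has high confidence but is dropped because its support is too low to be of interest.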
6.7 WEB USAGE MINING AREAS
1. Personalization
2. System Improvement
3. Site Modification
4. Business Intelligence
5. Usage Characterization
6.8 WEB USAGE MINING APPLICATIONS
1. LETIZIA
Letizia is an application that assists a user browsing the Internet. As the user operates a
conventional Web browser such as Mozilla, the application tracks usage patterns and
attempts to predict items of interest by performing concurrent and autonomous exploration
of links from the user's current position. The application uses a best-first search augmented
by heuristics inferring user interest from browsing behaviour.
2. WEBSIFT
The WebSIFT (Web Site Information Filter) system is another application which performs
Web Usage Mining from server logs recorded in the extended NCSA format (which includes
referrer and agent fields). This format is quite similar to the combined log format which is
used for DSpace log files. The preprocessing algorithms include identifying users and server
sessions, and identifying cached page references through the use of the referrer field. It
identifies interesting information and frequent item sets from mining usage data.
3. ADAPTIVE WEBSITES
An adaptive website adjusts the structure, content, or presentation of information in
response to measured user interaction with the site, with the objective of optimizing future
user interactions. Adaptive websites are web sites that automatically improve their
organization and presentation by learning from their user access patterns. User interaction
patterns may be collected directly on the website or may be mined from Web server logs.
A model or models are created of user interaction using artificial intelligence and
statistical methods. The models are used as the basis for tailoring the website for known
and specific patterns of user interaction.
6.9 ANALYSIS OF WEB SERVER LOGS
We used different web server log analyzers, such as Web Expert Lite 6.1 and Analog 6.0, to
analyze various sample web server logs. The key information obtained was:
Hits: Total Hits, Visitor Hits, Average Hits per Day, Average Hits per Visitor, Failed Requests.
Page Views: Total Page Views, Average Page Views per Day, Average Page Views per Visitor.
Visitors: Total Visitors, Average Visitors per Day, Total Unique IPs.
Bandwidth: Total Bandwidth, Visitor Bandwidth, Average Bandwidth per Day, Average
Bandwidth per Hit, Average Bandwidth per Visitor.
Other: Access Data (files, images, etc.), Referrers, User Agents, etc.
Analysis of the information obtained above showed Web Usage Mining to be a powerful
technique for Web site management and improvement.
7. KEY CONCEPTS OF WEB MINING
In this section we briefly describe the new concepts introduced by the web mining
research community.
7.1 RANKING METRICS: FOR PAGE QUALITY AND RELEVANCE
Searching the web involves two main steps: Extracting the pages relevant to a
query and ranking them according to their quality. Ranking is important as it helps the
user look for “quality” pages that are relevant to the query. Different metrics have been
proposed to rank web pages according to their quality. We briefly discuss two of the
prominent ones.
PAGE RANK
Page Rank is a metric for ranking hypertext documents based on their quality.
Page, Brin, Motwani, and Winograd (1998) developed this metric for the popular search
engine Google (Brin and Page 1998). The key idea is that a page has a high rank if it is
pointed to by many highly ranked pages. So, the rank of a page depends upon the ranks of
the pages pointing to it. This process is repeated iteratively until the ranks of all pages are
determined. Intuitively, the approach can be viewed as a stochastic analysis of a random
walk on the web graph. The first term in the right hand side of the equation is the
probability that a random web surfer arrives at a page p by typing the URL or from a
bookmark, or has that particular page as his/her homepage. Here d is the probability
that the surfer chooses a URL directly, rather than traversing a link, and 1−d is the
probability that a person arrives at a page by traversing a link. The second term in the right
hand side of the equation is the probability of arriving at a page by traversing a link.
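The equation referred to above is not reproduced in this text. A form consistent with the description is PR(p) = d/N + (1 − d) · Σ PR(q)/OutDegree(q), where the sum runs over all pages q that link to p and N is the number of pages; N and the exact notation are assumptions here. Note that in this description d is the probability of a direct jump, whereas many presentations use the symbol for the probability of following a link. The sketch below iterates this formula on a tiny hypothetical graph; the function name, the value d = 0.15, the iteration count, and the even spreading of rank from pages without out-links are illustrative choices.

```python
def page_rank(links, d=0.15, iterations=50):
    """Iterate PR(p) = d/N + (1 - d) * sum(PR(q) / OutDegree(q)) over q -> p.

    links: dict mapping each page to the list of pages it links to.
    d: probability that the surfer jumps directly to a page.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}       # first term: direct arrival
        for q in pages:
            targets = links.get(q, [])
            if targets:
                share = (1 - d) * rank[q] / len(targets)
                for t in targets:                   # second term: link traversal
                    new_rank[t] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += (1 - d) * rank[q] / n
        rank = new_rank
    return rank

# Hypothetical three-page web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
```

Page C is pointed to by both A and B and ends up with the highest rank; A follows closely because it receives all of C's rank.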
HUBS AND AUTHORITIES
Hubs and authorities can be viewed as “fans” and “centers” in a bipartite core of a
web graph, where the “fans” represent the hubs and the “centers” represent the authorities.
The hub and authority scores computed for each web page indicate the extent to which the
web page serves as a hub pointing to good authority pages or as an authority on a topic
pointed to by good hubs. The scores are computed for a set of pages related to a topic
using an iterative procedure called HITS (Kleinberg 1999). First a query is submitted to a
search engine and a set of relevant documents is retrieved. This set, called the “root set,” is
then expanded by including web pages that point to those in the “root set” or are pointed
to by those in the “root set.” This new set is called the “base set.” An adjacency matrix A is
formed such that if there exists at least one hyperlink from page i to page j, then Ai,j = 1,
otherwise Ai,j = 0. The HITS algorithm is then used to compute the hub and authority scores
for this set of pages.
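The iterative computation on the adjacency matrix can be sketched as follows, assuming the base set has already been built. In each round, a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to; both vectors are then normalized. The function name, iteration count, and the three-page matrix are hypothetical.

```python
import math

def hits(adjacency, iterations=50):
    """Return (hub, authority) score lists for a 0/1 adjacency matrix."""
    n = len(adjacency)
    hub = [1.0] * n
    authority = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub scores of pages i with a link i -> j.
        authority = [sum(adjacency[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub of i: sum of authority scores of pages j with a link i -> j.
        hub = [sum(adjacency[i][j] * authority[j] for j in range(n)) for i in range(n)]
        a_norm = math.sqrt(sum(a * a for a in authority)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        authority = [a / a_norm for a in authority]
        hub = [h / h_norm for h in hub]
    return hub, authority

# Hypothetical base set: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hub_scores, authority_scores = hits(A)
```

Page 2 is pointed to by both other pages and becomes the strongest authority, while page 0, which points to both, becomes the strongest hub.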
There have been modifications and improvements to the basic page rank and hubs
and authorities approaches such as SALSA (Lempel and Moran 2000), topic sensitive
page rank (Haveliwala 2002), and web page reputations (Mendelzon and Rafiei 2000).
These different hyperlink based metrics have been discussed by Desikan, Srivastava,
Kumar, and Tan (2002).
7.2 ROBOT DETECTION AND FILTERING: SEPARATING HUMAN
AND NON HUMAN WEB BEHAVIOUR
Web robots are software programs that automatically traverse the hyperlink
structure of the web to locate and retrieve information. The importance of separating robot
behaviour from human behaviour prior to building user behaviour models has been
illustrated by Kohavi. First, e-commerce retailers are particularly concerned about the
unauthorized deployment of robots for gathering business intelligence at their web sites.
Second, web robots tend to consume considerable network bandwidth at the expense of
other users. Sessions due to web robots also make it difficult to perform click-stream
analysis effectively on the web data. Conventional techniques for detecting web robots are
based on identifying the IP address and user agent of the web clients. While these
techniques are applicable to many well-known robots, they are not sufficient to detect
camouflaged and previously unknown robots. Tan and Kumar proposed a classification
based approach that uses the navigational patterns in click-stream data to determine if it is
due to a robot. Experimental results have shown that highly accurate classification models
can be built using this approach. Furthermore, these models are able to discover many
camouflaged and previously unidentified robots.
7.3 INFORMATION SCENT: APPLYING FORAGING THEORY TO
BROWSING BEHAVIOUR
Information scent is a concept that uses the snippets of information present around the
links in a page as a “scent” to evaluate the quality of content of the page it points to, and
the cost of accessing such a page. The key idea is to model a user at a given page as
“foraging” for information, and following a link with a stronger “scent.” The “scent” of a
path depends on how likely it is to lead the user to relevant information, and is determined
by a network flow algorithm called spreading activation. The snippets, graphics, and other
information around a link are called “proximal cues.” The user’s desired information need
is expressed as a weighted keyword vector.
The similarity between the proximal cues and the user’s information need is
computed as “proximal scent.” With the proximal cues from all the links and the user’s
information need vector, a “proximal scent matrix” is generated. Each element in the
matrix reflects the extent of similarity between the link’s proximal cues and the user’s
information need. If enough information is not available around the link, a “distal scent” is
computed with the information about the link described by the contents of the pages it
points to. The proximal scent and the distal scent are then combined to give the scent
matrix. The probability that a user would follow a link is then decided by the scent or the
value of the element in the scent matrix.
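The proximal scent computation can be illustrated with a small sketch. It assumes the user's information need and each link's proximal cues are weighted keyword vectors, measures their similarity by the cosine, and normalizes the scents over all links on the page to obtain link-following probabilities. The normalization step, the function names, and the sample vectors are assumptions; spreading activation and distal scent are not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two weighted keyword vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def link_probabilities(need, link_cues):
    """Turn proximal scent values into probabilities of following each link."""
    scent = {link: cosine(need, cues) for link, cues in link_cues.items()}
    total = sum(scent.values())
    if total == 0:
        # No link carries any scent: assume every link is equally likely.
        return {link: 1.0 / len(scent) for link in scent}
    return {link: s / total for link, s in scent.items()}

# Hypothetical information need and proximal cues around three links.
need = {"guitar": 1.0, "amp": 0.5}
link_cues = {
    "/music": {"guitar": 1.0, "strings": 1.0},
    "/audio": {"amp": 1.0},
    "/garden": {"plants": 1.0},
}
follow = link_probabilities(need, link_cues)
```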
7.4 USER PROFILES: UNDERSTANDING HOW USERS BEHAVE
The web has taken user profiling to new levels. For example, in a “brick-and-mortar”
store, data collection happens only at the checkout counter, usually called the
“point-of-sale.” This provides information only about the final outcome of a complex
human decision making process, with no direct information about the process itself. In an
on-line store, the complete click-stream is recorded, which provides a detailed record of
every action taken by the user, providing a much more detailed insight into the decision
making process. Adding such behavioural information to other kinds of information about
users, for example demographic, psychographic, and so on, allows a comprehensive user
profile to be built, which can be used for many different purposes. While most
organizations build profiles of user behaviour limited to visits to their own sites, there are
successful examples of building web-wide behavioural profiles such as Alexa Research
and DoubleClick. These approaches require browser cookies of some sort, and can provide
a fairly detailed view of a user’s browsing behaviour across the web.
7.5 INTERESTINGNESS MEASURES: WHEN MULTIPLE SOURCES
PROVIDE CONFLICTING EVIDENCE
One of the significant impacts of publishing on the web has been the close
interaction now possible between authors and their readers. In the pre web era, a reader’s
level of interest in published material had to be inferred from indirect measures such as
buying and borrowing, library checkout and renewal, opinion surveys, and in rare cases
feedback on the content. For material published on the web it is possible to track the
click-stream of a reader to observe the exact path taken through on-line published material. We
can measure the time spent on each page, the specific link taken to arrive at a page and to
leave it, etc. Much more accurate inferences about readers’ interest in content can be
drawn from these observations. Mining the user click-stream for user behaviour, and using
it to adapt the “look-and-feel” of a site to a reader’s needs was first proposed by Perkowitz
and Etzioni. While the usage data of any portion of a web site can be analyzed, the most
significant, and thus “interesting,” is the one where the usage pattern differs significantly
from the link structure. This is so because the readers’ behaviour, reflected by web usage,
is very different from what the author would like it to be, reflected by the structure created
by the author. Knowledge extracted from structure data and usage data can be treated as
evidence from independent sources and combined in an evidential reasoning
framework to develop measures of interestingness.
7.6 PREPROCESSING: MAKING WEB DATA SUITABLE FOR
MINING
In the panel discussion referred to earlier, preprocessing of web data to make it suitable for
mining was identified as one of the key issues for web mining. A significant amount of
work has been done in this area for web usage data, including user identification and
session creation, robot detection and filtering, and extracting usage path patterns. Cooley’s
Ph.D. dissertation provides a comprehensive overview of the work in web usage data
preprocessing. Preprocessing of web structure data, especially link information, has been
carried out for some applications, the most notable being Google style web search.
7.7 IDENTIFYING WEB COMMUNITIES OF INFORMATION
SOURCES
The web has had tremendous success in building communities of users and
information sources. Identifying such communities is useful for many purposes. Gibson,
Kleinberg, and Raghavan identified web communities as “a core of central authoritative
pages linked together by hub pages.” Their approach was to discover emerging web
communities while crawling. A different approach to this problem was taken by Flake,
Lawrence, and Giles, who applied the “maximum-flow minimum cut model” to the web
graph for identifying “web communities.” The HITS and maximum-flow approaches
have also been compared with respect to their strengths and weaknesses. Reddy and
Kitsuregawa propose a dense bipartite graph method, a relaxation to the complete bipartite
method followed by HITS approach, to find web communities. A related concept of
“friends and neighbours” was introduced by Adamic and Adar. They identified a group of
individuals with similar interests, who in the cyber-world would form a “community.”
Two people are termed “friends” if the similarity between their web pages is high.
Similarity is measured using features such as text, out-links, in-links and mailing lists.
7.8 ONLINE BIBLIOMETRICS
With the web having become the fastest growing and most up to date source of
information, the research community has found it extremely useful to have online
repositories of publications. Lawrence observed that having articles online makes them
more easily accessible and hence more often cited than articles that are offline. Such
online repositories not only keep the researchers updated on work carried out at different
centres but also make the interaction and exchange of information much easier. With
such information stored in the web, it becomes easier to point to the most frequently cited
papers for a topic and also related papers that have been published earlier or later
than a given paper. This helps in understanding the state of the art in a particular field,
helping researchers to explore new areas. Fundamental web mining techniques are applied
to improve the search and categorization of research papers, and citing related articles.
7.9 VISUALIZATION OF THE WORLD WIDE WEB
Mining web data provides a lot of information, which can be better understood
with visualization tools. This makes concepts clearer than is possible with pure textual
representation. Hence, there is a need to develop tools that provide a graphical interface
that aids in visualizing results of web mining. Analyzing the web log data with
visualization tools has evoked a lot of interest in the research community. Chi, Pitkow,
Mackinlay, Pirolli, Gossweiler, and Card developed a web ecology and evolution
visualization (WEEV) tool to understand the relationship between web content, web
structure and web usage over a period of time. The site hierarchy is represented in a
circular form called the “Disk Tree” and the evolution of the web is viewed as a “Time
Tube.” Cadez, Heckerman, Meek, Smyth, and White present a tool called WebCANVAS
that displays clusters of users with similar navigation behaviour. Prasetyo, Pramudiono,
Takahashi, Toyoda, and Kitsuregawa developed Naviz, an interactive web log
visualization tool that is designed to display the user browsing pattern on the web site at a
global level, and then display each browsing path on the pattern displayed earlier in an
incremental manner. The support of each traversal is represented by the thickness of
the edge between the pages. Such a tool is very useful in analyzing user behaviour and
improving web sites.
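To make the idea of edge thickness concrete, the following minimal sketch counts how often each page-to-page traversal occurs in a set of sessions and scales the counts to line widths. It is not the Naviz implementation; the function names, the linear scaling and the sample sessions are assumptions made for illustration.

```python
from collections import Counter


def edge_support(sessions):
    """Count how many times each page-to-page traversal occurs."""
    counts = Counter()
    for pages in sessions:
        # Pair every page with the page visited immediately after it.
        counts.update(zip(pages, pages[1:]))
    return counts


def edge_widths(counts, min_width=1.0, max_width=8.0):
    """Scale supports linearly so the most travelled edge gets max_width."""
    if not counts:
        return {}
    top = max(counts.values())
    return {
        edge: min_width + (max_width - min_width) * n / top
        for edge, n in counts.items()
    }


# Hypothetical sessions reconstructed from a web log.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products"],
    ["/", "/about"],
]

support = edge_support(sessions)
widths = edge_widths(support)
for (src, dst), width in sorted(widths.items()):
    print(f"{src} -> {dst}: support={support[(src, dst)]}, width={width}")
```

A drawing library would then use each width when rendering the edge between the two pages.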
8. PROMINENT APPLICATIONS
Excitement about the web in the past few years has led to web applications
being developed at a much faster rate in industry than research in web-related
technologies has progressed. Many of these are based on web mining concepts, even though the
organizations that developed these applications, and invented the corresponding
technologies, did not consider them as such. We describe some of the most successful
applications in this section. Clearly, realizing that these applications use web mining is
largely a retrospective exercise. For each application category discussed below, we have
selected a prominent representative, purely for exemplary purposes. This in no way
implies that all the techniques described were developed by that organization alone. On
the contrary, in most cases the successful techniques were developed by a rapid “copy and
improve” approach to each other’s ideas.
8.1 PERSONALIZED CUSTOMER EXPERIENCE IN B2C ECOMMERCE: AMAZON.COM
Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed, “In a
traditional (brick-and-mortar) store, the main effort is in getting a customer to the store.
Once a customer is in the store they are likely to make a purchase—since the cost of going
to another store is high—and thus the marketing budget (focused on getting the customer
to the store) is in general much higher than the in store customer experience budget (which
keeps the customer in the store). In the case of an on-line store, getting in or out requires
exactly one click, and thus the main focus must be on customer experience in the store.”
This fundamental observation has been the driving force behind Amazon’s
comprehensive approach to personalized customer experience, based on the mantra “a
personalized store for every customer” (Morphy 2001). A host of web mining techniques,
such as associations between pages visited and click-path analysis, are used to improve the
customer’s experience during a “store visit.” Knowledge gained from web mining is the
key intelligence behind Amazon’s features such as “instant recommendations,” “purchase
circles,” “wish-lists,” etc.
8.2 WEB SEARCH: GOOGLE
Google is one of the most popular and widely used search engines. It provides
users access to information from over 2 billion web pages that it has indexed on its server.
The quality and speed of the search facility make it the most successful search
engine. Earlier search engines concentrated on web content alone to return the relevant
pages for a query. Google was the first to introduce the importance of the link structure in
mining information from the web. PageRank, which measures the importance of a page, is
the underlying technology in all Google search products, and uses structural information
of the web graph to return high quality results.
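The way link structure can yield an importance score is easiest to see in a small example. The sketch below is the textbook power-iteration form of PageRank, not Google's production system; the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of out-links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web graph: page -> pages it links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page C is linked to by three other pages and ends up with the highest score, while D, which nobody links to, ends up with the lowest. No page content is consulted at any point.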
The Google toolbar is another service provided by Google that seeks to make
search easier and more informative by providing additional features such as highlighting the
query words on the returned web pages. The full version of the toolbar, if installed, also
sends the click-stream information of the user to Google. The usage statistics thus
obtained are used by Google to enhance the quality of its results. Google also provides
advanced search capabilities to search images and find pages that have been updated
within a specific date range. Built on top of Netscape’s Open Directory project, Google’s
web directory provides a fast and easy way to search within a certain topic or related
topics.
The advertising program introduced by Google targets users by providing
advertisements that are relevant to a search query. This does not bother users with
irrelevant ads and has increased the clicks for the advertising companies by four to five
times. According to B to B, a leading national marketing publication, Google was named a
top 10 advertising property in the Media Power 50 that recognizes the most powerful and
targeted business-to-business advertising outlets. One of the latest services offered by
Google is Google News. It integrates news from the online versions of all newspapers and
organizes them categorically to make it easier for users to read “the most relevant news.”
It seeks to provide the latest information by constantly retrieving pages from news sites
worldwide that are being updated on a regular basis. The key feature of this news page,
like any other Google service, is that it integrates information from various web news
sources through purely algorithmic means, and thus does not introduce any human bias or
effort. However, the publishing industry is not very convinced about a fully automated
approach to news distillation.
8.3 WEB-WIDE TRACKING: DOUBLECLICK
“Web-wide tracking,” i.e. tracking an individual across all sites he visits, is an
intriguing and controversial technology. It can provide an understanding of an individual’s
lifestyle and habits to a level that is unprecedented, which is clearly of tremendous interest
to marketers. A successful example of this is DoubleClick Inc.’s DART ad management
technology. DoubleClick serves advertisements, which can be targeted on demographic or
behavioural attributes, to the end-user on behalf of the client, i.e. the web site using
DoubleClick’s service. Sites that use DoubleClick’s service are part of The DoubleClick
Network and the browsing behaviour of a user can be tracked across all sites in the
network, using a cookie. This allows DoubleClick’s ad targeting to be based on very
sophisticated criteria. Alexa Research has recruited a panel of more than 500,000 users,
who have voluntarily agreed to have their every click tracked, in return for some freebies.
This is achieved through having a browser bar that can be downloaded by the panelist
from Alexa’s website, which gets attached to the browser and sends Alexa a complete
click-stream of the panelist’s web usage. Alexa was purchased by Amazon for its tracking
technology. Clearly, web-wide tracking is a very powerful idea. However, the invasion of
privacy it causes has not gone unnoticed, and both Alexa/Amazon and DoubleClick have
faced very visible lawsuits. Microsoft’s Passport technology also falls into this category.
The value of this technology in applications such as cyber-threat analysis and homeland
defense is quite clear, and it might be only a matter of time before these organizations are
asked to provide information to law enforcement agencies.
8.4 UNDERSTANDING WEB COMMUNITIES: AOL
One of the biggest successes of America Online (AOL) has been its sizeable and
loyal customer base. A large portion of this customer base participates in various AOL
communities, which are collections of users with similar interests. In addition to providing
a forum for each such community to interact amongst themselves, AOL provides them
with useful information and services. Over time these communities have grown to be well-visited waterholes for AOL users with shared interests. Applying web mining to the data
collected from community interactions provides AOL with a very good understanding of
its communities, which it has used for targeted marketing through advertisements and email solicitation. Recently, it has started the concept of “community sponsorship,”
whereby an organization, say Nike, may sponsor a community called “Young Athletic
Twenty Something.” In return, consumer survey and new product development experts of
the sponsoring organization get to participate in the community, perhaps without the
knowledge of other participants. The idea is to treat the community as a highly specialized
focus group, understand its needs and opinions on new and existing products, and also test
strategies for influencing opinions.
8.5 UNDERSTANDING AUCTION BEHAVIOUR: EBAY
As individuals in a society where we have many more things than we need, the
allure of exchanging our useless stuff for some cash, no matter how small, is quite
powerful. This is evident from the success of flea markets, garage sales and estate sales.
The genius of eBay’s founders was to create an infrastructure that gave this urge a global
reach, with the convenience of doing it from one’s home PC. In addition, it popularized
auctions as a product selling and buying mechanism and provides the thrill of gambling
without the trouble of having to go to Las Vegas. All of this has made eBay one of the
most successful businesses of the internet era. Unfortunately, the anonymity of the web
has also created a significant problem for eBay auctions, as it is impossible to distinguish
real bids from fake ones. eBay is now using web mining techniques to analyze bidding
behaviour to determine if a bid is fraudulent (Colet 2002). Recent efforts are geared
towards understanding participants’ bidding behaviours/patterns to create a more efficient
auction market.
8.6 PERSONALIZED PORTAL FOR THE WEB: MYYAHOO
Yahoo was the first to introduce the concept of a “personalized portal,” i.e. a web
site designed to have the look-and-feel and content personalized to the needs of an
individual end-user. This has been an extremely popular concept and has led to the
creation of other personalized portals such as Yodlee for private information like bank and
brokerage accounts. Mining MyYahoo usage logs provides Yahoo valuable insight into an
individual’s web usage habits, enabling Yahoo to provide personalized content, which in
turn has led to the tremendous popularity of the Yahoo web site.
8.7 CITESEER: DIGITAL LIBRARY AND AUTONOMOUS CITATION INDEXING
NEC Research Index, also known as CiteSeer, is one of the most popular online
bibliographic indices related to computer science. The key contribution of the CiteSeer
repository is its “Autonomous Citation Indexing” (ACI) (Lawrence, Giles, and Bollacker
1999). Citation indexing makes it possible to extract information about related articles.
Automating such a process reduces a lot of human effort, and makes it more effective and
faster. CiteSeer works by crawling the web and downloading research related papers.
Information about citations and the related context is stored for each of these documents.
The entire text and information about the document is stored in different formats.
Information about documents that are similar at a sentence level (percentage of sentences
that match between the documents), at a text level, or related through co-citation is also
given. Citation statistics are computed for documents, enabling the user to look at the
most cited or popular documents in the related field. CiteSeer also maintains a directory of
computer science related papers, to make category-based search easier. These
documents are ordered by the number of citations.
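Two of the relatedness measures mentioned above, sentence-level overlap and co-citation, can be sketched in a few lines. This is only an illustration under simple assumptions: sentences are split on end punctuation and compared after lower-casing, and citations are held in a plain dictionary. CiteSeer's actual procedures are not described here, and all names and data below are hypothetical.

```python
import re


def sentences(text):
    """Split text into a set of normalized sentences (naive splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return {" ".join(part.lower().split()) for part in parts if part}


def sentence_overlap(text_a, text_b):
    """Percentage of the sentences in text_a that also occur in text_b."""
    a, b = sentences(text_a), sentences(text_b)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


def cocitation_count(citations, doc_x, doc_y):
    """Number of papers whose reference lists contain both documents."""
    return sum(1 for refs in citations.values() if doc_x in refs and doc_y in refs)


# Hypothetical documents and citation data.
doc_a = "Web mining is useful. It has three branches. Usage mining studies logs."
doc_b = "It has three branches. Web mining is useful. Structure mining studies links."
overlap = sentence_overlap(doc_a, doc_b)

citations = {"p1": {"x", "y"}, "p2": {"x"}, "p3": {"x", "y", "z"}}
together = cocitation_count(citations, "x", "y")
print(round(overlap, 1), together)
```

Two documents that are frequently cited together are likely to be related even when their texts share no sentences, which is why the two measures complement each other.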
9. RESEARCH DIRECTIONS
Although we are going through an inevitable phase of irrational despair following a phase
of irrational exuberance about the commercial potential of the web, the adoption and usage
of the web continue to grow unabated. As the web and its usage grow, they will continue to
generate ever more content, structure, and usage data, and the value of web mining will
keep increasing. Outlined here are some research directions that must be pursued to ensure
that we continue to develop web mining technologies that will enable this value to be
realized.
9.1 WEB METRICS AND MEASUREMENTS
From an experimental human behaviourist’s viewpoint, the web is the perfect
experimental apparatus. Not only does it provide the ability to measure human
behaviour at a micro level, it eliminates the bias of the subjects knowing that they are
participating in an experiment, and allows the number of participants to be many orders of
magnitude larger than conventional studies. However, we have not yet begun to appreciate
the true impact of this revolutionary experimental apparatus for human behaviour studies.
The Web Lab of Amazon is one of the early efforts in this direction. It is regularly used to
measure the user impact of various proposed changes, on operational metrics such as site
visits and visit/buy ratios, as well as on financial metrics such as revenue and profit, before
a deployment decision is made. For example, during Spring 2000 a 48-hour
experiment on the live site was carried out, involving over one million user sessions,
before the decision to change Amazon’s logo was made. Research needs to be done in
developing the right set of web metrics, and their measurement procedures, so that various
web phenomena can be studied.
9.2 PROCESS MINING
Mining of market basket data, collected at the point-of-sale in any store, has been
one of the visible successes of data mining. However, this data provides only the end
result of the process, and only for decisions that ended in a product purchase.
Click-stream data provides the opportunity for a detailed look at the decision making process
itself, and knowledge extracted from it can be used for optimizing and influencing the
process. Underhill has conclusively proven the value of process information in
understanding users’ behaviour in traditional shops. Research needs to be carried out in (1)
extracting process models from usage data, (2) understanding how different parts of the
process model impact various web metrics of interest, and (3) understanding how the process models
change in response to various changes that are made, i.e. changing stimuli to the user.
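As a hedged illustration of item (1), one very simple process model is a first-order Markov chain estimated from click-stream sessions: for each page, the probability of each possible next page. This is one possible choice and not a method proposed in this report; the function name and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_model(sessions):
    """Estimate P(next page | current page) from observed sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            counts[src][dst] += 1
    model = {}
    for src, next_pages in counts.items():
        total = sum(next_pages.values())
        # Normalize the counts so the probabilities for each page sum to 1.
        model[src] = {dst: n / total for dst, n in next_pages.items()}
    return model


# Hypothetical sessions; "exit" marks the user leaving the site.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "search", "product", "exit"],
    ["home", "product", "cart", "exit"],
    ["home", "search", "exit"],
]

model = transition_model(sessions)
for src in ("home", "search", "product", "cart"):
    print(src, model[src])
```

Comparing the models fitted before and after a site change, for instance the probability of moving from "cart" to "checkout", gives one way to approach items (2) and (3).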
9.3 TEMPORAL EVOLUTION OF THE WEB
Society’s interaction with the web is changing the web as well as the way people
interact with each other. While storing the history of all this interaction in one place is
clearly too staggering a task, at least the changes to the web are being recorded by the
pioneering Internet Archive project. Research needs to be carried out in extracting temporal
models of how web content, web structures, web communities, authorities, hubs, etc.
evolve over time. Large organizations generally archive usage data from their web sites.
With these sources of data available, there is considerable scope for research on
techniques for analyzing how the web evolves over time.
9.4 WEB SERVICES PERFORMANCE OPTIMIZATION
As services over the web continue to grow, there will be a continuing need to make
them robust, scalable and efficient. Web mining can be applied to better understand the
behaviour of these services, and the knowledge extracted can be useful for various kinds
of optimizations. The successful application of web mining for predictive prefetching of
pages by a browser has been demonstrated by Pandey, Srivastava, and Shekhar. Analysis
of web logs is necessary for web services performance optimization.
Research is needed in developing web mining techniques to improve various other aspects
of web services.
9.5 FRAUD AND THREAT ANALYSIS
The anonymity provided by the web has led to a significant increase in attempted
fraud, from unauthorized use of individual credit cards to hacking into credit card
databases for blackmail purposes. Yet another example is auction fraud, which has been
increasing on popular sites like eBay. Since all these frauds are being perpetrated through
the internet, web mining is the perfect analysis technique for detecting and preventing
them. Research issues include developing techniques to recognize known frauds,
characterize them and recognize emerging frauds. The issues in cyber threat analysis and
intrusion detection are quite similar in nature.
9.6 WEB MINING AND PRIVACY
While there are many benefits to be gained from web mining, a clear drawback is
the potential for severe violations of privacy. Public attitude towards privacy seems to be
almost schizophrenic, i.e. people say one thing and do quite the opposite. For example,
famous cases like those involving Amazon and DoubleClick seem to indicate that people
value their privacy, while experience at major e-commerce portals shows that over 97% of
all people accept cookies with no problems, and most of them actually like the
personalization features that are provided based on them. Spiekermann, Grossklags, and
Berendt have demonstrated that people were willing to provide fairly personal information
about themselves, which was completely irrelevant to the task at hand, if provided the
right stimulus to do so. Furthermore, explicitly bringing attention to information privacy
policies had practically no effect. One explanation of this seemingly contradictory attitude
towards privacy may be that we have a bi-modal view of privacy, namely that “I’d be
willing to share information about myself as long as I get some (tangible or intangible)
benefits from it, and as long as there is an implicit guarantee that the information will not
be abused.” The research issue generated by this attitude is the need to develop
approaches, methodologies and tools that can be used to verify and validate that a web
service is indeed using users’ information in a manner consistent with its stated policies.