Proceedings of the 5th National Conference; INDIACom-2011
Computing For Nation Development, March 10 – 11, 2011
Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi
“Clustering Algorithm Employ in Web Usage Mining”: An Overview
Harish Kumar1 and Anil Kumar2
1 PhD Scholar, Mewar University, Chittorgarh.
2 Professor, MIET, Meerut.
[email protected] and [email protected]
ABSTRACT
The Internet is one of the fastest-growing sources of
information. Web users leave many records of their activity
in the form of data while working on the Internet. This huge
volume of data serves as raw material for information and
knowledge gathering, and it requires suitable mining processes.
Web usage mining, also known as web log mining, is the process
of extracting interesting patterns from web access logs. Web
servers record and collect data about user interactions whenever
requests for resources are received. Analyzing the web access
logs of different web sites can help in understanding user
behavior and web structure, and thereby in improving the design
of this huge collection of resources. Web server log files and
customer navigation data can be mined meaningfully, and user
access patterns can be forecast to identify web users’ behavior.
Clustering is effective and easy to apply, and it gives
satisfactory results. It extracts user behavior from the web logs
of previous search sessions by using clustering and path
design algorithms. A clustering algorithm identifies behavior
patterns in the cleaned web log data by grouping web data into
“classes”, so that similar objects fall in the same class and
dissimilar objects fall in different classes. Path optimization of
the web tree structure is used to shorten the search path to a
web page.
KEYWORDS
KDD, Web mining, clustering, classes.
1. INTRODUCTION
The explosive growth of online data due to the Internet and the
widespread use of databases have created a great need for KDD
methodologies. Knowledge Discovery and Data Mining (KDD)
is an interdisciplinary area focusing on methodologies for
mining useful information or knowledge from data [1]. On the
Web, users leave navigation traces, which can serve as a basis
for user behavior analysis. In the field of web applications,
such analyses have been carried out successfully with the
methods of Web Usage Mining [2][3]. The challenge of extracting
knowledge from data draws upon research in statistics,
databases, pattern recognition, machine learning, data
visualization, optimization, web user behavior and
high-performance computing, to deliver advanced business
intelligence and web discovery solutions [3][4]. Data mining is a
powerful technology with great potential to help various
industries focus on the most important information in their data
warehouses. It can be viewed as a result of the natural evolution
of information technology.
2. THE FOUNDATIONS OF DATA MINING
The information gathered through data mining has allowed
companies to earn many thousands of dollars in revenue by
making better use of the Internet to gain business intelligence
that supports vital business decisions [3]. The evolution began
when business data was first stored on computers, continued
with improvements in data access, and more recently produced
technologies that allow users to navigate through their data in
real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and
proactive information delivery. Commercial databases are
growing at unprecedented rates. A META Group survey of data
warehouse projects found that 19% of respondents were beyond
the 50 gigabyte level, while 59% expected to be there by the
second quarter of 1996 [1]. Table 1 shows the evolutionary steps
of data mining. The core components of data mining technology
have been under development for decades, in research areas
such as statistics, artificial intelligence, and machine learning.
Table 1: Evolution of Data Mining.
Development Step | Facilitating Technologies | Product Providers | Characteristics
Data Collection (1960) | Computers, disks, tape drives | IBM | Retrospective, static data delivery
Data Access (1980) | RDBMS, SQL, ODBC | Oracle, Sybase, IBM, Microsoft | Dynamic data delivery at record level
Data Warehousing & Decision Support (1990) | OLAP, multidimensional databases, data warehouses | Pilot, Comshare, Arbor | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI | Prospective, proactive information delivery
Copy Right © INDIACom-2011 ISSN 0973-7529 ISBN 978-93-80544-00-7
3. WEB DATA MINING
Web data mining is a technique for crawling through various web
resources to collect required information, which enables an
individual or a company to promote business and to understand
marketing dynamics, new promotions circulating on the Internet,
and so on. In particular, our focus is on web data mining research
in the context of web user behavior in the banking sector. Web
data mining is categorized into three areas: Web Content Mining
(WCM), Web Structure Mining (WSM) and Web Usage Mining
(WUM) [1].
Figure 1: Categorization of Web Data Mining
Web content mining (WCM) finds useful information in the
content of web pages, e.g. free, semi-structured data such as
HTML code, pictures, and various other files. Web structure
mining (WSM) generates a structural summary of the web site
and its web pages. It tries to discover the link structure of the
hyperlinks at the inter-document level, whereas web content
mining mainly focuses on the structure within a document. Web
usage mining (WUM) is applied to the data generated by visits to
a web site, especially the data contained in web log files. This
paper highlights and discusses only the research issues involved
in web usage mining. We believe that web usage mining will be a
topic of exploratory research in the near future.
3.1 WEB USAGE MINING
Web usage mining combines data mining techniques with the
Internet. The WWW is an immense source of data that can be
collected from web content or from web usage. Web usage
mining is the type of web mining activity that involves the
automatic discovery of user access patterns from one or more
web servers. Organizations often generate and collect large
volumes of data in their daily operations. Most of this
information is generated automatically by web servers
and collected in server access logs.
Figure 2: Categorization of Web Usage Mining
In web mining for banking, data can be collected at the
server side, the client side, or proxy servers, or gathered from
various other resources. In web data mining, data is of four types:
1. Content
2. Structure
3. Usage
4. User Profile.
Web usage data describes the pattern of usage of web pages,
and includes IP addresses, page references, and the date and
time of accesses. Web servers record and accumulate data about
user interactions whenever requests for resources are received.
Analyzing the web access logs of different web sites can help in
understanding user behavior and web structure, thereby
improving the design of this colossal collection of resources.
The link distance between two pages X and Y is the minimum
number of hyperlinks the user has to follow to go from X to Y.
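The link distance defined above is a shortest-path length in the site's hyperlink graph, so it can be computed with a breadth-first search. The following is a minimal sketch; the `link_distance` function and the small `site` graph are hypothetical and serve only as an illustration.

```python
from collections import deque


def link_distance(links, source, target):
    """Return the minimum number of hyperlinks from source to target.

    links maps each page to the pages it links to (an assumed
    representation). Returns None when target cannot be reached.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical site graph: page -> pages it links to.
site = {
    "home": ["accounts", "loans"],
    "accounts": ["statement"],
    "loans": ["apply"],
    "statement": ["home"],
}
print(link_distance(site, "home", "statement"))  # 2
```

Because hyperlinks are directed, the distance from X to Y may differ from the distance from Y to X, and some pages may not be reachable at all.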
Web usage mining involves three main tasks:
1. Preprocessing.
2. Pattern Discovery.
3. Pattern Analysis.
Figure 3: Web Usage Mining Tasks
3.1.1 Preprocessing
Preprocessing involves cleaning and structuring web data
to prepare it for pattern discovery. Web usage data
contains noise and missing data, and it depends on the web
site's structure and the web server technology. A simple web
site of plain HTML pages will generate less usage data. A
graphical or dynamic site will generate more data and more
noise.
1. Data Cleaning: In this step all noise and irrelevant data are
removed from the log file. When a browser requests a web page
that contains additional web resources such as images or script
files, it generates several implicit requests. If these requests are
still present when the data mining step is performed,
uninteresting patterns like “Page, Image1, Image3, Image6” may
be found, making the pattern analysis step more complex [1].
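One simple way to drop implicit requests is to filter the requested URLs by file extension. The sketch below assumes that embedded resources can be recognized this way; the extension list and the function names are illustrative choices, not part of the paper.

```python
import os
from urllib.parse import urlparse

# Assumed set of extensions treated as embedded resources; adjust per site.
RESOURCE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_page_request(url):
    """True when the URL does not point at an embedded resource."""
    path = urlparse(url).path  # drops any query string
    extension = os.path.splitext(path)[1].lower()
    return extension not in RESOURCE_EXTENSIONS


def clean_requests(urls):
    """Keep only the requests that look like explicit page views."""
    return [url for url in urls if is_page_request(url)]


requests = ["/index.html", "/img/logo.GIF", "/style.css?v=2", "/loans"]
print(clean_requests(requests))  # ['/index.html', '/loans']
```

Real cleaning usually also removes failed requests and robot traffic, which this sketch leaves out.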
2. Structuration: In structuration, the requests from the
raw log file are grouped by user, user session, page view, visit
and episode. The structure of a site is created by the hypertext
links between page views. The structure can be obtained and
preprocessed in the same manner as the content of a site.
Again, dynamic content (and therefore dynamic links) poses more
problems than static page views. A different site structure may
have to be constructed for each server session.
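Grouping requests by user and session can be sketched as follows. Two things here are assumptions rather than statements from the paper: each request already carries a user identifier (for example an IP address), and a gap of more than 30 minutes between requests starts a new session, which is a common heuristic.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; assumed heuristic, not from the paper


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp_seconds, page) tuples into sessions.

    Returns a list of (user_id, [pages in visit order]) pairs.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in requests:
        by_user[user].append((timestamp, page))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (time, page) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions


log = [("u1", 0, "a"), ("u2", 10, "x"), ("u1", 100, "b"), ("u1", 5000, "c")]
print(build_sessions(log))
# [('u1', ['a', 'b']), ('u1', ['c']), ('u2', ['x'])]
```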
3.1.2 Pattern Discovery: Once the raw logs have been
preprocessed, data mining techniques can be applied to the
dataset to discover new patterns. Pattern discovery draws on
methods and algorithms developed in statistics, data mining,
and pattern recognition [4], and it covers the types of mining
techniques that have been applied to the Web domain. In Web
Usage Mining, a server session is an ordered sequence of pages
requested by a user. Because unique sessions are difficult to
identify, additional prior knowledge is required. Some important
pattern discovery techniques are as follows:
A) Association rule mining
B) Sequential pattern mining
C) Clustering
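As an illustration of the first technique, an association rule such as "sessions that visit the loans page also visit the apply page" is usually judged by its support and confidence. The sketch below computes both over a list of sessions; the function names and the sample sessions are hypothetical.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in pages."""
    wanted = set(pages)
    hits = sum(1 for session in sessions if wanted <= set(session))
    return hits / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimate of P(consequent in session | antecedent in session)."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    both = support(sessions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["home", "loans", "apply"],
    ["home", "loans"],
    ["home", "accounts"],
    ["loans", "apply"],
]
print(support(sessions, {"home", "loans"}))  # 0.5
print(confidence(sessions, {"loans"}, {"apply"}))
```

Here the rule loans -> apply holds in two of the three sessions that contain the loans page.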
3.1.3 Pattern Analysis
WUM can produce unexpected results, and the analysis of the
discovered patterns makes it possible to distinguish interesting
results from uninteresting ones. The difficulty with WUM results
is that it is extremely hard to capture and define the notion of
interestingness [4]. In this step of the WUM process, the
analyst is interested in projecting the discovered patterns onto
the web site structure or its content. For instance, the analyst
may want to see in which sections of the web site the pages
contained in the most frequent sequential patterns are
located [4].
4. WEB LOG FILE
An extended log file contains a sequence of lines containing
ASCII characters terminated by either the sequence LF or
CRLF. Log file generators should follow the line termination
convention for the platform on which they are executed.
Analyzers should accept either form. Each line may contain
either a directive or an entry. Entries consist of a sequence of
fields relating to a single HTTP transaction. Fields are
separated by whitespace; the use of tab characters for this
purpose is encouraged. If a field is unused in a particular entry, a
dash "-" marks the omitted field. Directives record
information about the logging process itself. Lines beginning
with the # character contain directives. The following
directives are defined:
Version: <integer>.<integer>
The version of the extended log file format used. This draft
defines version 1.0.
Fields: [<specifier>...]
Specifies the fields recorded in the log.
Software: string
Identifies the software which generated the log.
Start-Date: <date> <time>
The date and time at which the log was started.
End-Date: <date> <time>
The date and time at which the log was finished.
Date: <date> <time>
The date and time at which the entry was added.
Remark: <text>
Comment information. Data recorded in this field should be
ignored by analysis tools.
Log files are of several types:
1. Access Log File
2. Proxy Access Log File
3. Cache Access Log
4. Error Log File
5. LogFileDateExt
The directives Version and Fields are required and should
precede all entries in the log. The Fields directive specifies the
data recorded in the fields of each entry.
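The format described in this section can be read with a small parser: lines starting with # are directives, the Fields directive names the columns, and a dash marks an omitted field. The sketch below is a simplified illustration; it splits on whitespace only, so quoted fields that contain spaces are not handled, and the sample log is made up.

```python
def parse_extended_log(lines):
    """Parse extended log file lines into (directives, entries).

    directives is a dict of directive name -> raw value; entries is a
    list of dicts keyed by the names given in the Fields directive.
    """
    directives = {}
    fields = []
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # accept both LF and CRLF endings
        if not line.strip():
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            name, value = name.strip(), value.strip()
            directives[name] = value
            if name == "Fields":
                fields = value.split()
            continue
        values = line.split()
        # A dash marks an omitted field; store it as None.
        entry = {f: (None if v == "-" else v) for f, v in zip(fields, values)}
        entries.append(entry)
    return directives, entries


sample = [
    "#Version: 1.0\r\n",
    "#Date: 12-Jan-1996 00:00:00\n",
    "#Fields: time cs-method cs-uri\n",
    "00:34:23 GET /foo/bar.html\n",
    "12:57:34 GET -\n",
]
directives, entries = parse_extended_log(sample)
print(entries[0]["cs-uri"])  # /foo/bar.html
```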
5. WEB CLUSTERING
Clustering is the process of grouping data into classes or
clusters so that objects within a cluster are highly similar to
one another but very dissimilar to objects in other clusters.
Data clustering is under vigorous development and is applied in
many areas, including business, biology, medicine and
chemistry. Owing to the huge amounts of data collected in
databases, cluster analysis has recently become a highly active
topic in data mining research [1][3]. As many publications have
noted, for cluster analysis to work efficiently and effectively,
clustering in data mining typically has the following
requirements:
1. Scalability
2. Ability to deal with different types of attributes
3. Discovery of clusters with arbitrary shape
4. Minimal requirements for domain knowledge to
determine input parameters
5. Ability to deal with noisy data
6. Insensitivity to the order of input records
7. High dimensionality
The research is focused on finding user behavior by using
efficient and effective cluster analysis.
5.1 BASIC CLUSTERING STEPS
5.1.1 Preprocessing and feature selection
Most clustering models assume that every data item is
represented by an n-dimensional feature vector. This step
therefore involves choosing appropriate features and carrying
out the preprocessing and feature extraction needed to measure
the values of the chosen feature set for each data item [2][7]. It
will often be desirable to choose a subset of all the available
features, to reduce the dimensionality of the problem space.
This step often requires a good deal of domain knowledge and
data analysis.
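For web usage data, one possible feature vector is the number of times each page was visited in a session. This representation is an assumption chosen for illustration; the paper does not prescribe a specific feature set. A minimal sketch:

```python
def session_vectors(sessions):
    """Turn sessions (lists of pages) into page-count feature vectors.

    Returns (pages, vectors), where pages gives the meaning of each
    vector position and vectors[i][j] counts visits to pages[j] in
    session i.
    """
    pages = sorted({page for session in sessions for page in session})
    index = {page: i for i, page in enumerate(pages)}
    vectors = []
    for session in sessions:
        vector = [0] * len(pages)
        for page in session:
            vector[index[page]] += 1
        vectors.append(vector)
    return pages, vectors


pages, vectors = session_vectors([["home", "loans", "home"], ["accounts"]])
print(pages)    # ['accounts', 'home', 'loans']
print(vectors)  # [[0, 2, 1], [1, 0, 0]]
```

Restricting `pages` to a chosen subset is one way to reduce the dimensionality mentioned above.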
5.1.2 Similarity measure
The similarity measure plays an important role in clustering,
where a set of objects is grouped into several clusters so that
similar objects are in the same cluster and dissimilar ones are in
different clusters. In clustering, an object is represented by its
features, and the similarity relationship between objects is
measured by a similarity function. This function takes two data
items (or two sets of data items) as input and returns a measure
of the similarity between them.
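Cosine similarity is one widely used similarity function for feature vectors. It is shown here only as an example, since the paper does not name a specific measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns 0.0 when either vector is all zeros, which is a convention
    chosen here to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Vectors that point in the same direction score close to 1 regardless of their length, so a short and a long session with the same page mix are treated as similar.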
5.1.3 Clustering algorithm
Clustering algorithms are general schemes that use particular
similarity measures as subroutines. The choice of clustering
algorithm depends on the desired properties of the final
clustering. Other considerations include the usual time and
space complexity [7][8]. A clustering algorithm attempts to find
natural groups of components (or data) based on some measure
of similarity. The clustering algorithm also finds the centroid
of each group of data. To determine cluster membership, most
algorithms evaluate the distance between a point and the
cluster centroids. The output of a clustering algorithm is
basically a statistical description of the cluster centroids
together with the number of components in each cluster.
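The centroid-based scheme described above can be illustrated with k-means, which is used here as an example of such an algorithm rather than as the paper's own method. In this sketch the initial centroids are supplied by the caller so that the result is deterministic; the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iter=100):
    """Return (centroids, sizes, assignment) for the given points."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iter):
        # Membership: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # memberships are stable, so the centroids are too
        assignment = new_assignment
        # Update: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    sizes = [assignment.count(k) for k in range(len(centroids))]
    return centroids, sizes, assignment


points = [[1, 1], [1, 2], [8, 8], [9, 8]]
centroids, sizes, assignment = k_means(points, [[0, 0], [10, 10]])
print(centroids)  # [[1.0, 1.5], [8.5, 8.0]]
print(sizes)      # [2, 2]
```

The returned centroids and cluster sizes correspond to the statistical description mentioned in the text.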
5.1.4 Result validation
Do the results make sense? If not, we may want to iterate back
to some prior stage. It may also be useful to do a test of
clustering tendency, to try to guess if clusters are present at all;
note that any clustering algorithm will produce some clusters
regardless of whether or not natural clusters exist [9][10].
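One way to check whether a clustering makes sense is an internal validation index. The sketch below computes the mean silhouette coefficient, chosen here as an example since the paper does not name an index. Values near 1 suggest compact, well-separated clusters, and values near or below 0 suggest a poor grouping. Note that this evaluates a given clustering and is not itself a test of clustering tendency.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a labelled set of points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good > bad)  # True
```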
5.2 Clustering Algorithms:
5.2.1 Hierarchical algorithms: Hierarchical algorithms (HA)
provide a hierarchical grouping of the objects. There are two
approaches, bottom-up and top-down [6]. In the bottom-up
approach, each object starts as a separate cluster and clusters
are merged until, at the end, all objects belong to the same
cluster. In the top-down approach, all objects start in a single
cluster, which is split until each object constitutes a separate
cluster. A key aspect of these algorithms is the definition of the
distance measure between objects and between clusters. The
advantage of hierarchical algorithms is that the validation
indices (correlation, inconsistency measure) that can be defined
on the clusters can be used to determine the number of
clusters.
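The bottom-up approach can be sketched as follows. The distance between clusters is taken to be the smallest distance between any two of their members (single linkage), which is one of several possible definitions and an assumption of this example. Merging stops once a requested number of clusters remains.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2, points):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points, num_clusters):
    """Bottom-up clustering; returns (clusters, merges).

    clusters is a list of sorted point-index lists; merges records each
    merge as (cluster_a, cluster_b, distance), in the order performed.
    """
    clusters = [[i] for i in range(len(points))]  # every object alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters], merges


points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters, merges = agglomerate(points, 3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

The merge distances recorded in `merges` show where a large jump occurs, which is one informal way to pick the number of clusters.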
5.2.2 Density-based algorithms: Density-based algorithms start
by searching for core objects and then grow clusters from these
cores by searching for objects that lie in a neighborhood within
a given radius of an object [7]. The advantage of this type of
algorithm is that it can detect clusters of arbitrary shape and
can filter out noise.
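DBSCAN is a well-known algorithm of this kind and is used below as an illustration. In this minimal sketch `eps` is the neighborhood radius and `min_pts` is the number of neighbors (including the point itself) needed for a core object; both names follow common usage and are not taken from the paper. The neighbor search is a plain linear scan, so the sketch is only suitable for small data sets.

```python
import math

NOISE = -1


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1  # points[i] is a core object: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep growing
    return labels


points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11],
          [50, 50]]
print(dbscan(points, eps=1.5, min_pts=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has too few neighbors to join any cluster and is reported as noise.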
5.2.3 Grid-based algorithms: Grid-based algorithms (GBA)
use a hierarchical grid structure to decompose the object space
into a finite number of cells [6][7]. The advantage of this
approach is its fast processing time, which is in general
independent of the number of data objects.
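The idea of working on cells rather than on individual objects can be sketched with a single-level grid, which is a simplification of the hierarchical grid structure described above. Points are binned into cells, cells holding at least `min_count` points are treated as dense, and adjacent dense cells are joined into clusters. The parameter names and the density rule are assumptions of this example.

```python
from collections import Counter
from itertools import product


def grid_cells(points, cell_size):
    """Count how many points fall into each grid cell."""
    counts = Counter()
    for point in points:
        cell = tuple(int(x // cell_size) for x in point)
        counts[cell] += 1
    return counts


def grid_clusters(points, cell_size, min_count):
    """Join adjacent dense cells into clusters (lists of cells)."""
    counts = grid_cells(points, cell_size)
    dense = {cell for cell, n in counts.items() if n >= min_count}
    clusters = []
    seen = set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            cell = stack.pop()
            component.append(cell)
            # Visit the neighboring cells in every dimension.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


points = [[0.1, 0.2], [0.5, 0.5], [1.2, 0.3], [1.7, 0.9],
          [8.1, 8.2], [8.5, 8.9], [4.0, 4.0]]
print(grid_clusters(points, cell_size=1.0, min_count=2))
# [[(0, 0), (1, 0)], [(8, 8)]]
```

After the single pass that bins the points, the clustering step works only on the cells, which is why the processing time depends mainly on the number of cells.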
CONCLUSION
The main goal of this paper is to analyze the hidden information
in large amounts of log data. Among the different mining
processes, the paper emphasizes clustering. We described
various clustering algorithms for grouping similar web access
patterns. These algorithms serve as the foundation for web usage
clustering, and we conclude that web mining methods and
clustering techniques can be used by self-adaptive and
intelligent websites to provide personalized service and
performance optimization.
REFERENCES
[1] Ajith Abraham, “Business Intelligence from Web Usage
Mining”, Journal of Information & Knowledge Management,
Vol. 2, No. 4 (2003) 375-390.
[2] M.N. Murty, A.K. Jain, P.J. Flynn, “Data clustering: a
review”, ACM Computing Surveys 31 (3) (1999) 264–323.
[3] Hengshan Wang, Cheng Yang, Hua Zeng, “Design and
Implementation of a Web Usage Mining Model Based On
Fpgrowth and Prefixspan”, Communications of the IIMA,
Volume 6, Issue 2.
[4] Jaideep Srivastava, Robert Cooley, Mukund
Deshpande, Pang-Ning Tan, “Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data”, Volume 1,
Issue 2, Page 13.
[5] V.V.R. Maheswara Rao, Dr. V. Valli Kumari, Dr.
KVSVN Raju, “Understanding User Behavior using Web Usage
Mining”, International Journal of Computer Applications (0975
– 8887), Volume 1, No. 7.
[6] Ji He, Man Lan, Chew-Lim Tan, Sam-Yuan Sung,
Hwee-Boon Low, “Initialization of Cluster refinement
algorithms: a review and comparative study”, Proceedings of
the International Joint Conference on Neural Networks,
Budapest, 2004.
[7] Renata Ivancsy, Ferenc Kovacs, “Clustering Techniques
Utilized in Web Usage Mining”, International Conference on
Artificial Intelligence, Knowledge Engineering and Data Bases,
Madrid, Spain, February 15-17, 2006 (pp. 237-242).
[8] M.N. Murty, A.K. Jain, P.J. Flynn, “Data clustering: a
review”, ACM Computing Surveys 31 (3) (1999) 264–323.
[9] Bradley P.S., Fayyad U.M., “Refining Initial Points for
K-means Clustering”, Advances in Knowledge Discovery and
Data Mining, MIT Press.
[10] Ruoming Jin, Anjan Goswami and Gagan Agrawal, “Fast
and exact out-of-core and distributed k-means clustering”,
Knowledge and Information Systems, Volume 10, Number 1,
July 2006.