Proceedings of the 5th National Conference; INDIACom-2011
Computing For Nation Development, March 10-11, 2011
Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi

"Clustering Algorithm Employ in Web Usage Mining": An Overview

Harish Kumar1 and Anil Kumar2
1 PhD Scholar, Mewar University, Chittorgarh
2 Professor, MIET, Meerut
[email protected] and [email protected]

ABSTRACT
The Internet is one of the fastest-growing areas of information gathering. Web users leave many records of their activities in the form of data while working on the Internet, and this huge amount of data serves as raw material for information and knowledge discovery. Proper mining processes are needed to extract this information. Web usage mining, also known as web log mining, is the process of extracting interesting patterns from web access logs. Web servers record and collect data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help us understand user behavior and the web structure, thereby improving the design of this huge collection of resources. Web server log files and customer navigation data can be mined meaningfully, and user access patterns can be forecast to identify web users' behavior. Clustering is effective and easy to apply with satisfactory results: it selects behavior from search web logs obtained from previous search sessions and extracts users' behavior by using clustering and path-designing algorithms. A clustering algorithm identifies behavior patterns in the cleaned web log data, grouping web data into "classes" so that similar objects are in the same class and dissimilar objects are in different classes. Path optimization of the web tree structure is used to reduce the search path for a web page.

KEYWORDS
KDD, web mining, clustering, classes.

1. INTRODUCTION
The explosive expansion of online data due to the Internet and the widespread use of databases have created a huge need for KDD methodologies. Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for mining useful information or knowledge from data [1]. Users leave navigation traces, which can be drawn upon as a basis for user behavior analysis. In the field of web applications, similar analyses have been successfully carried out with methods of web usage mining [2][3]. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, web user behavior, and high-performance computing to deliver advanced business intelligence and web discovery solutions [3][4]. Data mining is a powerful technology with great potential to help various industries focus on the most important information in their data warehouses, and it can be viewed as a result of the natural evolution of information technology.

2. THE FOUNDATIONS OF DATA MINING
Information collection through data mining has allowed companies to earn thousands upon thousands of dollars in revenue by better using the Internet to gain business intelligence that helps them make vital business decisions [3].
The evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Commercial databases are growing at unprecedented rates: a recent META Group survey of data warehouse projects found that 19% of respondents were beyond the 50-gigabyte level, while 59% expected to be there by the second quarter of 1996 [1]. Table 1 shows the evolutionary steps of data mining. The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning.

Table 1: Evolution of Data Mining
Development Step                            | Enabling Technologies                                             | Product Providers              | Characteristics
Data Collection (1960s)                     | Computers, disks, tape drives                                     | IBM                            | Retrospective, static data delivery
Data Access (1980s)                         | RDBMS, SQL, ODBC                                                  | Oracle, Sybase, IBM, Microsoft | Dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | OLAP, multidimensional databases, data warehouses                 | Pilot, Comshare, Arbor         | Retrospective, dynamic data delivery at multiple levels
Data Mining (emerging today)                | Advanced algorithms, multiprocessor computers, massive databases  | Pilot, Lockheed, IBM, SGI      | Prospective, proactive information delivery

3. WEB DATA MINING
Web data mining is a technique used to crawl through various web resources to collect required information, enabling an individual or a company to promote business, understand marketing dynamics, track new promotions floating on the Internet, and so on. In particular, our focus is on web data mining research in the context of web user behavior in the banking sector. Web data mining is categorized into three areas: Web Content Mining (WCM), Web Structure Mining (WSM), and Web Usage Mining (WUM) [1].

Figure 1: Categorization of Web Data Mining

Web content mining (WCM) aims to find useful information in the content of web pages, e.g., free text and semi-structured data such as HTML code, pictures, and various downloadable files. Web structure mining (WSM) is used to generate a structural summary of a web site and its pages; it tries to discover the link structure of the hyperlinks at the inter-document level, whereas web content mining mainly focuses on the structure within a document. Web usage mining (WUM) is applied to the data generated by visits to a web site, especially those contained in web log files. This paper highlights and discusses research issues involved in web usage mining; we believe that mining web usage behavior will be a topic of exploratory research in the near future.

3.1 WEB USAGE MINING
Web usage mining integrates data mining techniques with the Internet. The WWW is an immense source of data that can be collected from web content or from web usage. Web usage mining is the type of web mining activity that involves the automatic discovery of user access patterns from one or more web servers. Organizations often generate and collect large volumes of data in their daily operations; most of this information is generated automatically by web servers and collected in server access logs.
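As a concrete illustration of what such a server access log contains, the following sketch parses entries written in the widely used Common Log Format into structured records. The field layout follows that standard format; the field names, the parsing function, and the sample line are illustrative assumptions rather than part of any particular server's configuration.

```python
import re
from datetime import datetime

# One entry per line in the Common Log Format (an illustrative assumption;
# many servers use extended or combined formats with additional fields):
#   host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Turn one access-log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

if __name__ == "__main__":
    # Hypothetical sample entry from a banking site, for illustration only.
    sample = '10.0.0.1 - - [10/Mar/2011:09:15:32 +0530] "GET /loans/home.html HTTP/1.1" 200 5120'
    print(parse_log_line(sample))
```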
Figure 2: Categorization of Web Usage Mining

In banking web mining, data can be collected at the server side, at the client side, at proxy servers, or gathered from various other resources. Web data mining deals with four types of data:
1. Content
2. Structure
3. Usage
4. User profile
Web usage data describes the pattern of usage of web pages, such as IP addresses, page references, and the date and time of accesses. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help us understand user behavior and the web structure, thereby improving the design of this colossal collection of resources. The link distance between two pages X and Y is the minimum number of hyperlinks the user has to follow to go from X to Y. Web usage mining involves three main tasks:
1. Preprocessing
2. Pattern discovery
3. Pattern analysis

Figure 3: Web Usage Mining Tasks

3.1.1 Preprocessing
Preprocessing involves cleaning and structuring the web data to prepare it for the pattern discovery work. Web usage data contains noise and missing values, and its volume depends on the web site's structure and web server technology: a simple web site of plain HTML pages will generate less usage data, while a graphical or dynamic site will generate more data and more noise. (A short code sketch of the two steps below appears at the end of Section 3.1.)
1. Data cleaning: In this step all noise and irrelevant data are removed from the log file. When a web page containing additional web resources such as images or script files is requested, several implicit requests are generated by the web browser. If these requests are still present when the data mining step is performed, uninteresting patterns like "Page, Image1, Image3, Image6" may be found, making the pattern analysis step more complex [1].
2. Structuration: In structuration, the requests from the raw log file are grouped by user, user session, page view, visit, and episode. The structure of a site is created by the hypertext links between page views, and this structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore dynamic links) poses more problems than static page views; a different site structure may have to be constructed for each server session.

3.1.2 Pattern Discovery
Once the raw logs have been preprocessed, data mining techniques can be applied to the dataset to discover new patterns. Pattern discovery draws on various methods and algorithms developed in statistics, data mining, and pattern recognition [4], and it determines the type of mining technique that is applied to the web domain. In web usage mining, a server session is an ordered sequence of pages requested by a user; because of the difficulty of identifying unique sessions, additional prior knowledge is required. Some important pattern discovery techniques are:
A) Association rule mining
B) Sequential pattern mining
C) Clustering

3.1.3 Pattern Analysis
WUM produces a large number of patterns, and the analysis of these patterns allows interesting results to be distinguished from uninteresting ones. The issue with WUM results is that it is extremely hard to capture and define the notion of interestingness [4]. In this step of the WUM process, the analyst is interested in projecting the patterns discovered onto the web site structure or onto its content. For instance, the analyst may want to see in which sections of the web site the pages contained in the most frequent sequential patterns are situated [4].
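To make the data cleaning and structuration steps above concrete, here is a minimal sketch that removes implicit requests for embedded resources and then groups the remaining requests into per-visitor sessions. It assumes the record layout produced by the parsing sketch in Section 3.1; the extension list and the 30-minute inactivity timeout are common heuristics adopted purely for illustration.

```python
from datetime import timedelta

# Extensions treated as implicit (embedded) requests -- an assumed heuristic
# that a real deployment would tune to the site being analyzed.
IMPLICIT_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def clean(entries):
    """Data cleaning: drop embedded-resource requests and failed requests."""
    return [
        e for e in entries
        if not e["url"].lower().endswith(IMPLICIT_EXTENSIONS)
        and e["status"] < 400
    ]

def sessionize(entries):
    """Structuration: group requests per host, splitting on long idle gaps."""
    per_host = {}
    for e in sorted(entries, key=lambda e: (e["host"], e["time"])):
        sessions = per_host.setdefault(e["host"], [[]])
        current = sessions[-1]
        if current and e["time"] - current[-1]["time"] > SESSION_TIMEOUT:
            current = []
            sessions.append(current)
        current.append(e)
    # Each session is reduced to the ordered list of page URLs it contains.
    return [[e["url"] for e in s]
            for host_sessions in per_host.values() for s in host_sessions]
```

In practice, identifying users purely by host or IP address is crude; where cookies or authenticated user identifiers are available, they give more reliable session boundaries.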
4. WEB LOG FILE
An extended log file contains a sequence of lines of ASCII characters terminated by either the sequence LF or CRLF. Log file generators should follow the line termination convention of the platform on which they are executed, and analyzers should accept either form. Each line may contain either a directive or an entry. Entries consist of a sequence of fields relating to a single HTTP transaction. Fields are separated by whitespace; the use of tab characters for this purpose is encouraged. If a field is unused in a particular entry, a dash "-" marks the omitted field. Directives record information about the logging process itself; lines beginning with the # character contain directives. The following directives are defined:
Version: <integer>.<integer> - the version of the extended log file format used (this draft defines version 1.0).
Fields: [<specifier>...] - specifies the fields recorded in the log.
Software: <string> - identifies the software which generated the log.
Start-Date: <date> <time> - the date and time at which the log was started.
End-Date: <date> <time> - the date and time at which the log was finished.
Date: <date> <time> - the date and time at which the entry was added.
Remark: <text> - comment information; data recorded in this field should be ignored by analysis tools.
The directives Version and Fields are required and should precede all entries in the log; the Fields directive specifies the data recorded in the fields of each entry. Log files come in several types:
1. Access log file
2. Proxy access log file
3. Cache access log
4. Error log file
5. LogFileDateExt

5. WEB CLUSTERING
Clustering is the process of assembling data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Data clustering is under vigorous development and is applied in many areas, including business, biology, medicine, and chemistry. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research [1][3]. For cluster analysis to work efficiently and effectively, as much of the literature has noted, clustering in data mining has the following typical requirements:
1. Scalability
2. Ability to deal with different types of attributes
3. Discovery of clusters with arbitrary shape
4. Minimal requirements for domain knowledge to determine input parameters
5. Ability to deal with noisy data
6. Insensitivity to the order of input records
7. Ability to handle high dimensionality
Our research is focused on finding user behavior by using efficient and effective cluster analysis.

5.1 BASIC CLUSTERING STEPS
5.1.1 Preprocessing and feature selection
Most clustering models assume that all data items are represented by n-dimensional feature vectors. This step therefore involves choosing an appropriate feature set, and performing appropriate preprocessing and feature extraction on the data items to measure the values of the chosen features [2][7]. It is often desirable to choose a subset of all the available features to reduce the dimensionality of the problem space. This step often requires a good deal of domain knowledge and data analysis.
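As one possible, assumed realization of this step for web usage data, the sketch below turns each session produced by the preprocessing sketch into a fixed-length feature vector of page-visit counts over a page vocabulary; the minimum-support cutoff used to prune rare pages is a hypothetical parameter chosen purely for illustration.

```python
def build_feature_vectors(sessions, min_support=2):
    """Represent each session as an n-dimensional vector of page-visit counts.

    sessions: list of sessions, each an ordered list of page URLs
    min_support: pages seen in fewer sessions than this are dropped to
                 reduce the dimensionality of the feature space (assumed cutoff)
    """
    # Feature selection: keep only pages that occur in enough distinct sessions.
    session_counts = {}
    for session in sessions:
        for page in set(session):
            session_counts[page] = session_counts.get(page, 0) + 1
    vocabulary = sorted(p for p, c in session_counts.items() if c >= min_support)
    index = {page: i for i, page in enumerate(vocabulary)}

    # Feature extraction: count how often each retained page occurs in a session.
    vectors = []
    for session in sessions:
        vec = [0] * len(vocabulary)
        for page in session:
            if page in index:
                vec[index[page]] += 1
        vectors.append(vec)
    return vocabulary, vectors
```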
5.1.2 Similarity measure
The similarity measure plays an important role in clustering, where a set of objects is grouped into several clusters so that similar objects end up in the same cluster and dissimilar ones in different clusters. In clustering, an object is represented by its features, and the similarity relationship between objects is measured by a similarity function: a function that takes two sets of data items as input and returns a similarity measure between them as output.

5.1.3 Clustering algorithm
Clustering algorithms are general schemes that use particular similarity measures as subroutines. The choice of clustering algorithm depends on the desired properties of the final clustering; other considerations include the usual time and space complexity [7][8]. A clustering algorithm attempts to find natural groups of components (or data) based on some similarity, and it also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids together with the number of components in each cluster.

5.1.4 Result validation
Do the results make sense? If not, we may want to iterate back to a prior stage. It may also be useful to test for clustering tendency, to try to guess whether clusters are present at all; note that any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist [9][10].

5.2 CLUSTERING ALGORITHMS
5.2.1 Hierarchical algorithms
Hierarchical algorithms provide a hierarchical grouping of the objects. There are two approaches, bottom-up and top-down [6]. In the bottom-up approach, at the beginning of the algorithm each object represents a different cluster, and at the end all objects belong to the same cluster. In the top-down approach, at the start of the algorithm all objects belong to one cluster, which is split repeatedly until each object constitutes a different cluster. A key aspect of these algorithms is the definition of the distance measures between objects and between clusters. An advantage of hierarchical algorithms is that validation indices (correlation, inconsistency measure) defined on the clusters can be used to determine the number of clusters.

5.2.2 Density-based algorithms
Density-based algorithms start by searching for core objects and grow clusters from these cores by searching for objects lying in a neighborhood within a given radius of an object [7]. The advantage of these algorithms is that they can detect clusters of arbitrary shape and can filter out noise.

5.2.3 Grid-based algorithms
Grid-based algorithms use a hierarchical grid structure to decompose the object space into a finite number of cells [6][7]. The advantage of this approach is its fast processing time, which is in general independent of the number of data objects.
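To illustrate the bottom-up approach of Section 5.2.1, the following sketch, assuming SciPy is available, builds an agglomerative grouping of the session feature vectors from the earlier sketch and cuts the resulting tree into a chosen number of clusters; the linkage method and the cluster count of four are illustrative choices, not recommendations.

```python
# A minimal sketch of bottom-up (agglomerative) clustering, assuming SciPy is
# installed; `vectors` is the output of the feature-extraction sketch above.
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_sessions(vectors, num_clusters=4):
    """Merge session vectors bottom-up, then cut the tree into num_clusters groups."""
    # Each session starts as its own cluster; average linkage on Euclidean
    # distance repeatedly merges the two closest clusters (illustrative choices).
    tree = linkage(vectors, method="average")
    # Cut the hierarchy so that at most num_clusters clusters remain.
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")
    return labels  # labels[i] is the cluster id assigned to vectors[i]
```

SciPy also exposes an inconsistency statistic computed on the linkage tree, which can support the cluster-count decision mentioned in Section 5.2.1.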
CONCLUSION
The main goal of this paper is to analyze hidden information in large amounts of log data. The paper emphasizes clustering among the different mining processes and describes various clustering algorithms for grouping similar kinds of web access patterns. These algorithms serve as a foundation for the web usage clustering described above, and we conclude that web mining methods and clustering techniques can be used to build self-adaptive and intelligent websites that provide personalized service and performance optimization.

REFERENCES
[1] Ajith Abraham, "Business Intelligence from Web Usage Mining", Journal of Information & Knowledge Management, Vol. 2, No. 4 (2003), pp. 375-390.
[2] M.N. Murty, A.K. Jain, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3 (1999), pp. 264-323.
[3] Hengshan Wang, Cheng Yang, Hua Zeng, "Design and Implementation of a Web Usage Mining Model Based on FP-Growth and PrefixSpan", Communications of the IIMA, Vol. 6, Issue 2.
[4] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", SIGKDD Explorations, Vol. 1, Issue 2, p. 13.
[5] V.V.R. Maheswara Rao, V. Valli Kumari, KVSVN Raju, "Understanding User Behavior using Web Usage Mining", International Journal of Computer Applications (0975-8887), Vol. 1, No. 7.
[6] Ji He, Man Lan, Chew-Lim Tan, Sam-Yuan Sung, Hwee-Boon Low, "Initialization of Cluster Refinement Algorithms: A Review and Comparative Study", Proceedings of the International Joint Conference on Neural Networks, Budapest, 2004.
[7] Renata Ivancsy, Ferenc Kovacs, "Clustering Techniques Utilized in Web Usage Mining", Proceedings of the International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, Madrid, Spain, February 15-17, 2006, pp. 237-242.
[8] M.N. Murty, A.K. Jain, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3 (1999), pp. 264-323.
[9] P.S. Bradley, U.M. Fayyad, "Refining Initial Points for K-Means Clustering", Advances in Knowledge Discovery and Data Mining, MIT Press.
[10] Ruoming Jin, Anjan Goswami, Gagan Agrawal, "Fast and Exact Out-of-Core and Distributed K-Means Clustering", Knowledge and Information Systems, Vol. 10, No. 1, July 2006.