Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 Knowledge Discovery on Web Information Repository Punyaban Patel, Bibekananda Jena, Bibhudatta Sahoo Asst. Professor, Department of Information Technology, PIET, Rourkela, ORISSA Asst. Professor, Department of Electronics and Telecommunication, PIET, Rourkela, ORISSA Asst. Professor, Department Of Computer Science & Engineering, NIT Rourkela, ORISSA Abstract Web mining techniques seek to extract knowledge from Web data. Web mining is a cross point of database, information retrieval and artificial intelligence. Web content mining (WCM), web structure mining (WSM) and web usage-mining (WUM) buildup the whole web mining. This article critically reviews the research trends in three major area of Web mining(— content, structure, and usage—as well as emerging work in Semantic Web mining and research issues, techniques and development efforts with a goal of providing a broad overview rather than an in-depth analysis are presented in this paper. 1. Introduction Web mining is an integrated technology in which several research fields are involved, such as data mining, computational linguistics, statistics, informatics and so on. Different researchers from different communities disagree with each other on what Web mining is exactly [2]. Many unrelated projects have explored different aspects of this problem as well. Since Web mining derives from data mining, its definition is similar to the well-known definition of data mining [3]. Nevertheless, Web mining has many unique characteristics compared with data mining. Firstly, the source of Web mining is web documents. We consider the use of the Web as a middleware in mining database and the mining of logs, user profiles on the Web server still belong to the category of traditional data mining. Secondly, the Web is a directed-graph consists of document nodes and hyperlinks. Therefore, the pattern identified can be possibly about the content of documents or about the structure of the Web. Moreover, the Web documents are semi-structural or non-structural with little machine-readable semantic while the source of data mining is confined to the structural data in www.ijacta.com database. As a result, some traditional data mining methods are not applicable to Web mining. Even if applicable, they must be based on the preprocessing of documents. Data mining is used to identify valid, novel, potentially useful and ultimately understandable pattern from data collection in database community [3]. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The most commonly used techniques in data mining are: Artificial neural networks, Association Rule, Classification, Clustering, Data visualization, Decision trees, Genetic algorithms, Nearest neighbor method, and role induction. However, there is little work that deals with unstructured and heterogeneous information on the Web. Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage log. Based on the primary kind of data used in the mining process, Web mining tasks are categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web mining is a new research issue under dispute, which draws great interest from many communities. Currently, there is no agreement about Web mining yet. It needs more discussion among researchers in order to define what it is exactly. Meanwhile, the development of Web mining system will promote its research in turn. In this paper a preliminary discussion about Web mining is given. Firstly we present the definition of Web mining and expound the relationships between Web mining, traditional data mining and information retrieval on the Web. Then, we present the taxonomy of Web mining and briefly describe the functions of Web text mining. Finally, we present the conclusions and challenges for Web mining in future. Page 49 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 2. Web Mining Process Model The Web mining can be defined as: Web mining is the activity of identifying patterns p implied in large document collection C, which can be denoted by a mapping : C p. The general process of Web Mining may be four different stages(Figure-1): (i) resource discovery, (ii) information pre-processing, (iii) generalization, and (iv) pattern analysis. Resource Discovery Information Pre-processing Generalization Pattern Analysis Figure 1: General Process of Web Mining Resource Discovery, whose task is retrieving web documents, is the process of retrieving the web resources. Web information retrieval is the process to find a subset S of appropriate number of documents relevant to a certain query q from large document collection C, which can also be denoted by a mapping : (c, q) S . Information Pre-processing is the transform process of the result of resource discovery. Generalization is to uncover general patterns at individual web sites and across multiple sites [4]. In this step machine learning and traditional data mining techniques are typically used. Pattern Analysis is the validation of the mined patterns. 3. A Taxonomy Of Web Mining Researchers have identified three broad categories of Web mining: Web content mining, Web structure mining and the Web usage mining [18]. The taxonomy is shown in figure-2 below. Web mining Web Content Mining Text Mining Multimedia Mining Web Structure Mining Search Result Mining Web External Structure Mining Web Usage Mining General Access Pattern Tracking Customize Usage Tracking Web URL Mining Web Internal Structure Mining Figure 2: A Taxonomy of Web Mining www.ijacta.com Page 50 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 3.1 Web content mining is the application of data mining techniques to content published on the Internet, usually as HTML (semi-structured), plaintext (unstructured), or XML (structured) documents. Nearest Neighbor algorithm [8], Naive Bayes algorithm [9], etc. b) Text clustering: The difference between text clustering and categorization lies in that no taxonomy is predefined for clustering. The goal of text clustering is usually to divide documents collection C into a set of clusters such that inter- cluster similarity is minimized and intra-cluster similarity is maximized. We can apply text clustering to the organization of retrieval results returned by search engine. Then users merely need to examine the clusters relevant to their query, which dramatically reduces the time and the efforts spent in sifting through the long list of documents. Numerous textclustering algorithms have been proposed so far, all of which fall into two types. One is hierarchical clustering such as G-HAC algorithm [8], the other is partitioned clustering represented by k-Means algorithm [l0]. The Web Text Mining is given in figure-3 below. Web content mining can be divided into text mining (including text file, HTML, DHTML, XML document, etc.), multimedia mining and the Search Result Mining. The main categories of Web text mining are text categorization, text clustering, association analysis, trend prediction and so on. 3.1.1 Text mining a) Text categorization: Given a predefined taxonomy, each document in collection C is categorized into one appropriate class or more. In this way, it is not only convenient for users to browse documents but also easier to search documents by specifying class. Currently there are many text categorization algorithms, in which the most commonly used are k- Categorizing Agent The web Gathering Agent Preprocess Agent Document Repository Feature DB Document Cube MDDA Engine Interface Agent Document View Clustering Agent Figure-3: Web Text Mining System 3.1.2 Multimedia Mining Multimedia Data essentially refers to data such as text, numeric, images, video, audio, graphical, temporal, relational and categorical data. Multimedia Mining is discovering knowledge from large amounts of different types of multimedia data. It involves the extraction of implicit knowledge, multimedia data relationships, or other patterns not explicitly stored in multimedia databases. The multimedia mining involves two basic steps: • Extraction of appropriate features from the data; and www.ijacta.com • Selection of data mining methods to identify the desired information. Multimedia mining attempts to automatically discover knowledge in multimedia datasets. It is a comparatively young discipline and the research community is still adapting techniques from more established fields, such as artificial intelligence and pattern recognition, and developing new techniques particularly tailored for multimedia mining applications. The exponential growth of multimedia data in consumer as well as scientific applications, Page 51 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 however, requires fast progress of research in the field in order to find adequate techniques to analyzes and manage such data, including feature extraction, high-dimensional indexing, similarity based search, and personalized search and retrieval. The major challenge is to unify all these existing methods from various fields into a multimedia-mining framework. Although multimedia mining is drawing more and more interests, text mining is the most fundamental and important task as text is the primary information vehicle. The issue of text mining is of importance to publishers who hold large databases of information requiring indexing for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. 3.1.3 Search Result Mining (i) Automatic classification in documents using searching engine Search engine can index a mass of disordered data on Web. For example, firstly, Web crawlers down load Web pages from Web sites. Secondly, searching engine extracts describable index information from these Web pages to store them with URL into searching engine base. Thirdly using data mining methods we automatically class them into usable Web page classification system organized by hyperlink structure. (ii) Implement friendly interface from searching result: In the classification system there are many unrelated information. If we can analyze and cluster them in the searching results, searching effect will be highly improved. 3.2 Web structure mining operates on the Web’s hyperlink structure. Conventionally, a website structure may be evaluated with the Card Sorting technique; however, this technique is difficult to implement for a large and information-centric website. There are a number of commercial and free evaluation tools available. Most of the tools are based on the user logging results that are stored in the proxy server or through the embedment of client-side programming code to capture the client’s events. A recent publication by Miller and Remington [14] pointed out that the structure of linked pages (the site’s information architecture) has a decisive impact on the usability. Previous studies including www.ijacta.com Shneiderman [16], and Larson and Czerwinski [13] also provided suggestions on how to create the best structure. Larson and Czerwinski [13] found that users took significantly longer time to find items in a structure with depth than breadth. They compared a three-tiered, eight-links-per-page (8 x 8 x 8) structure with two-tiered, 16 and 32 links per page structures (16 x 32 and 32 x 16). Bernard [1] devised a metric called Hypertext Accessibility Index (HAI) to model the informational accessibility of a particular hypertext structure compared. It explained what Larson and Czerwinski [13] as well as Kiger [12] found about the navigation time of shallow and deep structure in quantitative terms. Kao et al. [11] proposed a LAMIS method with the entropy analysis to distill the information of a website. Empirically, the probability is directly linked to the transition probability of the web log. A study on the use of entropy theory to merge the website content was performed by Chen et. al. [6]. Based on the Markov model, Jenamani et al. [2, 15] proposed several algorithms to examine (1.) the most accessed pages, (2.) the company’s interest, (3.) the visitor’s interest pages, (4.) the current visitor’s interest pages, (5.) the customized index generation algorithms. However, our system objectives are not to provide a dynamic hyperlink structure, we will concentrate on the study of building a static optimal website structure. Web structure mining can be further divided into external structure (hyperlink between web page) mining, internal structure (of a web page) mining and URL mining. Craven et al. [7] used first-order learning in categorizing hyperlinks to estimate the relationship between web pages. They also made use of the anchor text in hyperlink to categorize the target web page. Brin et al. [9] took both the citation counting of referee pages and the importance of referrer pages into account to find pages that are “authorities” on particular topics. Spertus et al. [6] proposed some heuristic rules by investigating the internal structure and the URL of web pages. 3.3 Web usage mining (WUM) analyzes results of user interactions with a Web server, including Web logs, clickstreams, and database transactions at a Web site or a group of related sites(shown in figure4). Web usage mining introduces privacy concerns and is currently the topic of extensive debate. Page 52 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 Data cleaning Server Log Data Transaction identification Clean log Registrat ion data Data Integration Formatted data Trans action data Pattern analysis Path Analysis OLAP/ Visualizatio n Tools Associati on Rules Integr ated data Document & usage Attributes Pattern Discovery Transfo rmation Database Query language Sequentia l Patterns Clusters & Classifica tion rules Knowledge Query Mechanism Intelligent Agents Fig 4: A General Architecture for Web Usage Mining Web usage mining is the type of web mining activity that involves the automatic discovery of user access patterns from one or more web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is usually generated automatically by web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts. Analyzing such data can help these organizations to determine the life time value of customers, cross marketing strategies across products, and effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to structure a web site in order to create a more effective presence for the organization. Using intranet technologies in organizations, such analysis can shed light on more effective management of workgroup communication and organizational infrastructure. Finally, for www.ijacta.com organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users. The problems existed in WUM is that the physical structures of Web sites are different from user accessing patterns and it is very difficult to find out users, sessions and transactions. WUM can be decomposed into the following subtasks [17] is shown in figure-5: 3.3.1 Data Pre-processing for Mining: It is necessary to perform a data preparation to convert the raw data for further process. It has separate subsections as follows: Content Preprocessing: Content preprocessing is the process of converting text, image, scripts and other files into the forms that can be used by the usage mining. Page 53 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 Raw Logs User Session Files Preprocessing Rules, Patterns and Statistics Mining Algorithm “Interesting” Rules, Patterns and Statistics Pattern Analysis Site Files techniques can be used to discover unordered Figure 5: The Subtasks of WUM correlation between items found in a database of Structure Preprocessing: The structure of a web site is formed by the hyperlinks between page views. The structure preprocessing can be treated similar as the content preprocessing. However, each server session may have to construct a different site structure than others. Usage Preprocessing: The inputs of the preprocessing phase may include the web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications. 3.3.2 Mining Algorithm and Pattern Discovery This is the key component of the web mining. Pattern discovery covers the algorithms and techniques from several research areas, such as data mining, machine learning, statistics, and pattern recognition. It has separate subsections as follows. Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report can be potentially useful for improving the system performance, enhancing the security of the system, facilitation the site modification task, and providing support for marketing decisions. Association Rules: In the web domain, the pages, which are most often referenced together, can be put in one single server session by applying the association rule generation. Association rule mining www.ijacta.com transactions. Clustering: Clustering analysis is a technique to group together users or data items (pages) with the similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies. Classification: Classification is the technique to map a data item into one of several predefined classes. The classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifier, Support Vector Machines etc. Sequential Pattern: This technique intends to find the inter-session pattern, such that a set of the items follows the presence of another in a time-ordered set of sessions or episodes. Sequential patterns also include some other types of temporal analysis such as trend analysis, change point detection, or similarity analysis. Dependency Modeling: The goal of this technique is to establish a model that is able to represent significant dependencies among the various variables in the web domain. The modeling technique provides a theoretical framework for analyzing the behavior of users, and is potentially useful for predicting future web resource consumption. 3.3.3 Pattern Analysis Pattern Analysis is a final stage of the whole web usage mining. The goal of this process is to eliminate the irrelative rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. The output of web mining algorithms is often not in the form suitable for direct human consumption, and thus need to be transform to a format can be assimilate easily. There are two most common approaches for the patter analysis. One is to use the knowledge query mechanism such as SQL, while another is to construct multi-dimensional data cube before perform OLAP operations. All these Page 54 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 methods assume the output of the previous phase has been structured. Rules, Fuzzy Clustering, Web Site Evaluation, Recommender Systems, and Privacy Issues. 4. Web Mining-Research and Practice Since the Researchers have identified three broad categories of Web mining: [19, 20] Web content mining, Web Structure mining and the Web Usage mining, we will discuss some important research contributions in Web mining for knowledge discovery, with a goal of providing a broad overview rather than an in-depth analysis. Web Content and Structure Mining Some researchers combine content and structure mining to leverage the techniques’ strengths. Although not all researchers agree to such a classification, we list research in these two areas together. Fabrizio Sebastini[21] and Soumen Chakrabarti[22] discuss Web content mining techniques in detail, and Johannes Fürnkranz[23] surveys work in Web structure mining , which suggests the following areas (.i)Web as a Database, (ii) Document Classification, (iii)Hubs and Authorities, (iv) Clever: Ranking by Content, (v) Identifying Web Communities, (v) Web Usage Mining 5.Conclusion Web mining is a new field and in its infancy, there are rare researchers ventured in this field, especially the Chinese and Japanese word and the hybrid text mining techniques in our country. The key component of web mining is the mining process itself. A lot of work still remains to be done in adapting known mining techniques as well as developing new ones. The open research problems in the area of web mining are related to: New Types of Knowledge, Distributed Web Mining, Incremental Web Mining, Improved and Mining Algorithms. Recent research has mostly focused on Web usage analysis, partly because of its applicability in ebusiness. We expect privacy issues, distributed Web mining, and Semantic Web mining to attract equal, if not more, interest from the research community. Increased use of Web mining techniques will require that privacy issues be addressed, however. Similarly, aggregating data in a central site and then mining it is rarely scalable; hence the need for distributed mining techniques. Finally, researchers will need to leverage the semantic information the Semantic Web provides. Exposing content semantics and the link explicitly can help in many tasks, including mining the hidden Web—that is, data stored in databases and not accessible through search engines. As new data is published every day, the Web’s utility as an information source will continue to grow. Web usage mining has several applications in ebusiness, including personalization, traffic analysis, and targeted advertising. The development of graphical analysis tools such as Webviz [24] popularized Web usage mining of Web transactions. The main areas of research in this domain are Web log data preprocessing and identification of useful patterns from this preprocessed data using mining techniques. Several surveys on Web usage mining exist.[25, 26] Most data used for mining is collected from Web servers, clients, proxy servers, or server databases, all of which generate noisy data. Because Web mining is sensitive to noise, data cleaning methods are necessary. Jaideep Srivastava and colleagues[25] categorize data preprocessing into subtasks and note that the final outcome of preprocessing should be data that allows identification of a particular user’s browsing pattern in the form of page views, sessions, and clickstreams. Clickstreams are of particular interest because they allow reconstruction of user navigational patterns. Recent work by Yannis Manolopoulos and colleagues[27] provides a comprehensive discussion of Web logs for usage mining and suggests novel ideas for Web log indexing. Such preprocessed data enables various mining techniques. Some of the notable research areas are follows: Adaptive Web Sites, Association www.ijacta.com References: [l] Bernard, M. L., “Examining a Metric for Predicting the Accessibility of Information within Hypertext Structures”, Dissertation Thesis, Wichita State University, 2002 [2] Jenamani M., Mohapatra P. K. J., and Ghose S., “Online Customized Index Synthesis in Commercial Websites”, IEEE Intelligent Systems, Vol.17, No.6, 2002, pp.20-26. [3] Usama Fayyad et al, “The KDD Process for Extracting Useful Knowledge from Volumes of Data”, Communications of the ACM, Vol. 39, No. 11, Nov. 1996, pp. 27-34. [4] O.etzioni. The world wield web: Quagmire or Gold Mining. Communicate of the ACM, (39)11:65-68, 1996 [5] J. Han, M. Kamber, ”Data Mining”, Morgan Kaufmann Publishers, 2001. [6] Chen, Z., Liu, S., Liu, W., Pu, G. and Ma, W., “Building a Web Thesaurus From Web Link Structure”, Page 55 Copyright International Journal of Advance Computing Technique and Applications (IJACTA), ISSN : 2321-4546 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, 2003, pp 48-55. [7] M. Craven, S. Slattery and K. Nigam, “First-Order Learning for Web Mining”, In Proceedings of the 10th European Conference on Machine Learning, Chemnitz, 1998. [8] P. Willet, “Recent Trends in Hierarchical Document Clustering: a Critical Review”, Information Processing and Management, Vol. 24, 1988, pp. 577-597. [9] Sergey Brin, Lawrence Page, “The Anatomy of Largescale Vol. 1, NO. I, 1997, pp. 68-88. Hypertextual Web Search Engine”, In Proceedings of the Seventh International World Wide Web Conference, 1998. [10] J. J. Rocchio, “Document Retrieval Systems – Optimization Evaluation”, Ph.D. Thesis, Harvard University, 1966. [11] Kao, H., Lin, S., Ho, J., Chen, M., “Mining Web Informative Structures and Contents Based on Entropy Analysis”, IEEE Transactions on Knowledge and Data Engineering, Vol.16, Iss.1, 2004, pp.41 – 55. [12] Kiger, J. I. , “The Depth / Breadth Tradeoff in the Design of Menu-Driven Interfaces.”, International Journal of Man-Machine Studies, Vol.20, 1984, pp.201-213. [13] Larson, K., & Czerwinski, M., “Web page design: Implications of memory, structure and scent for information retrieval”, CHI’98: Human Factors in Computing Systems, New York: ACM Press, 1998, pp.2532. US Nat’l Science Foundation Workshop on NextGeneration Data Mining (NGDM), Nat’l Science Foundation, 2002. [20] R. Kosala and H. Blockeel, “Web Mining Research: A Survey,” ACM SIGKDD Explorations, vol. 2, no. 1, 2000, pp. 1–15. [21] F. Sebastini, Machine Learning in Automated Text Categorization, tech. report B4-31, Istituto di Elaborazione dell’Informatione, Consiglio Nazionale delle Ricerche, Pisa, 1999. [22] S. Chakrabarti, “Data Mining for Hypertext: A Tutorial Survey,” ACM SIGKDD Explorations, vol. 1, no. 2, pp. 1–11, 2000. [23] J. Fürnkranz, “Web Structure Mining: Exploiting the Graph Structure of the World Wide Web,” Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), vol. 21, no. 2, 2002, pp. 17–26. [24] J.E. Pitkow and K. Bharat, “WebViz: A Tool for WWW Access Log Analysis,” Proc. 1st Int’l Conf. World Wide Web, Elsevier Science, 1994, pp. 271–277. [25] J. Srivastava et al., “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data,” ACM SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12–23. [26] M Eirinaki and M Vazirgiannis, “Web Mining for Web Personalization,” ACM Trans. Internet Technology, vol. 3, no. 1, 2003, pp. 1–27. [27] Y. Manolopoulos et al., “Indexing Techniques for Web Access Logs,” Web Information Systems, IDEA Group, 2004. [14] Miller, C. S. and Remington, R. W., “Modeling Information Navigation : Implications for Information Architecture”, Human-Computer Interaction, Vol.19, No.3, 2004. [15] Pitkow, J. E. and Pirolli, P. “Mining Longest Repeated Subsequences to Predict World Wide Web Surfing.”, Second USENIX Symposium on Internet Technologies and System,1999. [16] Shneiderman, B., “Designing the User Interface”, Strategies for Effective Human-Computer Interaction, 3rd ed, Reading, MA : Addison-Wesley, 1998. [17] Yan Wang, Web Mining and Knowledge Discovery of Usage Patterns, February, 2000. [18] Margaret H. Dunham, S. Sridhar,” Data Mining: Introductory and Advanced Topics”, Pearson Education, 2006. [19] J. Srivastava, P. Desikan, and V. Kumar, “Web Mining: Accomplishments and Future Directions,” Proc. www.ijacta.com Page 56