Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Volume 2, Issue 2, March-April 2012 Available Online at www.gpublication.com/jcer ISSN No.: 2250 - 2637 ©Genxcellence Publication 2011, All Rights Reserved RESEARCH PAPER Web Mining: An Introductory approach to Increase the Performance of Web Site Using Mined pattern Arun Kumar Singh*1 and Manish Kumar Singh2 1 Associate Professor IIMT Engineering College Meerut (U.P.) India [email protected] 2 Assistant Professor IIMT Institue of Engineering & Technology Meerut (U.P.) India [email protected] Abstract The World-Wide-Web contains a large amount of information. Everyone can store and retrieve the information from web. It is difficult to find the relevant piece of information from web. Extracting the important information from web is called Web Mining. Web mining technologies are best suited for web information extraction and information retrieval. Web mining is one of the mining technologies, which applies data mining techniques in large amount of web data to improve the web services. We are going to give a brief description of web mining and its categorization namely: web content mining, web structure mining and web usage mining. This paper also reports the web data mining with applications. Keywords Web Mining, Information Extraction, Information Retrieval, Web content mining, Web structure mining, Web usage mining and Web crawling INTRODUCTION The World Wide Web is a popular and interactive medium to disseminate information today. With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. The World Wide Web provides a vast source of information of almost all types, ranging from DNA databases to resumes to lists of popular multiplexes. Web has a large amount of data and it is not easy task to find out the content or information of our interest. Web mining is one of the techniques to solve such kind of problem. We are not saying that this is the only technique, a no. of technique are namely Machine Learning, Natural Language Processing etc. Due to the large availability of data the World Wide Web, it has become very important for users to use automated tools to find the desired information resources. Information Retrieval is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant as possible. Information extraction aims to extract relevant facts from the documents while aims to select relevant documents [1]. As shown is Figure (1) YAHOO, GOOGLE and MSN are search engines, used to extract the information from web. The extracted information may be relevant but also contain less relevant, and some time irrelevant information. documents, usage logs of web sites, etc. to improve the web services[2]. Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data. A natural combination of Data Mining and World Wide Web may be referred to as Web Mining. Figure (1) Web mining is the Data Mining technique that automatically discovers or extracts the information from web documents. It consists of following tasks [3]: Resource finding: It involves the task of retrieving intended web documents. It is the process by which we extract the data either from online or offline text resources available on web. It includes information retrieval and extraction from web pages. .Information selection and pre-processing: It involves the WEB MINING automatic selection and pre processing of specific Web mining is the application of data mining techniques information from retrieved web resources. This process to extract useful information and knowledge from web transforms the original retrieved data into information. data, including web documents, hyperlinks between Making web data suitable for mining is preprocessing. 9 Please Cite this Article at: Arun Kumar Singh et al, Journal of Current Engineering Research, 2 (2),March-April 2012,09-12 Generalization: It automatically discovers general patterns at individual web sites as well as across multiple sites. Data Mining techniques and machine learning are used in generalization. Result displayed in a web search is aggregation of multiple web documents. Analysis: It involves the validation and interpretation of the mined patterns. It plays an important role in pattern mining. A human plays an important role in information on knowledge discovery process on web. Figure (2) Website structure can be easily understood from figure (2). As we know that website is a collection of related web pages containing images, videos or other digital assets. In figure (2) A, B, C, D, E is different pages of website. It is clear that if hyperlinks are available then we can easily move between pages. In Web mining website structure is also important. If the website structure is complex then it will take less time to move between pages. WEB MINING CATEGORIZATION Web Mining can be broadly divided into three distinct categories, according to the kinds of data to be mined as shown in Figure (3). 1. Web content mining, 2. Web structure mining 3. Web usage mining. In the database view of Web content mining, we are interested in the structure within the web documents (intra- document structure), while in Web structure mining we are interested in the structure of the hyperlinks within the web itself (inter-document structure). Web content mining is mainly related to the content (mainly text) structure extracted from web page. Web structure mining mainly deals with the structure of the web i.e. hyperlink structure and link analysis. Web usage mining discovers interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web content mining Web pages consist of different types of data attributes, such as text, image, audio, video, meta-data, hyper-link and others. Web content mining is the technology that discovers web characteristics and properties from various data-types and attributes values. Web content mining focuses on the discovery of knowledge from the content of web pages. At the time of retrieving information web users try to find out only web pages that interests him from the huge amount of available pages. Current search tools suffer from low precision due to irrelevance of many of the search results [4] and low recall due to inability to index all the information available on the web. Search engines aren’t able to index all pages resulting in imprecise and incomplete searches due to information overload. Data mining and knowledge discovery in web pages and semi-structured documents are extensively studied by many researchers, and web/text mining targets are classified into three major application categories: 1. Un-Structured documents 2. Semi-Structured documents 3. Structured documents The documents written in HTML are Semi-Structured and in XML are Structured, while the text document are UnStructured in nature. It is difficult to discover appropriate knowledge from image and multimedia web contents. Figure (3) 10 Please Cite this Article at: Arun Kumar Singh et al, Journal of Current Engineering Research, 2 (2),March-April 2012,09-12 Web structure mining Web structure mining is an approach based on directory structures and web graph structures of hyperlinks [5]. Web structure mining is closely related to analyzing hyperlinks and link structure on the web for information retrieval and knowledge discovery. Web structure mining deals with the connectivity of websites and the extraction of knowledge from hyperlinks of the web. The WWW can be viewed formally as digraph with Web nodes and arcs, where the Web nodes correspond to HTML files having page contents and the arcs correspond to hypertext links interconnected with the Web pages. There are two major link-based search algorithms, HITS (Hypertext Induced Topic Search) and PageRank. HITS uses of concept of hub and authority. Hubs are pages contain set of hyperlinks. Authorities are pages that contain useful information about the query. Both types of pages are typically connected: good hubs contain pointers to many good authorities, and good authorities are pointed to by many good hubs. The main drawback of this algorithm is that the hubs and authority score must be computed iteratively from the query result, which does not meet the real-time constraints of an on-line search engine. It couldn’t be implemented in a real time search engine [6]. PageRank and Weighted PageRank were given to overcome HITS’s problem. Internal crawler An internal crawler views the webpage of a single given website and performs focused (page) crawling within that website. The web pages generated by the internal crawlers are more reliable [11]. External crawler An external crawler has a more abstract view of the web as a graph of linked websites. Its task is to select the websites to be examined next and to invoke internal crawlers on the selected sites. External crawler orders unknown websites and initiates an internal crawl on the first page of the website only. APPLICATIONS Web mining can be used in a no. of applications. Google Search engine uses Page Ranking algorithm to index its collection of web pages. Some of the web mining applications are following i) Web Search- Google ii) Web Mining in Distance Education Platform iii) Terror tracking [12] iv) On-line communities (AOL) v) Personalized Portal for the Web – My Yahoo CONCLUSION Web Usage Mining Web usage mining focuses on techniques that could predict the behavior of users while they are interacting with the WWW. Web usage mining collects the data from Web log records to discover user access patterns of Web pages. It can discover the browsing patterns of user and some kind of correlations between the web pages. Web usage mining provides the support for the web site design. Typical sources of data in web usage mining are automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies, ands user profiles. Web Usage Mining is to mine data from log record on web page. Log record lots useful information such as URL, IP address and time and so on. Analyzing and discovering logs could help us to find more potential customers and trace service quality and so on [7]. A Web usage mining system performs five major tasks [8]: i) Data gathering ii) Data preparation iii) Navigation pattern discovery iv) Pattern analysis and visualization v) Pattern applications. WEB CRAWLING All search engines available on the internet need to traverse web pages and their associated links, to copy them and to index them into a web database. The programs associated with the page traversal and recovery is called crawlers [9]. The crawling task is divided into two major subtasks corresponding to the two different levels of abstraction [10]: We provided a survey on web mining focusing on current web mining research. In this paper, we present a preliminary discussion about Web mining, including the definition and web mining classification. We also discussed about web structured mining techniques. We have seen that crawler is an important program associated with every search engine, the main decisions associated with the crawlers algorithms are when to visit a site again (to see if a page has changed) and when to visit new sites that have been discovered through links. REFERENCES [1] Raymond Kosala, Hendrik Blockeel, Web Mining Research: A Survey, SIGKDD Explorations, ACM SIGKD July 2000. [2] O.etzioni. The world wield web: Quagmire or Gold Mining. Communicate of the ACM, (39)11:65-68, 1996. [3] Rekha Jain, Dr. G. N. Purohit, Page Ranking Algorithms for Web Mining, International Journal of Computer Applications (0975 – 8887) Volume 13– No.5, January 2011. [4] Masashi Toyoda, Masaru Kitsuregawa What’s Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots, WWW 2006, May 23–26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005. [5] Hiroyuki Kawano, Yoshida Hommachi Sakyo-ku, Applications of web mining – from web search engine to P2P filtering. [6] Johannes Fürnkranz, WEB MINING, TU Darmstadt, Knowledge Engineering Group [7] Hong T, Chiang M, Wang S H, "Mining weighted browsing patterns with linguistic minimum supports", 2002 IEEE International Conference on Systems, Man and Cybernetics, 2002,Yasmine Hammamet, Tunisia, pp. 635-639. 11 Please Cite this Article at: Arun Kumar Singh et al, Journal of Current Engineering Research, 2 (2),March-April 2012,09-12 [8] Wen-Chen Hu, Xuli Zong, Chung-wei Lee and Jyh-haw Yeh, World Wide Web Usage Mining Systems and Technologies. [9] Animesh Tripathy, Prashanta K Patra, A Web Mining Architectural Model of Distributed Crawler for Internet Searches Using PageRank Algorithm. [10] Martin Ester, Hans-Peter Kriegel,Matthias Schubert, Accurate and Efficient Crawling for Relevant Websites [11] Satyajeet Nimgaonkar and Suryaprakash Duppala, A Survey on Web Content Mining and extraction of Structured and Semi structured data [12] T.Anand, S.Padmapriya and E.Kirubakaran Terror Tracking Using Advanced Web Mining Perspective lAMA 2009 12