International Journal of IT, Engineering and Applied Sciences Research (IJIEASR) Volume 3, No. 2, February 2014 ISSN: 2319-4413
i-Explore International Research Journal Consortium www.irjcjournals.org

A Comparative Study of Various Issues in Web Mining

Monika Pathak, Assistant Professor, Multani Mal Modi College, Patiala, India
Sukhdev Singh, Assistant Professor, Multani Mal Modi College, Patiala, India

ABSTRACT
Owing to the growing amount of data available online, the World Wide Web has become the most widely used and valuable means of retrieving data. However, it is difficult to find relevant information for a particular application because the data is available in different forms and formats. Web mining is one of the most suitable data mining techniques for extracting useful information from the web. In this paper, we discuss the web mining process and its general architecture, and give elementary technical details of the major applications of web mining with suitable examples. The paper explores the technical differences between data mining and web mining on the basis of implementation issues. The objective of the study is to identify the issues and categorise them according to implementation. The issues are grouped into three major categories: social issues, technical issues and law enforcement issues. We further compare these categories to assess the sensitivity of the issues. The paper concludes with a few suggestions for overcoming the technical issues related to the heterogeneous representation of data, and identifies future directions.

Keywords: Web mining, Web Usage Mining, Web Content Mining, Web Structure Mining, Clustering analysis, Pattern analysis.

I. INTRODUCTION
An abundance of information is available over the internet, but it is hard to find relevant information for a particular application. Web mining techniques are used to discover valuable information from web resources. The web is the biggest and most widely used source for extracting and searching information. It consists of billions of interconnected web pages, and by following the links between these pages we can find or share any type of information. The web has become a channel for business, online shopping, and sharing information and opinions with other people anywhere in the world. In this sense the web acts as a virtual society. The actions and operations [2] on the web depend upon the structure of hypertext, which allows web page authors to link their documents with other related documents.

Web mining [3] allows users to extract useful or relevant information from the internet, but it is a challenging task because the data is huge, continuously growing and wide in coverage. Data is present in different forms such as audio, video, text, images, and structured and unstructured records. Web mining is the data mining technique used to automatically discover and extract useful information from the web. It provides many types of information, such as web activity (activity tracking), the web graph [4] (links between pages and people) and web content (data in web pages and documents). Web content mining extracts useful information from the contents of web pages [1]. Web usage mining derives user access patterns from usage logs, that is, it records the clicks made by every user. Web structure mining describes the structure of the web and extracts useful information from hyperlinks.

The information provided by web mining also depends upon factors such as the size of the web, the number of web pages and the number of websites. From a technical standpoint, web mining is a more complicated process than data mining, and the comparison in Table 1 emphasises this.

| Data mining | Web mining |
| Data mining is a process of extracting data from large databases such as Oracle, SQL Server and other database systems. It is also known as knowledge discovery in databases. | Web mining is the application of traditional data mining techniques to the web. It is a technique for collecting data from various web resources such as HTML pages, XML etc. |
| Data mining techniques process large amounts of data from databases where data is available in structured form such as tables, files and views. In tabular form, data is stored in tables that are linked with each other through primary and foreign keys. | Web mining techniques process larger amounts of data than data mining techniques, and the data is available in heterogeneous form, such as data embedded in HTML and XML, with heterogeneous representation of data over the web. |
| Data is stored in a database managed by a database management system (DBMS), and the DBMS authorises users to access the data. That is why data is private and secured. | Data is stored in the form of web pages and web files in the public domain. Data is public and not secured. Registration can be imposed on web resources, which then requires authentication and complicates the mining process. |
| Traditional data mining has levels of explicit structure and representation. | Web mining works on unstructured or semi-structured data from web pages. |
| The user needs access rights to read the data because the data is private. | The user rarely requires access rights to read the data because data on web resources is public. |
| Data is stored in a database and users have restricted access to it. Users cannot share their common interests because there is no interconnectivity between different databases. Different databases may have different schema definitions and cannot share data. | Over the web, data is globally available, and within an organization data is available over the intranet, which acts as a platform for sharing information among different users. With the help of hyperlinks, users can share common interests with each other. |
Table 1: Comparison of complications in data mining and web mining

The present study aims to identify the challenges encountered while performing web mining to extract information from large web-based data, so that the processed information can be utilized for a particular application such as product promotion, SSR etc.

II. WEB MINING
Web mining is the process of extracting useful information from web resources. Many organizations and companies use web mining techniques to extract and share useful information for their business development. Web mining also raises concerns about the security of personal information available on the internet. Different web mining techniques [5] are used to retrieve information from the web, based on web content mining, web structure mining and web usage mining. Web mining is an application of data mining techniques.

Web Mining Process: Web mining is the process of discovering facts from predefined data extracted from web resources. The information of interest may vary according to the needs of a particular application. In general, the web mining process [6] can be expressed through the following steps:
• Find the resource: The web resource is located and the source document is finalised.
• Select type of information: The type of information is finalised so that the selection of information from web resources can be automated.
• Generalize the information: The information is further processed to discover general patterns across websites.
• Analyse the information: The extracted information is validated and interpreted to find patterns. The patterns are processed to represent information.
• Extract the required information: The information is extracted after pattern matching and can be filtered further according to the requirements of a particular application.

Figure 1: Web Mining Process (Finding Web Resources, Select Type of Information, Generalize the Information, Analyse the Information, Extract Required Information)
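The five steps above can be illustrated with a small Python sketch. It is only an assumption-laden illustration, not the authors' implementation: the in-memory pages, the example.org URLs, the choice of visible text as the "type of information", and the use of document frequency as the "pattern" are all hypothetical choices made here for clarity.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def find_resources():
    # Step 1 (find the resource): a hypothetical in-memory corpus stands in
    # for pages that a real system would locate and download.
    return {
        "http://example.org/a": "<html><body><h1>Laptop sale</h1>"
                                "<p>Cheap laptop deals today</p></body></html>",
        "http://example.org/b": "<html><body><p>Laptop reviews and phone reviews</p>"
                                "<script>var x = 1;</script></body></html>",
    }


def select_information(html):
    # Step 2 (select type of information): here, visible text only.
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def generalize(text):
    # Step 3 (generalize): reduce a page to its set of lowercase terms.
    return set(re.findall(r"[a-z]+", text.lower()))


def analyse(term_sets, min_pages=2):
    # Step 4 (analyse): keep terms that occur on at least `min_pages` pages.
    doc_freq = Counter(term for terms in term_sets for term in terms)
    return {term: n for term, n in doc_freq.items() if n >= min_pages}


def extract(patterns, wanted):
    # Step 5 (extract): filter patterns by the application's terms of interest.
    return {term: n for term, n in patterns.items() if term in wanted}


def run_pipeline(wanted):
    pages = find_resources()
    term_sets = [generalize(select_information(html)) for html in pages.values()]
    return extract(analyse(term_sets), wanted)
```

With the two sample pages, `run_pipeline({"laptop", "phone"})` keeps only `laptop`, because it is the only requested term that appears on both pages. A real application would replace each step with something far more robust; the point is only to show how the stages hand data to one another.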
Web mining has many applications which help users to extract useful information and make suitable decisions.

III. ARCHITECTURE OF WEB MINING
Web mining is divided into three basic categories, based on the methods used to access information over the web. Web resources are processed on the basis of content, structure and usage. Figure 2 shows the categories used in the architecture.

Figure 2: Web Mining Categorization

Web Content Mining (WCM): It is also known as text mining. Web content mining [7] is the process of extracting useful and important information from the contents of web pages. Content data may consist of text, images, audio, video, and structured and unstructured tables. Web content mining is mostly used in discovery and tracking, clustering of web pages [8] and classification of web pages. It is used to gather, categorise and provide the best possible information available on the web. In short, web content mining allows search engines to direct users to the websites and web pages that answer their queries. It supplies search engines with results ordered by relevance and provides specific information to the user. It reduces the irrelevant information encountered while navigating the web and gives users results of higher quality.

Web Structure Mining (WSM): Web structure mining [9] deals with web pages and the hyperlinks connecting related pages. It analyses the structure of each page contained in a website and represents the structure of web pages and hyperlinks. With the help of hyperlinks, users share their interests with each other.
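As a rough illustration of treating pages and hyperlinks as a structure, the following Python sketch builds a directed link graph from a few pages and counts incoming links per page. The pages and example.org URLs are hypothetical, and the in-link count is only a simple stand-in for structural analysis, not a method taken from this paper.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_link_graph(pages):
    """Maps each page URL to the set of other pages it links to."""
    graph = {}
    for url, html in pages.items():
        parser = LinkCollector()
        parser.feed(html)
        parser.close()
        targets = set()
        for href in parser.hrefs:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Links that point back into the same page are skipped.
            if absolute != url:
                targets.add(absolute)
        graph[url] = targets
    return graph


def in_link_counts(graph):
    """Counts how many distinct pages link to each target page."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


# Hypothetical three-page site used only for illustration.
PAGES = {
    "http://example.org/index.html":
        '<a href="about.html">About</a> <a href="products.html">Products</a> '
        '<a href="#top">Top</a>',
    "http://example.org/about.html": '<a href="products.html">Products</a>',
    "http://example.org/products.html": '<a href="index.html">Home</a>',
}
```

Calling `in_link_counts(build_link_graph(PAGES))` shows that `products.html` receives links from two pages while the others receive one each, which is the kind of structural signal that link analysis starts from.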
Based on structural information, web structure mining is further divided into two categories:
• Hyperlinks: A hyperlink is a unit which connects a web page with other web pages. A hyperlink [10] which connects to a different part of the same page is called an intra-document hyperlink. A hyperlink which connects two different pages is called an inter-document hyperlink.
• Structure of Documents: The information on the web is available in a structured format and is arranged hierarchically. Web structure mining techniques are used to discover useful knowledge from the structure of links, or the topology of hyperlinks, between web pages. This is useful for determining the most accessed pages. It aims to analyse the way in which different web documents are linked together.

Web Usage Mining (WUM): Web usage mining [11] is the application of data mining techniques to find suitable usage patterns in web usage data in order to understand web-based applications. It is used to understand the user's behaviour and needs with respect to the information available on the website. It provides data about the referring page, the sequence of pages visited, the information held in cookie files and the time the user spent at the site. It is very difficult to track a user through a site because only bits of information such as IP address, user information and site clicks are available. Web usage mining is divided into the following types:
• Data of Web Server Logs: It contains information about name, IP address, access time and resource location.
• Data of Application Server Logs: It contains information about user activities such as IP address and request source.
• Data of Application Level Logs: It maintains information about the user at application level, such as the number of hits on a web page, old references and new references.

We also compare the web mining categories based on the availability of data, the type of data, methods and applications (Table 2).

| Issues | Availability of Data | Type and form of data | Methods | Application categories |
| Web Contents Mining | Structured / Semi Structured | Text / Hyper Text | Statistical and Machine Learning methods | Application based on pattern matching |
| Web Structure Mining | Linked Structures | Linked | Proprietary algorithm | Clustering applications |
| Web Usage Mining | Interactivity | Server log and client log from web browser | Statistical and Machine Learning methods | Marketing and Management applications |

Table 2: Categories of web mining

IV. APPLICATIONS OF WEB MINING
Web mining techniques are used to extract information from the large database of the web so that the extracted information can be used for meaningful tasks. As data keeps being added to the web day by day, a large number of applications have been proposed to make use of the information extracted through web mining. We introduce a few applications with the current scope of research in mind.

• Business Intelligence: Web mining is an application of data mining which helps organizations to promote their business by reducing the cost of products and increasing profits by selling more products. Many types of customers and their profiles are available on the internet, which helps companies to provide services to customers according to their needs. Web mining is a powerful technique which collects information about customers' activities on a website, helps in the decision-making process and predicts customers' behaviour. This helps companies to develop new products and services.
• Modification of Website: The structure and data of a website are important with respect to customers' preferences.
Normally, different types of users have different priorities, knowledge and preferences, which makes it difficult to find a design suitable for all of them. In this case, web usage mining is used to identify the types of users accessing the website according to their preferences and knowledge, and to design the website based on users' priorities.
• Improvement in System: Web mining analyses web traffic and the navigation paths of users, which helps to improve the performance and services of websites, for example through caching and load balancing. Navigation paths are also used in the detection of fraud, break-ins etc.
• Personalization of Web: It is an attractive application area which helps web-based companies by allowing them to deliver recommendations, marketing campaigns etc., automatically and in real time when the user accesses the website.

V. COMPARATIVE ANALYSIS OF ISSUES IN WEB MINING
The present study focuses on identifying the issues related to the different phases of the web mining process and categorising them so that they can be explored with relevant suggestions. The issues fall into the following categories: Social Issues, Technical Issues and Legal Issues.

Social Issues: The social issues cover privacy and the optimum use of data, along with the reliability of the data.
• Privacy Issues by Web Mining: Web data mining involves the use of personal data on the web. The security and privacy of users' personal data on the web is an important issue. Privacy is violated when a user's personal information is obtained and used without the user's knowledge. Extracting a user's personal data through web mining in this way is therefore a privacy violation.
• Optimum Extraction of Data: A large amount of data is available on the web, and data is duplicated at different locations, which raises issues of reliability of data. A mechanism is required to extract an optimum form of the data so that relevant information can be gathered.
• Reliability: The World Wide Web is an open global system for sharing information, which raises issues of reliability of data. The reliability issue covers the validity of data, the validity of the source and the consistency of data. It calls for policies that ensure the validity and consistency of data available at multiple web resources.

Technical Issues: The technical issues cover the issues related to the implementation of the web mining process and to the analysis of patterns to discover information.
• Segmentation of Web Page: In web-based applications, a web page contains information in heterogeneous form, together with additional advertisements and commercials. The objective of any web mining tool is to extract the main contents of the web page and ignore additional information such as external links, copyright information and advertisement notes. These issues relate to segmentation [12] of the web page, which requires state-of-the-art techniques.
• Noisy Information: The information available on the web is noisy. The noise arises for two main reasons. First, a web page contains many kinds of information, such as the main content of the page, advertisements, navigation links and privacy policies. Only a small portion is useful and the rest is considered noise. Second, the web has no control over information, which means anyone can publish any type of information of his or her choice, and a large amount of it is of very poor quality.
• Knowledge Synthesis: Knowledge synthesis [13] is one of the burning issues of web mining, in which we need to specify a hierarchy for information extracted from multiple resources. The ontology of information should be generalised to cater for heterogeneous information gathered from multiple resources. The information should be synthesised in such a way that it is presentable and correlated with the contents.
• Heterogeneous Information: Multiple pages on the web present similar information in different words or formats.
This makes the integration of information from multiple pages a challenging problem in web mining.
• Integrity: The law of integrity requires that data available over the web should be correct and consistent. A web mining tool collects data from the web, but checking integrity is still a manual job, as the automation of integrity rules cannot be generalised over heterogeneous data [14].

Figure 3: Web Mining Issues Classification

Law Enforcement: The issues covered in the law enforcement category concern law policies, authentication and crime detection.
• Law Policies for Data Sharing: The increase of sensitive information on the internet has resulted in criminal activities such as piracy and the leakage of sensitive organizational information on the web. Every organization should have certain law policies regarding the sharing of information over the internet.
• Authentication: Information on the web is available globally, which allows the misuse of sensitive information in anti-social activities. There must be some mechanism to identify the user of the information so that he or she can be tracked to prevent criminal activities.
• Crime Detection: Information on the web in the form of emails and documents can be processed through web mining to find evidence and clues.

VII. CONCLUSIONS
Web mining is an application of data mining techniques to discover useful information from the World Wide Web. Data mining extracts information from databases, whereas web mining discovers data from the web. The data can be text, images, audio, video, tables etc. Web mining ranks websites, which helps organizations to find out users' behaviour, needs, preferences etc., so that they can promote their products properly and gain maximum profit. However, this technique suffers from various problems, such as poor quality of data due to the noisy information provided by many websites. Users' personal data is available on web resources and anyone can use this data without the users' knowledge, which creates privacy issues. In this paper, we have categorised different issues into three major categories, namely social, technical and law enforcement issues. Further, we have conducted a comparative analysis of these issues. It has been observed that most of the problems related to noisy data are due to the heterogeneous structure of web resources, which can be sorted out if web resources use a uniform format of information representation such as XML structures.

REFERENCES
[1] D. Fetterly, M. Manasse, M. Najork, and J. Wiener, "A Large-Scale Study of the Evolution of Web Pages", in Proceedings of the 12th International World Wide Web Conference, pp. 669–678, 2003.
[2] R. Kosala, "Web Mining Research: A Survey", ACM SIGKDD Explorations, Vol. 2, pp. 1-15, 2000.
[3] R. Kosala, H. Blockeel and F. Neven, "An Overview of Web Mining", in J. Meij, editor, Dealing with the Data Flood: Mining Data, Text and Multimedia, pp. 480–497, STT, Rotterdam, 2002.
[4] H. Ino, M. Kudo, and A. Nakamura, "Partitioning of Web Graphs by Community Topology", in Proc. of the 14th Intl. Conf. on World Wide Web, pp. 661–66, 2005.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[6] R. Kosala, H. Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, Vol. 2, pp. 1-15, 2000.
[7] Raymond Kosala, Hendrik Blockeel, "Web Mining Research: A Survey", ACM SIGKDD Explorations Newsletter, Vol. 2, 2000.
[8] B. Mirkin, "Clustering for Data Mining: A Data Recovery Approach", Chapman & Hall/CRC, April 29, 2005.
[9] P. Ravi Kumar, Ashutosh Kumar Singh, "Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval", American Journal of Applied Sciences, Vol. 7, pp. 840-845, 2010.
[10] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46 (5), pp. 604–632, 1999.
[11] R. Cooley, "Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data", PhD thesis, University of Minnesota, 2000.
[12] K. Lerman, L. Getoor, S. Minton, and C. Knoblock, "Using the Structure of Web Sites for Automatic Segmentation of Tables", in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'04), pp. 119–130, 2004.
[13] R. Cooley, B. Mobasher, and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns", Knowledge and Information Systems, pp. 5–32, 1999.
[14] W. W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity", in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 201–212, 1998.