Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com Web Mining and Data Mining: A Comparative Approach 1 1,2 Simranjeet Kaur, 2Kiranbir Kaur Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar, India Abstract: Search engine is used to retrieve data from the web in response to user’s queries. Web mining is a technique of data mining, which is used to discover data from the web documents. With the help of Web mining, the user obtains the required information accurately. Web mining is categorized into three types: web content mining, web structure mining and web usage mining. In this paper, difference of three categories of web mining and comparison of web mining and data mining is also presented. Data mining helps to extract data from database, whereas data is extracted from web in web mining. Keywords: Web Mining, Data Mining, Web Structure Mining, Web Content Mining, Web Usage Mining. I. INTRODUCTION As the World Wide Web is very enormous and contain all sort of information. With the growing information needs and sources, it becomes necessary to manage the information and to display most relevant information to the user screen through search engine. Users extract information from www with the use of search engine as information retrieval tool. Commonly used search engines are Google, Yahoo, Bing, etc. They index, download and store billions of web pages, and act as content aggregators because they keep every record of web pages available on the WWW [1]. Figure 1: Concept of search engine [1] There are primary two kinds of search services available on the internet [2]: A. Organic Search Engine: To index websites into large database, automated programs known as “spiders” or “robots” are used. Spider index entire website when user submits page to organic search engine. Examples are Google, MSN, Yahoo, etc. B. Search Directories: It include list of categories reviewed by human that rely on site owner’s submission. It’s all reviewer responsibility to describe the site in what text. Examples include Open Directory (Google Directory), Yahoo Directory, Look smart and others. Page | 36 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com II. OVERVIEW OF SEARCH ENGINE OPTIMIZATION Search Engine Optimization (SEO) technique is a part of Search Engine Marketing (SEM), to increase the traffic to websites. It helps the spider program to find the relevant webpage and provide it higher rank on the web, by analyzing the webpage content, keywords and webpage code [3]. III. EVOLUTION OF WEB MINING Web Mining process began when the business data started storing on computers and internet and continued with improvements in data access, and with use of WWW. Data is collected from various sources like surveys, networked locations through the use of computers, and used as required [7]. Web mining is a revolutionary process, and the user gets the answer quickly and accurately. The evolution of web mining is shown below in table 1: Table I: Evolution of web mining [7] Evolutionary Steps 1960s Data Collection 1980s Data Access 1990 Data Warehousing and Decision Support Technologies Product Providers Business Need Characteristics Computers, disks CDC, IBM To know total Delivery of static and tapes revenue in last years data SQL, RDBMS, Oracle, IBM , To know about unit Dynamic data ODBC Sybase, Microsoft sales delivery Multi-dimensional Microstrategy, To know about unit Dynamic data database, data Arbor, Congnos, sale and also to take delivery at multiple warehouses, OnPilot decision on it levels Line Analytic Processing Multiprocessor IBM, SGI, Pilot, What to do with data Prospective 2000s computers, advanced Lockheed and why information delivery Data Mining algorithms, Massive databases WWW, Internet, IBM, Web Trends, To know about Affordable tool to Emerging Today monumental scale Net Genesis, business in past or mine large data Web Mining database, learning Rockware, Apecto future warehouse and technology like Limited, Heckyl relational databases supervised like rule Technologies efficiently and fast generalization and using multiple unsupervised like mining functions, pattern mining Powerful Web Mining is a data mining technique that helps to extract or discovers information from the web documents. Web data can be in the form of web pages like text and images, inter-page structure i.e. linkage structure between web pages, intra page structure i.e. include XML or HTML tags and about the usage of data on the internet [5]. IV. WEB MINING PROCESS The process of extracting knowledge from web is as follows [4]: Figure 2: Web Mining Process [4] Page | 37 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com In Web mining, data mining is used to extract the hidden data in web log (usage data). The subtasks of web mining are shown in figure 3: Figure 3: Subtasks of web mining [7] V. WEB MINING CATEGORIES Web Mining is categorized into three types: web content mining, web structure mining and web usage mining as shown in figure 4. Figure 4: Categories of Web Mining [7,9] A. Web Content Mining It is the process to retrieve information from web into structured form and index the information, and extract it quickly. Its main focus is on the structure on the inner document and it consists of text, images, audio, video or structured records such as tables and lists [6]. Page | 38 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com B. Web Structure Mining It is a process to model the linking structure of web pages. Its main motive is to generate structured summary of web pages and websites [6]. This is carried out at the hyperlink level (inter-page) and at document level (intra-page). This type of mining is useful for retrieving information [4]. Figure 5: Web graph structure [8] Web structure mining consists of three categories [9]: Link Structure: It include link based cluster analysis, link based classification, link type and link strength. Internal Structure Mining: It provides the information about ranking of pages, and to enhance the search results by discovering the model underlying the link structure. It is also helpful to analyze the relationship between different websites. URL Mining: In this category, hyperlink is used to connect a web page to different locations, either within same or different web page. C. Web Usage Mining: This type of mining is used to discover useful information and navigation patterns from the web present in server logs, agent logs, referrers log, client-side cookies, Meta data and user profile. It introduces privacy concern [8]. Figure 6: Web usage mining process [8] When the user interacts with web, it is used to predict user behavior [7]. It consists of three phases: Preprocessing: Proxy server and server is the first way used to retrieve raw data from the web and process it, and it makes automatic transformation of original raw data [9]. Pattern Discovery: After discovering the knowledge from preprocessing, techniques are implemented to discover the knowledge such as machine learning, data mining procedures and other related procedures [9]. Pattern Analysis: After the stage of pattern discovery, pattern analysis work is to check the pattern on the web and its implementation to extract the knowledge from the web [9]. Page | 39 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com Table II: Categories of Web Mining [6, 9] Web Mining Categories Specification View of data Web Content Mining Structured Semi-structured Unstructured Primary Text document Hypertext document Concepts or ontology Relational Edge labeled graph n-grams Terms, Phrases Machine learning Association rules Proprietary algorithm Statistical method It tells about the discovery of useful information from web documents/contents Web Structure Mining Structure linkage Web Usage Mining Interactive data Primary Link structure Secondary Browser logs Server logs Relational table Graph Scope Type of data used Main data Representation Graph Proprietary algorithm Machine learning Statistical method It discover the model underlying web’s link structure Its work is to make sense of the data generated by web surfer’s behavior The scope of data is local in DB and is global in IR. Global Global Goal It mainly targets to discover knowledge To generate structural summary about the web page and web site [10] To analyze the behavioral pattern and users profile while interacting with web sites Application areas Clustering Categorization Finding extraction rules and patterns User modeling Finding frequent subschemas Data/Information extraction Opinion extraction from online sources Segmenting web pages and noise detection [10] Clustering Categorization User modeling Site construction Marketing Adaptation and Management Not all pages have relevant meta information Entire text of predecessor page is not relevant Preprocessing challenges about the users i.e. who will be user, how long it stay, where will it go and user view? Method Tasks Challenges VI. COMPARISON OF WEB MINING AND DATA MINING Data Mining (also called data or knowledge discovery) is the process to find hidden information or patterns in databases [11], and summarizing it into useful information. It is also the process to find co-relations among large number of fields in huge relational databases [17], whereas Web Mining is a technique that is used to retrieve the documents from the web. Data Mining includes the five major steps: extract, store and manage data, provide data access, analyze the data and present it in a useful form, like in the form of table or graph [17]. Page | 40 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com Table III: Comparison of Web Mining and Data Mining [19] Web Mining VS Data Mining Comparison Web Mining Data Mining Definition Process used to extract information from web documents. Process used to extract hidden information from the database. Scale It contains 10 million jobs in server database, and therefore search processing is not big. It contains 1 million jobs in database and search processing is large. Structure The information is obtained from structured, semi-structured and unstructured web forms. It gets the information from wide database. It obtains the information from explicit structure. It is not able to get all the information from wide database as compared to web mining. Access Data is accessed publicly. In this, data is not hidden in web database and only permission is required to access the data from web log master. Data is accessed privately and only authorized user can access the data. Data It works upon on-line data. It works upon off-line data. Data Storage Data is stored in server logs and web server database. Data is stored in data warehouses. Application Areas E-learning, Digital Libraries, EGovernment, Electronic Commerce, E-Politics, E-Democracy, Security & Crime Investigation, Electronic Business [13]. Banking, marketing, manufacturing & production, health-care, insurance, law, airlines, computer hardware & software, government & defence, etc [12]. Disadvantages url’s can be tracked to access the data, multiplicity of events and url’s, large amount of data remain unused Privacy issues, security issues, misuse of information/ inaccurate information [12]. Challenges Complexity of web pages, web is too huge, relevancy of information, web is dynamic information source, diversity of user communicates, etc [14]. Network settings, data quality, privacy preservation, scalability, complex and heterogeneous data, etc [12]. Techniques Web Content Mining, Graph Based Web Mining, Utilization in Web Mining, Text Mining and many others [15]. Artificial Neural Network, Decision Trees, Rule Induction, Nearest Neighbor Method and many others [16]. VII. CONCLUSION The need to access and manage data on the web is increasing day by day. Web mining enhances user's ability to access information from www and finds its applications in various fields like Clustering, Categorization, in extraction rules and patterns and many others. In the comparison of web mining with data mining, it is concluded that web mining is used to retrieve online data, and data mining retrieves offline data. Data is stored in server database in web mining and it can handle multiple transactions at the same time. Data can be discovered and extracted from multiple locations of the world by sitting at one location and is able to provide desired information at the time of requirement. Page | 41 Novelty Journals International Journal of Novel Research in Computer Science and Software Engineering Vol. 2, Issue 1, pp: (36-42), Month: January - April 2015, Available at: www.noveltyjournals.com REFERENCES [1] N. Duhan, A.K. Sharma and K. K. Bhatia, “Page Ranking Algorithms: A Survey”, IACC 2009 [2] Clifton and N. Rae, “How Search Engine Optimization Works”, SEO Whitepaper [3] Z. Hui, Q. Shigang, L. Jinhua and C. Jianli, “Study on Website Search Engine Optimization”, International Conference on Computer Science and Service System, 2012 [4] N. Tyagi and S. Sharma, “Comparative study of various Page Ranking Algorithms in Web Structure Mining”, IJITEE, ISSN: 2278-3075, Volume-1, Issue-1, June 2012 [5] Jain, R. Sharma, G. Dixit and V. Tomar,” Page Ranking Algorithms in Web Mining, Limitations of Existing methods and a New Method for Indexing Web Pages”, International Conference on Communication Systems and Network Technologies, 2013 [6] R. Jain and Dr. G.N Purohit, “International Journal of Computer Applications (0975-8887) Volume 13-No.5, January 2011 [7] K. Sharma, G. Shrivastava and V. Kumar, “Web Mining: Today and Tomorrow”, IEEE, 2011 [8] M. T. Ramakrishna, L. K. Gowdar, M. S. Havanur and B. P. M. Swamy, ”Web Mining: Key Accomplishments, Applications and Future Directions”, International Conference on Data Storage and Data Engineering, 2010 [9] B. Rathod and Dr. S. Khanna, “A Review on Emerging Trends of Web Mining and It’s Applications”, IJEDR, ISSN: 2321-9939 [10] Web Content Mining Problems/Challenges, available at, “https://sites.google.com/site/assignmentssolved /mca/ semester6 /mc0088/12” [11] N. Padhy, Dr. P. Mishra and R. Panigrahi, “The survey of Data Mining Applications And Future Scope”, International Journal of Computer Science, Engineering and Information Technology, Vol.2, No.3, June 2012 [12] S. H. Begum, “Data Mining Tools and Trends – An Overview”, International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, February 2013 [13] S. Yadav, K. Ahmad and J. Shekar, “Analysis of Web Mining Applications and Beneficial Areas”, IIUM Engineering Journal, Vol. 12, No. 2, 2011 [14] Data Mining – Mining World Wide Web available at “http://www.tutorialspoint.com/data _mining/dm_ mining_ www.htm” [15] P. Bhisikar and A. Sahu, “Overview on Web Mining and Different Technique for Web Personalization”, In International Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 3, Issue 2, March-April 2013 [16] An Introduction to Data Mining, available at, “http://www.thearling.com/text/dmwhite/dmwhite.htm” [17] Data Mining: What is Data Mining? Available at,”http://www.anderson.ucla.edu/faculty/jason. frand/teacher/ techno -logies/palace/datamining.htm” [18] Rastogi, S. Gupta, S. Agarwal, N. Agarwal, ”Web Mining: A Comparative Study”, International Journal of Computational Engineering Research, ISSN: 2250-3005, Vol. 2, Issue No. 2, Mar-Apr 2012 [19] R. Sharma, “A Framework to Compare Web Mining Types”, In International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277-128X, Volume 3, Issue 7, July 2013 Author Profile: Simranjeet Kaur belongs to Amritsar, Punjab (India) and her Date of Birth is 15 December, 1989. She is B.Tech (IT) and pursuing M.Tech (Software System) from Guru Nanak Dev University in the Department of Computer Science and Engineering, Amritsar, India. Her interest area is search engine optimization and web information retrieval. She completed this paper under the guidance of Kiranbir Kaur, Assistant Professor in Guru Nanak Dev University, Department of Computer Science and Engineering, Amritsar, India. Page | 42 Novelty Journals