Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Research in Computer and ISSN (Online) 2278- 5841 Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156 A Review of Various Techniques of Web Content Mining For HTML and XML Contents Rupinder Kaur1 Kamaljit Kaur2 Student of M.Tech Department of Computer Science Engineering Sri Guru Granth Sahib World University Fatehgarh Sahib, Punjab, India Assistant Professor Department of Computer Science Engineering Sri Guru Granth Sahib World University Fatehgarh Sahib, Punjab, India Abstract— World Wide Web is the largest source of information. Most of the data on the web is dynamic and is in unstructured form. It is becoming difficult to get the relevant data from the web. Data Mining is the field of computer science which is used to extract knowledge from very large amount of data. Web mining is the application of data mining, which implements various techniques of data mining to get the efficient knowledge from the web data. This paper presents an overview of various techniques that has been used for web content mining including images, audio, video and semi-structured contents like HTML and XML. Since HTML has many limitations like limited tags, not case sensitive and designed to display data only, Web developers has started to develop Web pages on emerging Web Technologies like XML, Flash etc. XML was designed to describe data and to focus on what the data is. XML also plays the role of a meta- language and allows document authors to create customized markup language for limitless different types of documents, making it a standard data format for online data exchange. efficiently to fetch web data. It is very wide area of research for the researchers because of the growing use of the web. Moreover, most of the data on the web is in unstructured form i.e. in the form of free text, images, audio, video and semi-structured documents like HTML and XML etc. Since HTML has many limitations like limited tags, not case sensitive and designed to display data only, Web developers has started to develop Web pages on emerging Web Technologies like XML, Flash etc. [4] XML was designed to describe data and to focus on what the data is. XML also plays the role of a meta language and allows document authors to create customized markup language for limitless different types of documents, making it a standard data format for online data exchange. This growing use has raised need for better tools and techniques to perform mining on XML too. [3] This paper reviews the various techniques that have been used to mine the XML and HTML data on the web. [email protected] Keywords - Data Mining, Web Mining, Web Content Mining, XML, Vector Space Model, FP-tree. a) I. INTRODUCTION Web is the largest source of information. As the number of documents grows, searching for information is b) turning into a cumbersome and time consuming operation. Data Mining is the field of computer science which is used to extract information from the large amount of data. Web Mining is the application of data mining which is used to generate patterns from the web. Patterns must be such that they are easily understandable, useful and novel. Various c) techniques of data mining are used to extract the information from the web. Not only data mining but also other tools from fields of artificial intelligence, machine learning, natural language processing can also be used www.ijrcct.org [email protected] Web Mining on the basis of type of data to be explore can be classified into three main categoriesWeb Content Mining Web content mining is the process of mining of contents of a web page. Contents of a web page may include text, image, audio, video and semi-structured records in the form of html and xml contents which are either embedded in the web page or having links to other pages. Web Structure Mining The Web page structure consists of a Web page as a node and hyperlinks as edges connecting to other pages. [6] In other words, it works in the form of graphs. It focuses on the connectivity of the web site to other sites that are called as hyperlinks. Web Usage Mining It is the mining of the user preferences while user is navigating through the websites. This is done by applying the mining process on the log files repository. [6] By web usage mining, commercial websites take advantage of Page 669 International Journal of Research in Computer and ISSN (Online) 2278- 5841 Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156 knowing the usage pattern of customer, their behaviour and frequency of their visits. II. WEB CONTENT MINING Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page contents. Content mining is the scanning and mining of text, pictures and graphs of a Web page to determine the relevance of the content to the search query. With the massive amount of information that is available on the World Wide Web, content mining provides the results lists to search engines in order of highest relevance to the keywords in the query retrieval view. [10] The use of the Web as a provider of information is unfortunately more complex than working with static databases. Because of its very dynamic nature and its vast number of documents, there is a need for new solutions that are not depending on accessing the complete data on the outset. Another important aspect is the presentation of query results. Due to its enormous size, a web query can retrieve thousands of resulting webpages. Thus meaningful methods for presenting these large results are necessary to help a user to select the most interesting content. Various web contents that can be mined are text, image, audio, video and structured or semi-structured records. For the semi-structured records all the work utilizes the HTML and XML structure of the web page. These are explained as belowA. Animation These can be created using GIF images or using Flash, Ajax, or other animation tools. B. Images Images are the most common way to add multimedia to websites. One can use photos or clip art or even drawings. C. Sound Sound is embedded in a Web page so that readers hear it when they enter the site or when they click a link to turn it on. D. Video Video is getting more and more popular on Web pages. But it can be challenging to add a video so that it works reliably across different browsers. [13] E. Hyper Text Markup Language Hyper Text Markup Language by Berners-Lee in mid-1993 is still the dominant way whereby we share content. And while there are many web pages with localized proprietary structure (most usually, business websites), many millions of websites abound that are structured according to a common core idea of HTML. Blogs are a type of website that contains mainly web pages authored in html. Search engine sites are composed mainly of html content, but also have a typically structured approach to revealing information. Discussion boards are sites composed of "textual" content organized by html or some variation that can be viewed in a web browser. Ecommerce sites are largely composed of textual material and embedded www.ijrcct.org with graphics displaying a picture of the item(s) for sale. However, there are extremely few sites that are composed page-by-page using some variant of HTML. [14] F. Xtensible Markup Language In an attempt to standardize the format of data exchanged over the Web and to achieve interoperability between the different technologies and tools involved, World Wide Web Consortium (W3C) introduced eXtensible Markup Language (XML). XML is a simple but very flexible text format derived from Standard Generalized Markup Language (SGML), and has been playing an increasingly important role in the exchange of wide variety of data over the Web. Even though it is a markup language much like the Hyper Text Markup Language (HTML), XML was designed to describe data and to focus on what the data is, whereas HTML was designed to display data and to focus on how the data looks on the Web browser. [3] A data object described in XML is called an XML Document. XML also plays the role of a metalanguage, and allows document authors to create customized markup languages for limitless different types of documents, making it a standard data format for online data exchange. This growing usage of XML has naturally resulted in the increasing amount of available XML data which raises the pressing need for more suitable tools and techniques to perform data mining on XML documents and XML repositories. III. RELATED WORK Various techniques have been used to fetch the web data till now. Besides semi-structured contents of a web page, the unstructured contents like images, audio, videos etc. are fetched using different algorithms. These are reviewed in [12] with their various pros and cons. Review of techniques that have been used for XML content mining are explained as under: A. XML Algebras for Data Mining The prerequisite of designing XML mining algebras is that all generalized knowledge must be expressed in XML, i.e. input and output of operators must have the same form. Extension of existing XML algebras by adding some mining operators might work well as a mining algebra. Each operator is associated with an algorithm for the mining task. [5] A further development is to decompose mining tasks into several sub-tasks. These sub-tasks are shared among different mining tasks. All operators should have the following properties: 1. The operation of each operator should be clearly defined and easily implemented. 2. The input and output of each operator should be in the same form. The ultimate goal of XML algebras is to provide operational procedure to obtain answers of user’s queries. Advantages and Disadvantages: How to decompose the mining tasks into sub-tasks is still an open problem, and it is largely dependent on the nature of the mining tasks. Page 670 International Journal of Research in Computer and ISSN (Online) 2278- 5841 Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156 B. Vector Space Model The VSM is a special type of kernel methods for textual data. It transforms a document into a numerical vector, each element of which is an indicator whether or not the document contains a certain word (or phrase). A VSM can be represented either in a boolean vector model, or in a term weighting model, which has scalar values that take into account the appearance frequency of a term in a set of documents. [5] A widely used weighting method is ’Term Frequency-Inverse Document Frequency (TF-IDF)’, in which the weight of the ith term in the jth document, denoted as wij, is defined by wij = TFij × IDFi = TFij × log (n/DFi), where TFij is the number of occurrences of the ith term within the jth document, and DFi is the number of documents (out of n) in which the term appears. Advantages and Disadvantages: It requires too much computation for real world documents. C. Kernel-based XML Mining The kernel methods for structured data are applicable to XML mining in various ways. The string kernels with minor modifications are used to compare two pieces of context information (i.e., successive element labels in a path). Undoubtedly, the VSM also provides the same state of-theart performance in XML content mining as in traditional text mining. The tree kernels are used to measure the structural similarity between XML schema documents and between XML instance documents. By transforming the tree structure into a string (e.g., by a depth-first traversal), the string kernels and the VSM are also used to compute the structural similarity. Advantages and Disadvantages: The kernels are used to measure the structural similarity between XML schema documents and between XML instance documents by transforming the tree structure into a string. Since kernels are mainly developed for bio/chemistry -informatics, new variant kernels for texts (and XML contents) are required. D. First FP-Growth Algorithm In [5] Nassem et al. constructs an FP-tree structure and mines frequent patterns by traversing the constructed FP-tree. The FP-tree structure is an extended prefix-tree structure involving crucial condensed information of frequent patterns. In addition, some features have been suggested that need to be added into XQuery in order to make the implementation of the First Frequent Pattern growth more efficient. In future work of paper it was planned to implement other standard data mining algorithms which can be expressed in XQuery to improve the performance of the results. Advantages and Disadvantages: This method is advantageous because, it doesn’t generate any candidate items. It is disadvantageous because, it suffers from the issues of special and temporal locality issues. www.ijrcct.org Following are experimental results of the parameters that are calculated for FP-Growth Algorithm when executed on HTML and XML contentsTABLE 1 S.No URLs Time Precision Recall Taken FMeasure (ms) 1 https://in.yahoo. 2 http://www.ama 3 https://www.fac 4 http://www.eba 5 http://in.msn.co 6 http://www.hom 7 http://www.flip 8 http://adaptive- 9 http://tanishq.co 10 http://www.natu 7000 0.0208 0.247 0.7422 7000 0.1818 0.131 0.395 4000 0.2 0.0769 0.230 6000 0.0256 0.0763 0.2289 7000 0.0232 0.1575 0.4725 6000 0.0769 0.2363 0.7090 5000 0.0181 0.0516 0.155 1000 0.0714 0.3043 0.9130 7000 0.0909 0.0973 0.2920 6000 0.0303 0.2773 0.8319 com/?p=us zon.in/ ebook.com/ y.in/ m/ eshop18.com/ kart.com/ images.com/ .in/Home rewallpaper.net/ Results of FP-Tree Algorithm E. Canonical Tree (CAN Tree) It is a tree structure that arranges or orders the nodes of a tree in some canonical order. It follows a tree-based incremental mining approach. Like FP-tree approach, there is no need to rescan the transactional database when it is updated. Because of following the canonical order, frequency changes (if any) due to incremental updates like insertion, deletion and modification of the transactions will not affect the ordering of the nodes in CAN tree. [8] After constructing the CAN tree, the frequent patterns can be mined from the tree. After constructing the CAN-tree, we have to mine the frequent patterns by traversing the tree in an upward direction. This can be done similar to FP-growth by constructing a header table and finding only the frequent items. Advantages and Disadvantages: This method is advantageous because it supports incremental updates without any major changes in the tree. Page 671 International Journal of Research in Computer and ISSN (Online) 2278- 5841 Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156 F. COFI-Tree COFI is much faster than FP-Growth and requires significantly less memory. The idea of COFI is to build projections from the FP-tree each corresponding to sub transactions of items co-occurring with a given frequent item. These trees are built and efficiently mined one at a time making the footprint in memory significantly small. The COFI algorithm generates candidates using a top down approach, where its performance shows to be severely affected while mining databases that has potentially long candidate patterns that turns to be not frequent, as COFI needs to generate candidate sub patterns for all its candidates patterns and also build upon the COFI approach to find the set of frequent patterns but after avoiding generating useless candidates. [8] The basic idea of algorithm is based on the notion of maximal frequent patterns. A frequent item set X is said to be maximal if there is no frequent item set X’ such that X€ X’. Advantages and Disadvantages: This method is advantageous because it requires only one scan of the database and the trees are ordered according to their local frequency in the paths. It is disadvantageous because lot of computation is required to build the tree. G. Improved Frequent Pattern Tree Algorithm In [9] Mishra et al. have proposed an improved technique that extract association rules from XML documents without any preprocessing or post processing. The proposed improved algorithm for mining the complete set of frequent patterns used patterns fragment growth. First Frequent Pattern-tree based mining adopts a pattern fragment growth method to avoid the costly generation of large number of candidate sets and a partition based, divide-and-conquer method is used. They proposed an association data mining tool for XML data mining. It will increases the mining efficiency and also takes less memory. Advantages and Disadvantages: It recovers the weakness of some traditional data mining algorithms. It is difficult to implement for complex structures of XML. IV. CONCLUSION Web Mining is the application of data mining which aims to extract knowledge from the web data. As advanced technology is emerging, web sites are now developing with newer and safer methods like with XML, flash techniques. In this paper various techniques for XML and HTML mining are reviewed. Some of these techniques are executed and results are also provided. As web is growing rapidly and new advancements are coming on daily basis, there is need of reliable tools in order to fetch web data. www.ijrcct.org Most of the mining work is done on HTML contents. However as XML is a new technology and its usage is increasing, it is need of the time to build efficient methods for XML contents mining too. ACKNOWLEDGMENT As a part of my course I have reviewed “A Review of Various Techniques of Web Content Mining for HTML and XML Contents”. I am very thankful to Ms. Kamaljit Kaur, Assistant Professor, Sri Guru Granth Sahib World University, Fatehgarh Sahib for giving me such a valuable support in doing my work. Last but not least, a word of thanks for the authors of all those books and papers which I have consulted during my work. REFERENCES [1] O. Etzioni, "The world wide Web: Quagmire or gold mine.” Communications of the ACM, Vol. 39 No. 11, pp. 65-68, Nov. 1996. [2] R. Kosala, H. Blockeel “Web mining research: A survey,” ACM SIGKDD Explorations, Vol. 2 No. 1, pp. 1-15, June 2000. [3] Qin Ding and Gnanasekaran Sundarraj, “Association Rule Mining from XML Data”, Conference on Data Mining | DMIN'06 |. [4] Krishna Murthy. A, Suresha, “XML: URL Data Set Creation for Future Web Mining Research Avenues”, International Journal of Computer Applications (0975 – 8887), 2011. [5] Mohammad. Nassem, Prof. Sirish Mohan Dubey, “First Frequent Pattern-tree based XML pattern fragment growth method for Web Contents”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 1, January 2012. [6] Ganesh Dhar, Govind Murari Upadhyay, “Web Mining: Concepts and Decision Making Aid”. Volume 2, Issue 7, July 2012. [7] Darshna Navadiya, Roshni Patel, “Web Content Mining TechniquesA Comprehensive Survey”, IJERT, Vol. 1 Issue 10, December- 2012. [8] Amit Kumar Mishra, Hitesh Gupta, “A Recent Review on XML data mining and FFP”, International Journal of Engineering Research, Volume No.2, Issue No.1, pp : 07-12 [9] Amit Kumar Mishra, Hitesh Gupta, “A Novel FP-Tree Algorithm for Large XML Data Mining”, International Journal of Engineering Sciences and Research Technology, [Mishra, 2(3): March, 2013]. [10]Abdelhakim Herrouz, Chabane Khentout Mahieddine Djoudi, “Overview of Web Content Mining Tools”, The International Journal of Engineering And Science, Volume 2, Issue 6, pp. 106-110, June 2013. [11] Buhwan Jeong, Daewon Lee, Jaewook Lee, and Hyunbo Cho, “Towards XML Mining: The Role of Kernel Methods”. [12] D. Jayalatchumy, Dr. P.Thambidurai, “Web Mining Research Issues and Futur Directions – A Survey”, IOSR Journal of ComputerEngineering (IOSR-JCE), Volume 14, Issue 3 (Sep. - Oct. 2013), pp. 20-27. [13] http://en.wikipedia.org/wiki/Web_content. [14] http://webdesign.about.com/od/content/qt/what-is-web-content.htm. Page 672