International Journal of Computer Application (2250-1797) Volume 5, No. 7, December 2015

WEB PAGE CHANGE DETECTION THROUGH MACHINE LEARNING ALGORITHMS

R. Kousalya, MCA., M.Phil. (Ph.D), Head, Department of Computer Applications, Dr. N.G.P Arts and Science College, Coimbatore. [email protected]

J. Rubana Priyanga, M.Phil Scholar, Dr. N.G.P Arts and Science College, Coimbatore. [email protected]

ABSTRACT

This research work addresses the detection of changes in web pages. The existing CaSePer technique is taken as the baseline and compared with the proposed technique, Arti-Q Dynamic Web Page Change Detection (Arti-Q DWPD). Artigence is the technique built on the Arti-Q algorithm, which is used to reduce the response time of each request. The Arti-Q technique introduces two heuristic methods, an internal proxy and an external proxy. The methodology used for web page change detection is Net Tracker, which tracks a commodity market server to detect rate changes in gold, silver, INR, palladium, platinum, rhodium and crude. The proposed Arti-Q DWPD technique receives the URL as a parameter, retrieves the HTML source code and uses substring comparison to filter out the commodity rates. The detected rates are compared with the most recently retrieved rates; if any change is identified, it is displayed in the list view control, and for every change detected the variant count is incremented by 1. The process repeats each time the threshold interval of 0.5 seconds elapses. The experimental results show that the implementation improves the accuracy level compared with the existing implementation.

Keywords: CaSePer technique, Arti-Q DWPD technique, Net Tracker, internal proxy and external proxy.

INTRODUCTION

Web mining is a type of data mining used to extract data from web pages. Whereas data mining basically deals with structured data, web mining deals with unstructured and semi-structured data.
Web mining consists of three techniques for web data extraction: web content mining, web structure mining and web usage mining. The World Wide Web has evolved from a simple, static content source into a richly dynamic information delivery channel. Hence, tracking the changes in web pages has become an interesting research problem; solving it would help users make efficient use of the information that web pages provide.

WEB USAGE MINING IN DYNAMIC WEB PAGE CHANGE DETECTION

Dynamic web page change detection is carried out under web mining concepts, and web usage mining is the basic component used to detect changes in a web page. Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. It is a process of mining useful information from server logs. Usage data captures the identity or origin of web users along with their browsing behavior at a web site.

Web usage mining can be classified further by the kind of usage data considered: web server data, application server data and application level data. In web server data, user logs are collected by the web server and typically include the IP address, page reference and access time. In application server data, commercial application servers such as WebLogic and StoryServer have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs. In application level data, new kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events.
It must be noted, however, that many end applications require a combination of one or more of these techniques.

[Figure: Types of Web Mining]

CHANGE DETECTION AND NOTIFICATION

Change detection and notification (CDN) refers to the automatic detection of changes made to World Wide Web pages and the notification of interested users by email or other means. Whereas search engines are designed to find web pages, CDN systems are designed to monitor changes to web pages. Without change detection and notification, users have to check for web page changes manually, either by revisiting web sites or by periodically searching again. Efficient and effective change detection and notification is hampered by the fact that most servers do not accurately track content changes through Last-Modified headers.

PROBLEM DEFINITION

Due to limited network and computational resources, it is often difficult to monitor the sources constantly to check for changes and to download changed data items to the copies. Finding the changes between different versions of a web page is the core operation of a web page change detection system. Users who visit a web page repeatedly at frequent intervals are more interested in the recent changes on the page than in its entire contents. Because of the increased dynamism of web pages, it is difficult for the user to identify the changes manually. The main objective is to reduce the complexity of change detection by focusing only on the parts of the web page in which changes have occurred.

RELATED WORK

L. Liu, C. Pu and W. Tang [5] present WebCQ, which offers personalized delivery of change notifications as well as summarization and prioritization of web page changes. One drawback of WebCQ is that it detects changes only between the last two versions of an HTML page.
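Comparing the last two versions of a page, the granularity at which WebCQ is said to operate, can be illustrated with a minimal sketch. It uses Python's standard `difflib` module; `line_changes` is a hypothetical helper name, and line-level comparison is an assumption made for illustration, not WebCQ's actual method.

```python
import difflib


def line_changes(old_html: str, new_html: str):
    """Return (operation, old_lines, new_lines) for every non-equal region.

    Operations are the difflib opcode tags: 'replace', 'delete', 'insert'.
    """
    old_lines = old_html.splitlines()
    new_lines = new_html.splitlines()
    # autojunk=False keeps difflib from ignoring frequently repeated lines.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, old_lines[i1:i2], new_lines[j1:j2]))
    return changes


# Two stored versions of the same (made-up) page.
old_version = "<h1>Rates</h1>\n<p>Gold 1500</p>\n<p>Silver 33</p>"
new_version = "<h1>Rates</h1>\n<p>Gold 1520</p>\n<p>Silver 33</p>\n<p>Crude 90</p>"

for change in line_changes(old_version, new_version):
    print(change)
```

Because only two versions are compared, any change that was made and then reverted between the two snapshots is invisible, which is the limitation noted above.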
An XML diff algorithm (cf. [6]) is efficient in speed and memory space. It uses operations such as change node, delete node and insert node. A delta is constructed by finding the matching of nodes between two trees. The use of XML specificities in the algorithm leads to significant improvements. One drawback of this algorithm is some loss of quality. Another is the need to gather more statistics about the size of deltas, in particular for real web data.

Varshney Naveen Kumar and Dilip Kumar Sharma [12] propose a tree-based approach that compares the nodes of two trees, the old web page tree and the modified web page tree. It assigns relevancy to the web pages and notifies the user about the detected changes. To detect structural change, a document tree is constructed, and the signature values assigned to the root nodes and child nodes of the old and new web pages are compared. The work also offers a good comparative study of different algorithms and provides a simple method for detecting changes.

K.S. Kuppusamy and G. Aghila [13] propose the CaSePer model (Change detection based on Segmentation with Personalization), an enhanced model for detecting changes in pages. The change detection is micro-managed by introducing web page segmentation, and the process is made efficient by performing it in two steps. Changes in web pages can be either content changes or structural changes. Structural changes are primarily changes that occur at the template level. Content changes include modification, insertion or deletion of hypertext elements. Users navigate to web pages that they are interested in. A user interested in a specific topic might visit web pages related to that topic at regular intervals over time. Such users might be interested in knowing the recent changes that have occurred in a web page rather than seeing the entire web page.
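The signature comparison used by the tree-based approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: the `Node` class, the SHA-256 signature and the path format are all hypothetical choices made here. Each node's signature covers its tag, its text and its children's signatures, so two subtrees with equal signatures can be skipped without visiting their descendants.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document-tree node: a tag, its text and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def signature(node: Node) -> str:
    """Hash of the node's tag, text and the signatures of its children."""
    h = hashlib.sha256()
    h.update(node.tag.encode())
    h.update(b"\x00")
    h.update(node.text.encode())
    for child in node.children:
        h.update(signature(child).encode())
    return h.hexdigest()


def changed_paths(old: Node, new: Node, path: str = "") -> list:
    """Return paths of changed nodes; numbers in a path are child indices."""
    here = f"{path}/{old.tag}"
    if signature(old) == signature(new):
        return []  # identical subtree, nothing below needs checking
    if old.tag != new.tag or len(old.children) != len(new.children):
        return [here]  # structural change: report the whole subtree
    changes = []
    if old.text != new.text:
        changes.append(here)  # content change at this node
    for i, (o, n) in enumerate(zip(old.children, new.children)):
        changes.extend(changed_paths(o, n, f"{here}/{i}"))
    return changes


old_page = Node("html", children=[
    Node("h1", "Rates"),
    Node("div", children=[Node("span", "1500")]),
])
new_page = Node("html", children=[
    Node("h1", "Rates"),
    Node("div", children=[Node("span", "1520")]),
])
print(changed_paths(old_page, new_page))
```

This sketch recomputes signatures at every level for brevity; a practical version would compute each signature once and cache it on the node.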
SYSTEM ARCHITECTURE

The proposed work includes processing functionalities and interfaces for processing user requests, fetching web pages from the internet and allowing users to select a zone in a web page to monitor. This method performs well: it detects structural as well as content-level changes even at the minute level and helps to locate minor or major changes within the selected zone of the document. The HTML source of a page is extracted using an HTTP web request function, which takes the website address as a parameter and obtains the Uniform Resource Locator (URL) of the internet resource that actually responds to the request.

A web crawler is a system for the bulk downloading of web pages. A crawler starts by placing an initial set of URLs in a queue, where all URLs to be retrieved are kept and prioritized. The crawler takes a URL from this queue in some order, downloads the page, extracts any URLs within the downloaded page and puts the new URLs in the queue. Artigence is the technique that uses the Arti-Q DWPD method to reduce the waiting time of web page requests and to calculate the threshold value.

[Figure: ARTI-Q Web Page Change Detection System. Components: input unit, fetch unit, web crawler, HTML-to-XML converter, selection of zone in web page, CC tree builder for selected zone, comparator, presentation and storage module, highlighted output of web page.]

ARTI-Q ALGORITHM

Step 1: Create two 2-dimensional arrays, named external proxy Ep(i, j) and internal proxy Ip(i, j), where 'i' represents the queue number and 'j' represents the substring data that holds the silver, platinum, palladium, crude, rhodium and INR values.
Step 2: Create an Arti-Q timer control 'T' in Ip(i, j) with the interval set to 0.5 seconds.
Step 3: Initialize the predefined syntax Ps and the variant counter V = 0.
Step 4: Each time 'T' hits 0.5 seconds:
    Read Ep(i, j) extracted from the web page.
    Compare Ip(i, j) with Ep(i, j); if a variant is detected, increment V = V + 1.
    Set Ep(i, j) = 0 (reset the external proxy counter to zero).
Step 5: Repeat Step 4.

RESULT ANALYSIS

The proposed methodologies for web page change detection have been applied in the Arti-Q Dynamic Web Page Change Detection technique. The output for the given query is processed and implemented in the WEKA tool. The page source is extracted using the DWPD method. The URL is passed as a parameter in the direct cast method, the page source is extracted using the GET function, and the values are stored in a temporary array. A timer control is implemented so that the page source is extracted every 30 seconds and compared with the temporary array to check whether any change has been made to the content. If a change is detected, the changed value is displayed to the user. The results indicate that the output of the proposed process is better than that of the existing process.

Arti-Q DWPD

Duration          No. of Variants   Duplicates Detected
1-30 minutes      62                0
31-60 minutes     57                0
61-90 minutes     58                0
91-120 minutes    49                0

[Figure: Evaluated Arti-Q DWPD. Number of variants per interval (y-axis: no. of variants; x-axis: time): 1-30 minutes 62, 31-60 minutes 57, 61-90 minutes 58, 91-120 minutes 49.]

The Arti-Q DWPD table shows that in the first 1-30 minutes, 62 variants were detected with no duplicates identified; in the next 31-60 minutes, 57 variants were detected with no duplicates identified. In the next 61-90 minutes, 58 variants were detected with no duplicates identified, and in the last 91-120 minutes, 49 variants were detected with no duplicates identified.
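The polling procedure described in this section (fetch the page source, filter the rates by substring comparison, compare them with the last retrieved values and count the variants) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: `extract_rate`, `poll`, the label markers and the sample pages are all hypothetical, the internal proxy is modeled as a dictionary of last-seen values, and the first reading of each rate is assumed not to count as a variant. The page fetcher is passed in as a function so the example runs without network access.

```python
import time
from typing import Callable, Dict, List, Tuple


def extract_rate(html: str, label: str, end: str = "<") -> str:
    """Return the text between `label` and the next `end`, or "" if absent."""
    start = html.find(label)
    if start == -1:
        return ""
    start += len(label)
    stop = html.find(end, start)
    return html[start:stop if stop != -1 else len(html)].strip()


def poll(fetch: Callable[[], str],
         labels: Dict[str, str],
         ticks: int,
         interval: float = 0.5,
         sleep: Callable[[float], None] = time.sleep,
         ) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Poll `fetch` `ticks` times; return the variant count and change list."""
    internal: Dict[str, str] = {}  # Ip: last retrieved value per commodity
    variants = 0                   # V: number of changes detected
    changes: List[Tuple[str, str, str]] = []
    for _ in range(ticks):
        page = fetch()
        # Ep: values filtered out of the freshly fetched page source
        external = {name: extract_rate(page, marker)
                    for name, marker in labels.items()}
        for name, value in external.items():
            if name in internal and value != internal[name]:
                variants += 1
                changes.append((name, internal[name], value))
            internal[name] = value  # remember the value for the next tick
        sleep(interval)
    return variants, changes


# Made-up page snapshots standing in for three successive fetches.
pages = iter([
    "<td>Gold: 1500</td><td>Silver: 33</td>",
    "<td>Gold: 1500</td><td>Silver: 34</td>",
    "<td>Gold: 1520</td><td>Silver: 34</td>",
])
count, log = poll(lambda: next(pages),
                  {"gold": "Gold:", "silver": "Silver:"},
                  ticks=3,
                  sleep=lambda seconds: None)  # skip real waiting in the demo
print(count, log)
```

Against a live page, `fetch` could be replaced by a function that downloads the URL, for example with `urllib.request.urlopen`, and the `sleep` argument left at its default so that each tick waits for the chosen interval.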
COMPARISON OF VARIANTS DETECTED BETWEEN EXISTING AND PROPOSED

Duration          CaSePer   Arti-Q
1-30 minutes      75        62
31-60 minutes     68        57
61-90 minutes     79        58
91-120 minutes    65        49

The experimental results show that the proposed approach provides a higher performance rate compared with conventional methods. The threshold elapses every 0.5 seconds, and the values of the rate changes are listed every 0.5 seconds.

[Figures: Threshold based Silver rate variance; Threshold based Gold rate variance; Threshold based Palladium rate variance; Threshold based Crude rate variance; Threshold based Platinum rate variance.]

CONCLUSION AND FUTURE WORK

This research work proposed Arti-Q based dynamic web page change detection for a selected zone, which helps to reduce browsing time. The web page change detection system uses a generalized technique to detect changes within the selected zone. It detects changes in text and attributes at the minute level and shows the changed text in red colour. This saves the time of detecting changes in the entire web page. From the analysis of various research papers, it is concluded that using a web page change detection system saves the user browsing time. Every algorithm plays some role in improving the performance of the previous algorithm.
Detection of changes in the structure, content, presentation and behaviour of a web page helps the user to know about updated information. The algorithm for checking page update parameters is designed to show even the smallest change in a web page, and the values stored for the parameters are distinct for small details as well. As future work, various performance parameters such as running time, complexity of computation, number of nodes, depth of tree and load balancing of user requests for notification can be measured and improved in the various algorithms. The algorithms can also be implemented for dynamic web pages or for password-protected web sites.

REFERENCES

[1] D. Hirschberg, "Algorithms for the longest common subsequence problem," Journal of the ACM, vol. 24, no. 4, pp. 664-675, 1977.
[2] F. Douglis, et al., "The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web," WWW, vol. 1, no. 1, pp. 27-44, Jan. 1998.
[3] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, nos. 1-7, pp. 107-117, 1998.
[4] Heydon and Najork, "Mercator: A scalable, extensible Web crawler," World Wide Web, vol. 2, no. 4, pp. 219-229, 1999.
[5] L. Liu, C. Pu, and W. Tang, "WebCQ: Detecting and Delivering Information Changes on the Web," Proc. Ninth Int'l Conf. Information and Knowledge Management, pp. 512-519, 2000.
[6] G. Cobena, S. Abiteboul, and A. Marian, "Detecting Changes in XML Documents," Proc. 18th Int'l Conf. Data Eng., pp. 41-52, 2002.
[7] S. Flesca and E. Masciari, "Efficient and effective web change detection," Data and Knowledge Engineering, vol. 46, no. 2, pp. 203-224, 2003.
[8] Song and Bhowmick, "BioDIFF: An Effective Fast Change Detection for Genomic and Proteomic Data," Proc. Thirteenth ACM Conference on Information and Knowledge Management, pp. 146-147, Nov. 2004.
[9] S. Chakravarthy and Subramanian, "Automating change detection and notification of web pages," Proc. 17th Int'l Conf. on DEXA, IEEE, 2006.
[10] D. Yadav, A.K. Sharma, and J.P. Gupta, "Change Detection in Web Page," Proc. 10th International Conference on Information Technology, pp. 265-270, 2007.
[11] I. Khoury, R. M. El-Mawas, O. ElRawas, Elias F. Mounayar, and H. Artail, "An Efficient Web Page Change Detection System Based on an Optimized Hungarian Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 599-612, May 2007.
[12] Varshney Naveen Kumar and Dilip Kumar Sharma, "A Novel Architecture and Algorithm for Web Page Change Detection," IEEE IACC, pp. 782-787, 2013.
[13] K.S. Kuppusamy and G. Aghila, "CaSePer: An efficient model for personalized web page change detection based on segmentation," Elsevier, Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India, (2014) 26, 19-27.

R. Kousalya, M.C.A., M.Phil., (Ph.D), Assistant Professor and Head, Department of Computer Application, Dr. N.G.P Arts and Science College, Coimbatore, India. E-mail id: [email protected]

J. Rubana Priyanga, M.Phil Research Scholar, Dr. N.G.P Arts and Science College, Coimbatore, India. E-mail id: [email protected]