International Journal of Computer Application (2250-1797)
Volume 5– No. 7, December 2015
WEB PAGE CHANGE DETECTION THROUGH
MACHINE LEARNING ALGORITHMS
R.Kousalya MCA., M.Phil. (Ph.D)
Head Department of Computer Applications
Dr.N.G.P Arts and Science College, Coimbatore.
[email protected]
J.Rubana Priyanga M.Phil Scholar
Dr.N.G.P Arts and Science College, Coimbatore.
[email protected]
ABSTRACT
This research work addresses the detection of changes in web pages. The existing CaSePer
technique is taken as the baseline and compared with the proposed technique, Arti-Q
Dynamic Web Page Change Detection (Arti-Q DWPD). Artigence is the technique behind the
Arti-Q algorithm, which is used to reduce the response time for each request. The Arti-Q
technique introduces two heuristic structures, an internal proxy and an external proxy. The
methodology uses Net Tracker to track a commodity market server and detect rate changes in
gold, silver, INR, palladium, platinum, rhodium and crude. The proposed Arti-Q DWPD
technique receives the URL as a parameter, retrieves the HTML source code and uses
substring comparison to filter out the commodity rates. The detected rates are compared with
the most recently retrieved rates; if any change is identified, it is displayed in a list
view control, and for every change detected the variant count is incremented by 1. The
process repeats each time the timer threshold of 0.5 seconds elapses. The experimental
results show that the implementation improves accuracy compared with the existing
implementation.
Keywords: CaSePer technique, Arti-Q DWPD technique, Net Tracker, internal proxy and
external proxy.
INTRODUCTION
Web mining is a type of data mining used to extract data from web pages. Whereas data
mining basically deals with structured data, web mining deals with unstructured and
semi-structured data. Web mining consists of three techniques for web data extraction: web
content mining, web structure mining and web usage mining. The World Wide Web has evolved
from a simple, static content source into a richly dynamic information delivery channel.
Hence, tracking the changes in web pages has become an interesting research problem;
solving it would help users to efficiently utilize the information that web pages provide.
WEB USAGE MINING IN DYNAMIC WEB PAGE CHANGE DETECTION
Dynamic web page change detection is carried out within the framework of web mining, and
web usage mining is the basic component used to detect changes in a web page. Web usage
mining is the application of data mining techniques to discover interesting usage patterns
from web usage data, in order to understand and better serve the needs of web-based
applications. It is a process of mining useful information from server logs. Usage data
captures the identity or origin of web users along with their browsing behavior at a web
site.
Web usage mining can be classified further according to the kind of usage data
considered: web server data, application server data and application level data. In web
server data, user logs are collected by the web server and typically include the IP address,
page reference and access time. In application server data, commercial application servers
such as WebLogic and StoryServer have significant features that enable e-commerce
applications to be built on top of them with little effort. A key feature is the ability to
track various kinds of business events and log them in application server logs. In
application level data, new kinds of events can be defined in an application, and logging
can be turned on for them, generating histories of these events. It must be noted, however,
that many end applications require a combination of one or more of these techniques.
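As a minimal sketch of the web server data described above, the following parses one log
line into the IP address, page reference and access time. The Common Log Format layout, the
name parse_log_line and the sample line are assumptions made for illustration; the paper
does not specify a log format.

```python
import re
from datetime import datetime

# Assumed Common Log Format: host ident user [time] "method page protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" \d{3} \S+'
)


def parse_log_line(line):
    """Return (ip, page, access_time) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    access_time = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), match.group("page"), access_time


# Hypothetical log line used only to exercise the parser.
sample = '192.0.2.10 - - [10/Dec/2015:13:55:36 +0530] "GET /rates/gold.html HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```

Records of this shape are the raw material from which usage patterns would then be mined.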
Figure: Types of Web Mining
CHANGE DETECTION AND NOTIFICATION
Change detection and notification (CDN) refers to the automatic detection of changes made
to World Wide Web pages and the notification of interested users by email or other means.
Whereas search engines are designed to find web pages, CDN systems are designed to
monitor changes to web pages. Without change detection and notification, users must
manually check for web page changes, either by revisiting web sites or by periodically
searching again. Efficient and effective change detection and notification is hampered by
the fact that most servers do not accurately track content changes through Last-Modified
headers.
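Because Last-Modified headers cannot be relied upon, a detector can compare the page
content itself. The following is a minimal sketch of that idea using a content hash; the
function names and the sample pages are hypothetical, and the paper does not state that it
uses hashing.

```python
import hashlib


def content_fingerprint(html):
    """Hash the page body so a change is visible without a Last-Modified header."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, html):
    """Return (changed, new_fingerprint) for a newly fetched page body."""
    new_fingerprint = content_fingerprint(html)
    return new_fingerprint != previous_fingerprint, new_fingerprint


# Hypothetical page bodies standing in for two fetches of the same URL.
old_page = "<html><body>Gold: 1500.00</body></html>"
new_page = "<html><body>Gold: 1520.00</body></html>"

stored = content_fingerprint(old_page)
print(has_changed(stored, old_page)[0])  # False
print(has_changed(stored, new_page)[0])  # True
```

A whole-page hash only says that something changed; locating the change needs a finer
comparison of the kind discussed in the related work.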
PROBLEM DEFINITION
Due to limited network and computational resources, it is often difficult to monitor the
sources constantly to check for changes and to download changed data items to the copies.
Finding the changes between different versions of a web page is the core operation of a web
page change detection system. Users who visit a web page repeatedly at frequent intervals
are more interested in knowing the recent changes that have occurred on the page than in
the entire contents of the web page. Because of the increased dynamism of web pages, it is
difficult for the user to identify the changes manually. The main objective is to reduce
the complexity of change detection by focusing only on the parts of the web page in which
changes have occurred.
RELATED WORK
L. Liu, C. Pu and W. Tang proposed WebCQ, which offers personalized delivery of change
notifications as well as summarization and prioritization of web page changes. One
drawback of WebCQ is that it detects changes only between the last two versions of an
HTML page. Another algorithm, for detecting changes in XML documents, is efficient in speed
and memory space. It uses operations such as change node, delete node and insert node. A
delta is constructed by finding the matching of nodes between two trees. The use of XML
specificities in the algorithm leads to significant improvements. One drawback of this
algorithm is that there is some loss of quality. Another drawback is the need to gather
more statistics about the size of deltas, in particular for real web data.
Varshney Naveen Kumar and Dilip Kumar Sharma proposed a tree-based approach that
compares the nodes of two trees, the old web page tree and the modified web page tree. It
assigns relevancy to the web pages and notifies the user about the detected changes. To
detect structural changes, a document tree is constructed, and the signature values
assigned to the root nodes and child nodes of the old and new web pages are then compared.
The work gives a good comparative study of the different algorithms and provides a simple
method for detecting changes.
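The signature comparison described above can be sketched as follows. This is an
illustrative reading, not the authors' algorithm: the Node class, the use of SHA-256 as the
signature and the path notation are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def signature(node):
    """Signature of a node: hash of its tag, its text and its children's signatures."""
    digest = hashlib.sha256()
    digest.update(node.tag.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(node.text.encode("utf-8"))
    for child in node.children:
        digest.update(signature(child).encode("ascii"))
    return digest.hexdigest()


def changed_paths(old, new, path="", index=0):
    """Return paths of the deepest nodes whose signatures differ between two trees."""
    here = f"{path}/{new.tag}[{index}]"
    if signature(old) == signature(new):
        return []
    if old.tag != new.tag or len(old.children) != len(new.children):
        return [here]  # structural change at this node
    found = []
    for i, (old_child, new_child) in enumerate(zip(old.children, new.children)):
        found.extend(changed_paths(old_child, new_child, here, i))
    # If no child differs, the node's own text must have changed.
    return found or [here]


# Hypothetical old and new versions of a small page tree.
old_tree = Node("html", children=[Node("body", children=[
    Node("td", "Gold 1500"), Node("td", "Silver 33")])])
new_tree = Node("html", children=[Node("body", children=[
    Node("td", "Gold 1520"), Node("td", "Silver 33")])])

print(changed_paths(old_tree, new_tree))  # ['/html[0]/body[0]/td[0]']
```

Equal root signatures mean the trees are identical, so unchanged subtrees are skipped
without visiting their children. The sketch recomputes signatures on each call for brevity;
a real system would cache them.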
K.S. Kuppusamy and G. Aghila proposed the CaSePer model (Change detection based on
Segmentation with Personalization), an enhanced model for detecting changes in web pages.
Change detection is micro-managed by introducing web page segmentation, and the process is
made efficient by performing it in two steps. Changes in web pages can be either content
changes or structural changes. Structural changes are primarily changes that occur at the
template level. Content changes include the modification, insertion or deletion of
hypertext elements. Users navigate to web pages that they are interested in. A user
interested in a specific topic might visit web pages related to that topic at regular
intervals over time. Such users might be more interested in knowing the recent changes that
have occurred in a web page than in seeing the entire web page.
SYSTEM ARCHITECTURE
The proposed work includes processing functionalities and interfaces for processing user
requests, fetching web pages from the internet and allowing users to select a zone in a web
page to monitor. The method performs well and is able to detect structural as well as
content-level changes even at the minute level, and it helps to locate minor or major
changes within the selected zone of the document. The HTML source of a page is extracted
using an HTTP web request function, which takes the website address as a parameter and
obtains the Uniform Resource Locator (URL) of the internet resource that actually responds
to the request. A web crawler is a system for the bulk downloading of web pages. A crawler
starts by placing an initial set of URLs in a queue, where all URLs to be retrieved are kept
and prioritized. The crawler takes a URL from this queue in some order, downloads the page,
extracts any URLs within the downloaded page and puts the new URLs in the queue. Artigence
is the technique that uses the Arti-Q DWPD method to reduce the waiting time of web page
requests and to calculate the threshold value.
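The crawler loop described above can be sketched as follows. The sketch assumes a plain
first-in, first-out queue rather than a prioritized one, a simple regular expression for
link extraction, and an in-memory dictionary (FAKE_WEB, with hypothetical URLs) in place of
real downloads so that it runs without network access.

```python
import re
from collections import deque

# Simplified link extraction; a real crawler would use an HTML parser.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=100):
    """Take a URL from the queue, download it, then enqueue any new URLs it contains."""
    frontier = deque(seed_urls)   # URLs waiting to be retrieved
    seen = set(seed_urls)         # URLs already queued, to avoid revisiting
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # page could not be downloaded
            continue
        pages[url] = html
        for link in HREF_PATTERN.findall(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical stand-in for the web.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/gold">gold</a> '
                            '<a href="http://example.test/silver">silver</a>',
    "http://example.test/gold": '<a href="http://example.test/">home</a>',
    "http://example.test/silver": "no links here",
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(list(pages))
```

Passing the download function in as `fetch` keeps the queue logic separate from network
access, so the same loop could be driven by a real HTTP client.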
Figure: Web Page Change Detection System (components: input unit, fetch unit, web crawler,
HTML to XML converter, selection of zone in web page, tree builder for selected zone,
comparator, Arti-Q, presentation and storage module, highlighted output of web page)
ARTI-Q ALGORITHM
Step 1: Create two 2-dimensional arrays named external proxy Ep(i, j) and internal proxy
Ip(i, j), where 'i' represents the queue number and 'j' represents the substring data, which
holds the silver, platinum, palladium, crude, rhodium and INR values.
Step 2: Create an Arti-Q timer control 'T' in Ip(i, j) with its interval set to 0.5 seconds.
Step 3: Initialize the predefined syntax Ps and the variant count V = 0.
Step 4: Each time 'T' reaches 0.5 seconds:
Read Ep(i, j) extracted from the web page.
Compare Ip(i, j) with Ep(i, j); if a variant is detected, increment V = V + 1.
Set Ep(i, j) = 0 (reset the external proxy counter to zero).
Step 5: Repeat Step 4.
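The steps above can be sketched as a polling loop. This is one possible reading rather than
the authors' implementation: the proxies are modelled as dictionaries instead of
2-dimensional arrays, the internal proxy is assumed to be refreshed after each comparison
(the steps do not say so explicitly), and the names arti_q_poll and read_external are
hypothetical.

```python
import time

COMMODITIES = ("gold", "silver", "platinum", "palladium", "crude", "rhodium", "inr")


def arti_q_poll(read_external, ticks, interval=0.5, sleep=time.sleep):
    """Poll `ticks` times and count a variant whenever a rate differs from the stored one."""
    internal = {}    # Ip: last accepted rate per commodity
    variants = 0     # V
    changes = []     # (commodity, old rate, new rate) for display
    for _ in range(ticks):
        external = read_external()         # Ep: rates extracted from the page this tick
        for name in COMMODITIES:
            rate = external.get(name)
            if rate is None:
                continue
            if name in internal and internal[name] != rate:
                variants += 1
                changes.append((name, internal[name], rate))
            internal[name] = rate          # assumption: Ip is refreshed after comparing
        external.clear()                   # reset the external proxy
        sleep(interval)                    # wait for the next timer tick
    return variants, changes


# Hypothetical extracted rates for three consecutive ticks.
snapshots = iter([
    {"gold": 1500.0, "silver": 33.0},
    {"gold": 1500.0, "silver": 33.5},
    {"gold": 1520.0, "silver": 33.5},
])

# The sleep is replaced by a no-op so the example finishes immediately.
variants, changes = arti_q_poll(lambda: next(snapshots), ticks=3, sleep=lambda _: None)
print(variants)  # 2
print(changes)   # [('silver', 33.0, 33.5), ('gold', 1500.0, 1520.0)]
```

The first tick only fills the internal proxy, so it never counts as a variant; every later
tick counts one variant per commodity whose rate moved.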
RESULT ANALYSIS
The proposed methodology for web page change detection has been applied in Arti-Q
Dynamic Web Page Change Detection. The output for the given query is processed and
implemented in the WEKA tool. The page source is extracted using the DWPD method: the URL
is passed as a parameter in the direct cast method and, using the GET function, the page
source is extracted and stored in a temporary array. A timer control is implemented so that
the page source is extracted every 30 seconds and compared with the temporary array to
determine whether the content has changed. If a change is detected, the changed value is
displayed to the user. The results show that the output of the proposed process is better
than that of the existing process.
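The extraction step can be sketched as follows: fetch the page source with a GET request,
then use substring search to cut a rate out of the HTML. The function names, the marker
strings and the sample markup are assumptions; the actual layout of the commodity page is
not given in the paper.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Retrieve the HTML source of `url` with an HTTP GET request."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rate(html, start_marker, end_marker="<"):
    """Return the number found between start_marker and end_marker, or None if absent."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    text = html[start:end].strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical page fragment; fetch_source is not called so the sketch runs offline.
page = '<td id="gold">1,520.00</td><td id="silver">33.50</td>'
print(extract_rate(page, '<td id="gold">'))    # 1520.0
print(extract_rate(page, '<td id="silver">'))  # 33.5
```

Each extracted value would then be compared with the value stored from the previous
retrieval to decide whether a change should be reported.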
Arti-Q DWPD
Duration          No. of Variants   Duplicates Detected
1-30 minutes      62                0
31-60 minutes     57                0
61-90 minutes     58                0
91-120 minutes    49                0
Figure: Evaluated Arti-Q DWPD (number of variants on the y-axis against time on the x-axis:
62 for 1-30 minutes, 57 for 31-60 minutes, 58 for 61-90 minutes, 49 for 91-120 minutes)
From the Arti-Q DWPD table above, in the first 1-30 minutes 62 variants were detected with
no duplicates identified; similarly, in the next 31-60 minutes, 57 variants were detected
with no duplicates identified. In the next 61-90 minutes, 58 variants were detected with no
duplicates identified, and in the last 91-120 minutes, 49 variants were detected with no
duplicates identified.
COMPARISON OF VARIANTS DETECTED BETWEEN EXISTING AND PROPOSED
Duration          CasePer (No. of Variants)   Arti-Q (No. of Variants)
1-30 minutes      75                          62
31-60 minutes     68                          57
61-90 minutes     79                          58
91-120 minutes    65                          49
The experimental results show that the proposed approach provides a higher performance
rate compared with conventional methods. The timer threshold elapses every 0.5 seconds, and
the rate changes are listed at each 0.5-second interval.
Figure: Threshold based Silver rate variance; Threshold based Gold rate variance
Figure: Threshold based Palladium rate variance; Threshold based Crude rate variance;
Threshold based Platinum rate variance
CONCLUSION AND FUTURE WORK
This research work proposed Arti-Q based dynamic web page change detection for a selected
zone, which helps to reduce browsing time. The web page change detection system uses a
generalized technique to detect changes within the selected zone. It detects changes in
text and attributes at the minute level and shows the changed text in red colour. This
saves the time of detecting changes in the entire web page. From the analysis of various
research papers, it is concluded that using a web page change detection system saves the
user's browsing time. Every algorithm plays some role in improving the performance of the
previous algorithm. Detecting changes in the structure, content, presentation and behaviour
of a web page helps the user to learn about updated information. The algorithm for checking
page update parameters is designed to show even the smallest change in a web page, and the
values stored for the parameters are distinct for small details as well. As future work,
various performance parameters such as running time, complexity of computation, number of
nodes, depth of tree and load balancing of user requests for notification in various
algorithms can be measured and improved. The algorithms can also be implemented for dynamic
web pages or password-protected web sites.
REFERENCES
[1]. D. Hirschberg, "Algorithms for the longest common subsequence problem," Journal of
the ACM, vol. 24, no. 4, pp. 664-675, 1977.
[2]. F. Douglis, et al., "The AT&T Internet Difference Engine: Tracking and Viewing
Changes on the Web," WWW, vol. 1, no. 1, pp. 27-44, Jan. 1998.
[3]. S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine,"
Computer Networks and ISDN Systems, vol. 30, nos. 1-7, pp. 107-117, 1998.
[4]. Heydon and Najork, "Mercator: A scalable, extensible Web crawler," World Wide Web,
vol. 2, no. 4, pp. 219-229, 1999.
[5]. L. Liu, C. Pu, and W. Tang, "WebCQ - Detecting and Delivering Information Changes
on the Web," Proc. Ninth Int'l Conf. Information and Knowledge Management, pp. 512-519,
2000.
[6]. G. Cobena, S. Abiteboul, and A. Marian, "Detecting Changes in XML Documents,"
Proc. 18th Int'l Conf. Data Eng., pp. 41-52, 2002.
[7]. S. Flesca, E. Masciari, "Efficient and effective web change detection," Data and
Knowledge Engineering, vol. 46, no. 2, pp. 203-224, 2003.
[8]. Song, Bhowmick, "BioDIFF An Effective Fast Change Detection for Genomic and
Proteomic Data," in Proceedings of the Thirteenth ACM Conference on Information and
Knowledge Management, pp. 146-147, Nov. 2004.
[9]. S. Chakravarthy, Subramanian, "Automating change detection and notification of web
pages," Proc. 17th Int'l Conf. on DEXA, IEEE, 2006.
[10]. D. Yadav, A.K. Sharma, J.P. Gupta, "Change Detection In Web page," in Proceedings of
the 10th International Conference on Information Technology, pp. 265-270, 2007.
[11]. I. Khoury, R. M. El-Mawas, O. ElRawas, Elias F. Mounayar, and H. Artail, "An
Efficient Web Page Change Detection System Based on an Optimized Hungarian," IEEE
Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 599-612, May 2007.
[12]. Varshney Naveen Kumar and Dilip Kumar Sharma, "A Novel Architecture and
Algorithm for Web Page Change Detection," IEEE IACC, pp. 782-787, 2013.
[13]. K.S. Kuppusamy, G. Aghila, "CaSePer: An efficient model for personalized web page
change detection based on segmentation," Elsevier, Department of Computer Science,
School of Engineering and Technology, Pondicherry University, Pondicherry, India (2014)
26, 19-27.
R. Kousalya, M.C.A., M.Phil., (Ph.D), Assistant Professor, Head, Department of Computer
Application, Dr.N.G.P Arts and Science College, Coimbatore, India. E-mail id:
[email protected]
J. Rubana Priyanga, M.Phil Research Scholar, Dr.N.G.P Arts and Science College,
Coimbatore, India. E-mail id: [email protected]