Download International Journal of Computer Science and Intelligent

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
International Journal of Research in Computer and ISSN (Online) 2278- 5841
Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156
A Review of Various Techniques of Web Content Mining For
HTML and XML Contents
Rupinder Kaur1
Kamaljit Kaur2
Student of M.Tech
Department of Computer Science Engineering
Sri Guru Granth Sahib World University
Fatehgarh Sahib, Punjab, India
Assistant Professor
Department of Computer Science Engineering
Sri Guru Granth Sahib World University
Fatehgarh Sahib, Punjab, India
Abstract— World Wide Web is the largest source of
information. Most of the data on the web is dynamic and is
in unstructured form. It is becoming difficult to get the
relevant data from the web. Data Mining is the field of
computer science which is used to extract knowledge from
very large amount of data. Web mining is the application
of data mining, which implements various techniques of
data mining to get the efficient knowledge from the web
data. This paper presents an overview of various
techniques that has been used for web content mining
including images, audio, video and semi-structured
contents like HTML and XML. Since HTML has many
limitations like limited tags, not case sensitive and
designed to display data only, Web developers has started
to develop Web pages on emerging Web Technologies like
XML, Flash etc. XML was designed to describe data and
to focus on what the data is. XML also plays the role of a
meta- language and allows document authors to create
customized markup language for limitless different types
of documents, making it a standard data format for online
data exchange.
efficiently to fetch web data. It is very wide area of
research for the researchers because of the growing use of
the web. Moreover, most of the data on the web is in
unstructured form i.e. in the form of free text, images,
audio, video and semi-structured documents like HTML
and XML etc. Since HTML has many limitations like
limited tags, not case sensitive and designed to display
data only, Web developers has started to develop Web
pages on emerging Web Technologies like XML, Flash
etc. [4] XML was designed to describe data and to focus
on what the data is. XML also plays the role of a meta
language and allows document authors to create
customized markup language for limitless different types
of documents, making it a standard data format for online
data exchange. This growing use has raised need for better
tools and techniques to perform mining on XML too. [3]
This paper reviews the various techniques that have been
used to mine the XML and HTML data on the web.
[email protected]
Keywords - Data Mining, Web Mining, Web Content
Mining, XML, Vector Space Model, FP-tree.
a)
I.
INTRODUCTION
Web is the largest source of information. As the
number of documents grows, searching for information is b)
turning into a cumbersome and time consuming operation.
Data Mining is the field of computer science which is used
to extract information from the large amount of data. Web
Mining is the application of data mining which is used to
generate patterns from the web. Patterns must be such that
they are easily understandable, useful and novel. Various c)
techniques of data mining are used to extract the
information from the web. Not only data mining but also
other tools from fields of artificial intelligence, machine
learning, natural language processing can also be used
www.ijrcct.org
[email protected]
Web Mining on the basis of type of data to be explore
can
be
classified
into
three
main categoriesWeb Content Mining
Web content mining is the process of mining of contents
of a web page. Contents of a web page may include text,
image, audio, video and semi-structured records in the
form of html and xml contents which are either embedded
in the web page or having links to other pages.
Web Structure Mining
The Web page structure consists of a Web page as a node
and hyperlinks as edges connecting to other pages. [6] In
other words, it works in the form of graphs. It focuses on
the connectivity of the web site to other sites that are
called as hyperlinks.
Web Usage Mining
It is the mining of the user preferences while user is
navigating through the websites. This is done by applying
the mining process on the log files repository. [6] By web
usage mining, commercial websites take advantage of
Page 669
International Journal of Research in Computer and ISSN (Online) 2278- 5841
Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156
knowing the usage pattern of customer, their behaviour
and frequency of their visits.
II.
WEB CONTENT MINING
Web content mining is the mining, extraction and
integration of useful data, information and knowledge from
Web page contents. Content mining is the scanning and
mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query.
With the massive amount of information that is available on
the World Wide Web, content mining provides the results lists
to search engines in order of highest relevance to the
keywords in the query retrieval view. [10] The use of the Web
as a provider of information is unfortunately more complex
than working with static databases. Because of its very
dynamic nature and its vast number of documents, there is a
need for new solutions that are not depending on accessing the
complete data on the outset. Another important aspect is the
presentation of query results. Due to its enormous size, a web
query can retrieve thousands of resulting webpages. Thus
meaningful methods for presenting these large results are
necessary to help a user to select the most interesting content.
Various web contents that can be mined are text, image,
audio, video and structured or semi-structured records. For the
semi-structured records all the work utilizes the HTML and
XML structure of the web page. These are explained as
belowA. Animation
These can be created using GIF images or using Flash, Ajax,
or other animation tools.
B. Images
Images are the most common way to add multimedia to
websites. One can use photos or clip art or even drawings.
C. Sound
Sound is embedded in a Web page so that readers hear it when
they enter the site or when they click a link to turn it on.
D. Video
Video is getting more and more popular on Web pages. But it
can be challenging to add a video so that it works reliably
across different browsers. [13]
E. Hyper Text Markup Language
Hyper Text Markup Language by Berners-Lee in mid-1993 is
still the dominant way whereby we share content. And while
there are many web pages with localized proprietary structure
(most usually, business websites), many millions of websites
abound that are structured according to a common core idea of
HTML. Blogs are a type of website that contains mainly web
pages authored in html. Search engine sites are composed
mainly of html content, but also have a typically structured
approach to revealing information. Discussion boards are sites
composed of "textual" content organized by html or some
variation that can be viewed in a web browser. Ecommerce
sites are largely composed of textual material and embedded
www.ijrcct.org
with graphics displaying a picture of the item(s) for sale.
However, there are extremely few sites that are composed
page-by-page using some variant of HTML. [14]
F. Xtensible Markup Language
In an attempt to standardize the format of data exchanged over
the Web and to achieve interoperability between the different
technologies and tools involved, World Wide Web
Consortium (W3C) introduced eXtensible Markup Language
(XML). XML is a simple but very flexible text format derived
from Standard Generalized Markup Language (SGML), and
has been playing an increasingly important role in the
exchange of wide variety of data over the Web. Even though it
is a markup language much like the Hyper Text Markup
Language (HTML), XML was designed to describe data and
to focus on what the data is, whereas HTML was designed to
display data and to focus on how the data looks on the Web
browser. [3] A data object described in XML is called an
XML Document. XML also plays the role of a metalanguage, and allows document authors to create customized
markup languages for limitless different types of documents,
making it a standard data format for online data exchange.
This growing usage of XML has naturally resulted in the
increasing amount of available XML data which raises the
pressing need for more suitable tools and techniques to
perform data mining on XML documents and XML
repositories.
III.
RELATED WORK
Various techniques have been used to fetch the web data
till now. Besides semi-structured contents of a web page,
the unstructured contents like images, audio, videos etc.
are fetched using different algorithms. These are reviewed
in [12] with their various pros and cons. Review of
techniques that have been used for XML content mining
are explained as under:
A. XML Algebras for Data Mining
The prerequisite of designing XML mining algebras is
that all generalized knowledge must be expressed in XML,
i.e. input and output of operators must have the same form.
Extension of existing XML algebras by adding some mining
operators might work well as a mining algebra. Each operator
is associated with an algorithm for the mining task. [5] A
further development is to decompose mining tasks into several
sub-tasks. These sub-tasks are shared among different mining
tasks. All operators should have the following properties:
1. The operation of each operator should be clearly defined
and easily implemented. 2. The input and output of each
operator should be in the same form.
The ultimate goal of XML algebras is to provide operational
procedure to obtain answers of user’s queries.
Advantages and Disadvantages: How to decompose the
mining tasks into sub-tasks is still an open problem, and it is
largely dependent on the nature of the mining tasks.
Page 670
International Journal of Research in Computer and ISSN (Online) 2278- 5841
Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156
B. Vector Space Model
The VSM is a special type of kernel methods for textual
data. It transforms a document into a numerical vector, each
element of which is an indicator whether or not the document
contains a certain word (or phrase). A VSM can be
represented either in a boolean vector model, or in a term
weighting model, which has scalar values that take into
account the appearance frequency of a term in a set of
documents. [5] A widely used weighting method is ’Term
Frequency-Inverse Document Frequency (TF-IDF)’, in which
the weight of the ith term in the jth document, denoted as wij,
is defined by wij = TFij × IDFi = TFij × log (n/DFi), where
TFij is the number of occurrences of the ith term within the jth
document, and DFi is the number of documents (out of n) in
which the term appears.
Advantages and Disadvantages: It requires too much
computation for real world documents.
C. Kernel-based XML Mining
The kernel methods for structured data are applicable to
XML mining in various ways. The string kernels with minor
modifications are used to compare two pieces of context
information (i.e., successive element labels in a path).
Undoubtedly, the VSM also provides the same state of-theart performance in XML content mining as in traditional text
mining. The tree kernels are used to measure the structural
similarity between XML schema documents and between
XML instance documents. By transforming the tree structure
into a string (e.g., by a depth-first traversal), the string
kernels and the VSM are also used to compute the structural
similarity.
Advantages and Disadvantages: The kernels are used to
measure the structural similarity between XML schema
documents and between XML instance documents by
transforming the tree structure into a string. Since kernels are
mainly developed for bio/chemistry -informatics, new variant
kernels for texts (and XML contents) are required.
D. First FP-Growth Algorithm
In [5] Nassem et al. constructs an FP-tree structure and
mines frequent patterns by traversing the constructed FP-tree.
The FP-tree structure is an extended prefix-tree structure
involving crucial condensed information of frequent patterns.
In addition, some features have been suggested that need to be
added into XQuery in order to make the implementation of the
First Frequent Pattern growth more efficient. In future work of
paper it was planned to implement other standard data mining
algorithms which can be expressed in XQuery to improve the
performance of the results.
Advantages and Disadvantages: This method is advantageous
because, it doesn’t generate any candidate items. It is
disadvantageous because, it suffers from the issues of special
and temporal locality issues.
www.ijrcct.org
Following are experimental results of the parameters that are
calculated for FP-Growth Algorithm when executed on
HTML and XML contentsTABLE 1
S.No
URLs
Time
Precision
Recall
Taken
FMeasure
(ms)
1
https://in.yahoo.
2
http://www.ama
3
https://www.fac
4
http://www.eba
5
http://in.msn.co
6
http://www.hom
7
http://www.flip
8
http://adaptive-
9
http://tanishq.co
10
http://www.natu
7000
0.0208
0.247
0.7422
7000
0.1818
0.131
0.395
4000
0.2
0.0769
0.230
6000
0.0256
0.0763
0.2289
7000
0.0232
0.1575
0.4725
6000
0.0769
0.2363
0.7090
5000
0.0181
0.0516
0.155
1000
0.0714
0.3043
0.9130
7000
0.0909
0.0973
0.2920
6000
0.0303
0.2773
0.8319
com/?p=us
zon.in/
ebook.com/
y.in/
m/
eshop18.com/
kart.com/
images.com/
.in/Home
rewallpaper.net/
Results of FP-Tree Algorithm
E. Canonical Tree (CAN Tree)
It is a tree structure that arranges or orders the nodes of a
tree in some canonical order. It follows a tree-based
incremental mining approach. Like FP-tree approach, there is
no need to rescan the transactional database when it is
updated. Because of following the canonical order, frequency
changes (if any) due to incremental updates like insertion,
deletion and modification of the transactions will not affect
the ordering of the nodes in CAN tree. [8] After constructing
the CAN tree, the frequent patterns can be mined from the
tree. After constructing the CAN-tree, we have to mine the
frequent patterns by traversing the tree in an upward direction.
This can be done similar to FP-growth by constructing a
header table and finding only the frequent items.
Advantages and Disadvantages: This method is advantageous
because it supports incremental updates without any major
changes in the tree.
Page 671
International Journal of Research in Computer and ISSN (Online) 2278- 5841
Communication Technology, Vol 3, Issue 6, June- 2014 ISSN (Print) 2320- 5156
F. COFI-Tree
COFI is much faster than FP-Growth and requires
significantly less memory. The idea of COFI is to build
projections from the FP-tree each corresponding to sub
transactions of items co-occurring with a given frequent item.
These trees are built and efficiently mined one at a time
making the footprint in memory significantly small. The COFI
algorithm generates candidates using a top down approach,
where its performance shows to be
severely affected while mining databases that has potentially
long candidate patterns that turns to be not frequent, as COFI
needs to generate candidate sub patterns for all its candidates
patterns and also build upon the COFI approach to find the set
of frequent patterns but after avoiding generating useless
candidates. [8] The basic idea of algorithm is based on the
notion of maximal frequent patterns. A frequent item set X is
said to be maximal if there is no frequent item set X’ such that
X€ X’.
Advantages and Disadvantages: This method is advantageous
because it requires only one scan of the database and the trees
are ordered according to their local frequency in the paths. It
is disadvantageous because lot of computation is required to
build the tree.
G. Improved Frequent Pattern Tree Algorithm
In [9] Mishra et al. have proposed an improved technique
that extract association rules from XML documents without
any preprocessing or post processing. The proposed improved
algorithm for mining the complete set of frequent patterns
used patterns fragment growth. First Frequent Pattern-tree
based mining adopts a pattern fragment growth method to
avoid the costly generation of large number of candidate sets
and a partition based, divide-and-conquer method is used.
They proposed an association data mining tool for XML data
mining. It will increases the mining efficiency and also takes
less memory.
Advantages and Disadvantages: It recovers the weakness of
some traditional data mining algorithms. It is difficult to
implement for complex structures of XML.
IV.
CONCLUSION
Web Mining is the application of data mining which
aims to extract knowledge from the web data. As advanced
technology is emerging, web sites are now developing
with newer and safer methods like with XML, flash
techniques. In this paper various techniques for XML and
HTML mining are reviewed. Some of these techniques are
executed and results are also provided. As web is growing
rapidly and new advancements are coming on daily basis,
there is need of reliable tools in order to fetch web data.
www.ijrcct.org
Most of the mining work is done on HTML contents.
However as XML is a new technology and its usage is
increasing, it is need of the time to build efficient methods
for XML contents mining too.
ACKNOWLEDGMENT
As a part of my course I have reviewed “A Review of
Various Techniques of Web Content Mining for HTML and
XML Contents”. I am very thankful to Ms. Kamaljit Kaur,
Assistant Professor, Sri Guru Granth Sahib World
University, Fatehgarh Sahib for giving me such a valuable
support in doing my work. Last but not least, a word of
thanks for the authors of all those books and papers which
I have consulted during my work.
REFERENCES
[1] O. Etzioni, "The world wide Web: Quagmire or gold mine.”
Communications of the ACM, Vol. 39 No. 11, pp. 65-68, Nov. 1996.
[2] R. Kosala, H. Blockeel “Web mining research: A survey,” ACM
SIGKDD Explorations, Vol. 2 No. 1, pp. 1-15, June 2000.
[3] Qin Ding and Gnanasekaran Sundarraj, “Association Rule Mining
from XML Data”, Conference on Data Mining | DMIN'06 |.
[4] Krishna Murthy. A, Suresha, “XML: URL Data Set Creation for
Future Web Mining Research Avenues”, International Journal of
Computer Applications (0975 – 8887), 2011.
[5] Mohammad. Nassem, Prof. Sirish Mohan Dubey, “First Frequent
Pattern-tree based XML pattern fragment growth method for Web
Contents”, International Journal of Advanced Research in Computer
Science and Software Engineering, Volume 2, Issue 1, January 2012.
[6] Ganesh Dhar, Govind Murari Upadhyay, “Web Mining: Concepts and
Decision Making Aid”. Volume 2, Issue 7, July 2012.
[7] Darshna Navadiya, Roshni Patel, “Web Content Mining TechniquesA Comprehensive Survey”, IJERT, Vol. 1 Issue 10, December- 2012.
[8] Amit Kumar Mishra, Hitesh Gupta, “A Recent Review on XML data
mining and FFP”, International Journal of Engineering Research,
Volume No.2, Issue No.1, pp : 07-12
[9] Amit Kumar Mishra, Hitesh Gupta, “A Novel FP-Tree Algorithm for
Large XML Data Mining”, International Journal of Engineering
Sciences and Research Technology, [Mishra, 2(3): March, 2013].
[10]Abdelhakim Herrouz, Chabane Khentout Mahieddine Djoudi,
“Overview of Web Content Mining Tools”, The International Journal
of Engineering And Science, Volume 2, Issue 6, pp. 106-110, June
2013.
[11] Buhwan Jeong, Daewon Lee, Jaewook Lee, and Hyunbo Cho, “Towards
XML Mining: The Role of Kernel Methods”.
[12] D. Jayalatchumy, Dr. P.Thambidurai, “Web Mining Research Issues and
Futur Directions – A Survey”, IOSR Journal of ComputerEngineering
(IOSR-JCE), Volume 14, Issue 3 (Sep. - Oct. 2013), pp. 20-27.
[13] http://en.wikipedia.org/wiki/Web_content.
[14] http://webdesign.about.com/od/content/qt/what-is-web-content.htm.
Page 672