Volume 2, Issue 2, March-April 2012
Available Online at www.gpublication.com/jcer
ISSN No.: 2250 - 2637
©Genxcellence Publication 2011, All Rights Reserved
RESEARCH PAPER
Web Mining: An Introductory approach to Increase the Performance of Web
Site Using Mined pattern
Arun Kumar Singh*1 and Manish Kumar Singh2
1
Associate Professor IIMT Engineering College Meerut (U.P.) India
[email protected]
2
Assistant Professor IIMT Institute of Engineering & Technology Meerut (U.P.) India
[email protected]
Abstract
The World Wide Web contains a large amount of information, and anyone can store information on it and retrieve information
from it. Finding the relevant piece of information on the web is nevertheless difficult. Extracting important information from the
web is called Web Mining. Web mining technologies are well suited to web information extraction and information retrieval. Web
mining is one of the mining technologies; it applies data mining techniques to large amounts of web data to improve web services.
We give a brief description of web mining and its categorization, namely web content mining, web structure mining
and web usage mining. This paper also reports on web data mining and its applications.
Keywords
Web Mining, Information Extraction, Information Retrieval, Web content mining, Web structure mining, Web usage mining and
Web crawling
INTRODUCTION
The World Wide Web is a popular and interactive medium
to disseminate information today. With the explosive
growth of information sources available on the World
Wide Web, it has become increasingly necessary for users
to utilize automated tools in order to find, extract, filter,
and evaluate the desired information and resources. The
World Wide Web provides a vast source of information of
almost all types, ranging from DNA databases to resumes
to lists of popular multiplexes. The Web holds a large amount of
data, and it is not an easy task to find the content or
information of interest. Web mining is one of the
techniques for solving this kind of problem. It is not
the only one; a number of other techniques exist,
such as Machine Learning and Natural Language
Processing.
Due to the large amount of data available on the World Wide Web,
it has become very important for users to use automated
tools to find the desired information resources.
Information Retrieval is the automatic retrieval of all
relevant documents while at the same time retrieving as
few of the non-relevant documents as possible. Information
extraction aims to extract relevant facts from the documents, while
information retrieval aims to select relevant documents [1].
As shown in Figure (1), YAHOO, GOOGLE and MSN are
search engines used to extract information from the web.
The extracted information may be relevant, but it may also
contain less relevant and sometimes irrelevant
information.
WEB MINING
Web mining is the application of data mining techniques
to extract useful information and knowledge from web
data, including web documents, hyperlinks between
documents, usage logs of web sites, etc. to improve the
web services [2]. Web mining refers to the overall process
of discovering potentially useful and previously unknown
information or knowledge from the Web data. A natural
combination of Data Mining and World Wide Web may be
referred to as Web Mining.
Figure (1)
Web mining is the Data Mining technique that
automatically discovers or extracts the information from
web documents. It consists of following tasks [3]:
• Resource finding: It involves the task of retrieving
intended web documents. It is the process by which we
extract the data either from online or offline text resources
available on web. It includes information retrieval and
extraction from web pages.
• Information selection and pre-processing: It involves the
automatic selection and pre-processing of specific
information from retrieved web resources. This process
transforms the original retrieved data into information.
Making web data suitable for mining is preprocessing.
Please Cite this Article at: Arun Kumar Singh et al, Journal of Current Engineering Research, 2 (2),March-April 2012,09-12
• Generalization: It automatically discovers general patterns
at individual web sites as well as across multiple sites.
Data mining techniques and machine learning are used in
generalization. The result displayed in a web search is an
aggregation of multiple web documents.
• Analysis: It involves the validation and interpretation of
the mined patterns. It plays an important role in pattern
mining. A human plays an important role in the information
or knowledge discovery process on the web.
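The four tasks can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the in-memory PAGES dictionary stands in for real retrieval, the stopword list is arbitrary, and "generalization" is reduced to counting the terms shared across documents, which is far simpler than the data mining and machine learning techniques mentioned above.

```python
import re
from collections import Counter

# Hypothetical in-memory "web": URL -> raw HTML (stands in for real retrieval).
PAGES = {
    "http://example.org/a": "<html><body><h1>Web mining</h1>"
                            "<p>Web mining applies data mining to web data.</p></body></html>",
    "http://example.org/b": "<html><body><p>Data mining finds patterns "
                            "in large data sets.</p></body></html>",
}

STOPWORDS = {"to", "in", "the", "a", "of"}  # arbitrary illustrative list


def find_resources(urls):
    """Resource finding: retrieve the intended documents."""
    return {url: PAGES[url] for url in urls if url in PAGES}


def preprocess(html):
    """Selection and pre-processing: strip tags, lowercase, tokenize, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def generalize(token_lists):
    """Generalization: count in how many documents each term occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts


def analyze(counts, min_docs):
    """Analysis: keep only the terms found in at least min_docs documents."""
    return sorted(term for term, n in counts.items() if n >= min_docs)


docs = find_resources(["http://example.org/a", "http://example.org/b"])
patterns = analyze(generalize(preprocess(h) for h in docs.values()), min_docs=2)
print(patterns)
```

In practice the analysis step would present such candidate patterns to a human for validation and interpretation rather than apply a fixed threshold.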
Figure (2)
The structure of a website can be understood from Figure (2).
A website is a collection of related web
pages containing images, videos or other digital assets. In
Figure (2), A, B, C, D and E are different pages of a website.
If hyperlinks are available, we can easily
move between pages. Website structure is
also important in Web mining: if the structure is complex, it
will take more time to move between pages.
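The cost of moving between pages can be made concrete by counting the minimum number of hyperlink clicks between two pages. The sketch below uses breadth-first search for this. The link structure in SITE is hypothetical, since the actual links of Figure (2) are not reproduced here, and click_distance is an illustrative name.

```python
from collections import deque

# Hypothetical link structure for pages A-E (not the actual links of Figure (2)).
SITE = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def click_distance(site, start, goal):
    """Return the minimum number of clicks from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in site.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(click_distance(SITE, "A", "E"))  # two clicks: A -> C -> E
print(click_distance(SITE, "D", "A"))  # None: D has no outgoing links
```

A site whose pages are many clicks apart, or not reachable at all, is harder to navigate, which is one reason structure matters when mining a site.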
WEB MINING CATEGORIZATION
Web Mining can be broadly divided into three distinct
categories, according to the kinds of data to be mined as
shown in Figure (3).
1. Web content mining
2. Web structure mining
3. Web usage mining
In the database view of Web content mining, we are
interested in the structure within the web documents
(intra- document structure), while in Web structure mining
we are interested in the structure of the hyperlinks within
the web itself (inter-document structure). Web content
mining is mainly concerned with the content (mostly text)
and its structure as extracted from web pages. Web structure mining
mainly deals with the structure of the web, i.e. hyperlink
structure and link analysis. Web usage mining discovers
interesting usage patterns from Web data, in order to
understand and better serve the needs of Web-based
applications.
Web content mining
Web pages consist of different types of data attributes,
such as text, image, audio, video, meta-data, hyperlinks
and others. Web content mining is the technology that
discovers web characteristics and properties from various
data types and attribute values. Web content mining
focuses on the discovery of knowledge from the content of
web pages. When retrieving information, web users
try to find only the web pages that interest them among the
huge number of available pages. Current search tools suffer
from low precision, because many of the search results are
irrelevant [4], and from low recall, because they cannot index
all the information available on the web. Search engines are
not able to index all pages, which, together with information
overload, results in imprecise and incomplete searches.
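Precision and recall can be computed directly from the set of retrieved pages and the set of relevant pages: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that were retrieved. The following is a minimal sketch; the page identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 5 pages actually relevant, 2 in common.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p5", "p6", "p7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the returned pages are irrelevant (low precision) and three relevant pages were never returned, for example because they were not indexed (low recall).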
Data mining and knowledge discovery in web pages and
semi-structured documents have been extensively studied by
many researchers, and web/text mining targets are
classified into three major application categories:
1. Unstructured documents
2. Semi-structured documents
3. Structured documents
Documents written in HTML are semi-structured and
those written in XML are structured, while plain text documents are
unstructured in nature. It is difficult to discover appropriate
knowledge from image and multimedia web contents.
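Because HTML is semi-structured, its tags can be used to separate the text content of a page from its hyperlinks. The sketch below does this with the HTMLParser class from the Python standard library. The class name ContentExtractor and the sample page are assumptions for illustration; real pages need more care (scripts, styles, malformed markup).

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect visible text fragments and hyperlink targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Hypothetical sample page.
SAMPLE = ('<html><body><h1>Web Mining</h1>'
          '<p>See <a href="/usage.html">usage mining</a>.</p></body></html>')

extractor = ContentExtractor()
extractor.feed(SAMPLE)
extractor.close()
print(extractor.links)
print(extractor.text_parts)
```

The extracted text can feed content mining, while the collected links are the raw material for structure mining.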
Figure (3)
Web structure mining
Web structure mining is an approach based on directory
structures and web graph structures of hyperlinks [5]. Web
structure mining is closely related to analyzing hyperlinks
and link structure on the web for information retrieval and
knowledge discovery. Web structure mining deals with the
connectivity of websites and the extraction of knowledge
from hyperlinks of the web. The WWW can be viewed
formally as a digraph with Web nodes and arcs, where the
Web nodes correspond to HTML files having page
contents and the arcs correspond to the hypertext links
that interconnect the Web pages.
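The digraph view translates directly into a data structure. The sketch below stores each hyperlink as an arc (source page, target page) and derives the outgoing and incoming links of every page. The file names and arcs are hypothetical.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source page, target page) arcs.
ARCS = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
]


def build_digraph(arcs):
    """Return (out_links, in_links), each mapping a page to a set of pages."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in arcs:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links


out_links, in_links = build_digraph(ARCS)
pages = sorted(set(out_links) | set(in_links))
in_degree = {page: len(in_links[page]) for page in pages}
print(in_degree)
```

The in-degree of a page (how many pages point to it) is the simplest link-based signal; link analysis algorithms refine this idea.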
There are two major link-based search algorithms, HITS
(Hypertext Induced Topic Search) and PageRank.
HITS uses the concept of hubs and authorities. Hubs are pages
that contain a set of hyperlinks. Authorities are pages that
contain useful information about the query. Both types of
pages are typically connected: good hubs contain pointers
to many good authorities, and good authorities are pointed
to by many good hubs. The main drawback of this
algorithm is that the hub and authority scores must be
computed iteratively from the query result, which does not
meet the real-time constraints of an on-line search engine,
so it could not be implemented in a real-time search engine
[6].
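The iterative computation can be sketched as follows. This is a minimal illustration of the mutual reinforcement between hubs and authorities, not a production implementation: the graph is a hypothetical query result, the function name hits and the fixed iteration count are assumptions, and scores are normalized to unit length after each step so they stay bounded.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages pointing to a page.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of the pages a page points to.
        hub = {p: sum(auth[d] for d in graph.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical query result: three hub-like pages and two authority-like pages.
GRAPH = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hits(GRAPH)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # page with the highest hub score
```

The loop has to be run on the pages returned for each query, which is the per-query cost behind the real-time drawback described above.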
PageRank and Weighted PageRank were proposed to
overcome this problem of HITS.
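Unlike HITS, PageRank scores can be computed once for the whole link graph, independently of any query. The sketch below shows plain PageRank (not Weighted PageRank) by power iteration. The damping factor of 0.85, the uniform treatment of pages without outgoing links, the iteration count and the sample graph are all assumptions for illustration; none of them come from this paper.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return a dict of PageRank scores that sum to 1.

    graph maps each page to the list of pages it links to.
    """
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank in equal shares to the pages it links to.
        for src in pages:
            targets = graph.get(src, [])
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(GRAPH)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this graph page C receives links from three pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.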
Web usage mining
Web usage mining focuses on techniques that could
predict the behavior of users while they are interacting
with the WWW. Web usage mining collects data from
Web log records to discover user access patterns of Web
pages. It can discover the browsing patterns of users and
correlations between web pages. Web
usage mining provides support for web site design.
Typical sources of data in web usage mining are
automatically generated data stored in server access logs,
referrer logs, agent logs, client-side cookies and user
profiles.
Web usage mining mines data from the log records of web
pages. Log records contain a lot of useful information such as URL, IP
address, time and so on. Analyzing logs and discovering patterns in
them could help us to find more potential customers and
trace service quality and so on [7].
A Web usage mining system performs five major tasks
[8]:
i) Data gathering
ii) Data preparation
iii) Navigation pattern discovery
iv) Pattern analysis and visualization
v) Pattern applications
WEB CRAWLING
All search engines available on the internet need to
traverse web pages and their associated links, to copy
them and to index them into a web database. The programs
associated with page traversal and recovery are called
crawlers [9]. The crawling task is divided into two major
subtasks corresponding to two different levels of
abstraction [10]:
Internal crawler
An internal crawler views the web pages of a single given
website and performs focused (page) crawling within that
website. The web pages generated by internal crawlers
are more reliable [11].
External crawler
An external crawler has a more abstract view of the web as
a graph of linked websites. Its task is to select the websites
to be examined next and to invoke internal crawlers on the
selected sites. The external crawler orders unknown websites
and initiates an internal crawl on the first page of the
website only.
APPLICATIONS
Web mining can be used in a number of applications. The Google
search engine uses the PageRank algorithm to rank its
collection of web pages. Some web mining
applications are the following:
i) Web Search- Google
ii) Web Mining in Distance Education Platform
iii) Terror tracking [12]
iv) On-line communities (AOL)
v) Personalized Portal for the Web – My Yahoo
CONCLUSION
We provided a survey of web mining, focusing on current
web mining research. In this paper, we presented a
preliminary discussion of Web mining, including its
definition and classification.
We also discussed web structure mining
techniques. We have seen that the crawler is an important
program associated with every search engine; the main
decisions associated with crawler algorithms are when
to visit a site again (to see if a page has changed) and
when to visit new sites that have been discovered through
links.
REFERENCES
[1] Raymond Kosala, Hendrik Blockeel, Web Mining Research:
A Survey, SIGKDD Explorations, ACM SIGKDD, July
2000.
[2] O. Etzioni, The World Wide Web: Quagmire or Gold Mine?
Communications of the ACM, 39(11):65-68, 1996.
[3] Rekha Jain, Dr. G. N. Purohit, Page Ranking Algorithms for
Web Mining, International Journal of Computer
Applications (0975 – 8887) Volume 13– No.5, January
2011.
[4] Masashi Toyoda, Masaru Kitsuregawa, What’s Really New on
the Web? Identifying New Pages from a Series of Unstable
Web Snapshots, WWW 2006, May 23–26, 2006,
Edinburgh, Scotland. ACM 1-59593-323-9/06/0005.
[5] Hiroyuki Kawano, Yoshida Hommachi Sakyo-ku,
Applications of web mining – from web search engine to
P2P filtering.
[6] Johannes Fürnkranz, WEB MINING, TU Darmstadt,
Knowledge Engineering Group
[7] Hong T, Chiang M, Wang S H, "Mining weighted browsing
patterns with linguistic minimum supports", 2002 IEEE
International Conference on Systems, Man and Cybernetics,
2002, Yasmine Hammamet, Tunisia, pp. 635-639.
[8] Wen-Chen Hu, Xuli Zong, Chung-wei Lee and Jyh-haw
Yeh, World Wide Web Usage Mining Systems and
Technologies.
[9] Animesh Tripathy, Prashanta K Patra, A Web Mining
Architectural Model of Distributed Crawler for Internet
Searches Using PageRank Algorithm.
[10] Martin Ester, Hans-Peter Kriegel, Matthias Schubert,
Accurate and Efficient Crawling for Relevant Websites
[11] Satyajeet Nimgaonkar and Suryaprakash Duppala, A Survey
on Web Content Mining and extraction of Structured and
Semi structured data
[12] T. Anand, S. Padmapriya and E. Kirubakaran, Terror Tracking
Using Advanced Web Mining Perspective, IAMA 2009