International Journal of IT, Engineering and Applied Sciences Research (IJIEASR)
Volume 3, No. 2, February 2014
ISSN: 2319-4413
A Comparative Study of Various Issues in Web Mining
Monika Pathak, Assistant Professor, Multani Mal Modi College, Patiala, India
Sukhdev Singh, Assistant Professor, Multani Mal Modi College, Patiala, India
ABSTRACT
Due to the increasing amount of data available online, the World Wide Web is the most usable and
valuable way to retrieve data. However, it is difficult to find relevant information for a
particular application because data is available in different forms and formats. Web mining is one
of the most suitable data mining techniques for extracting useful information from the web. In this
paper, we discuss the web mining process and its general architecture. The paper also presents
elementary technical details of the major applications of web mining with suitable examples, and
explores the technical differences between data mining and web mining on the basis of
implementation issues. The objective of the study is to identify the issues and categorise them
according to implementation. The issues are categorised into three major categories: social issues,
technical issues and law enforcement issues. Further, we compare these categories to find out the
sensitivity of the issues. The paper concludes with a few suggestions for overcoming technical
issues related to the heterogeneous representation of data and for identifying future directions.
Keywords:
Web mining, Web Usage Mining, Web Content Mining,
Web Structure Mining, Clustering analysis, Pattern
analysis.
I. INTRODUCTION
An abundance of information is available over the internet, but it is hard to find relevant
information for a particular application. Web mining techniques are used to discover valuable
information from web resources. The web is the biggest and most widely used source for extracting
and searching information. It consists of billions of interconnected web pages. By following the
links between these web pages, we can find or share any type of information on the web. The web has
become a channel for business, online shopping, and sharing information and opinions with other
people from anywhere in the world; it acts as a virtual society. The actions and operations [2] on
the web depend upon the structure of hypertext, which allows web page users to link their documents
with other related documents. Web mining [3] allows users to extract useful or relevant information
from the internet, but it is a challenging task because the data is huge, continuously growing and
wide in coverage. Data is present in different forms such as audio, video, text, images, and
structured and unstructured data. Web mining is the data mining technique used to automatically
discover and extract useful information from the web. It provides many types of information, such
as web activity (activity tracking), the web graph [4] (links between pages and people) and web
content (data in web pages and documents). Web content mining extracts useful information from web
page [1] contents, and web usage mining maintains user access patterns from usage logs, that is, it
records the clicks made by every user. Web structure mining defines the structure of the web and
extracts useful information from the hyperlinks. The information provided by web mining also
depends upon multiple factors such as the size of the web, the number of web pages and the number
of websites. In terms of technical issues, web mining is a more complicated process than data
mining, and we emphasise this in the comparison between the two given in Table 1.

| Data mining | Web mining |
|---|---|
| Data mining is a process of extracting data from large databases such as Oracle, SQL Server, DB database etc. It is also known as knowledge discovery in databases. | Web mining is the application of traditional data mining techniques. It is a technique to collect data from various web resources such as HTML pages, XML etc. |
| Data mining techniques process a large amount of data from databases where data is available in structured form such as tables, files and views. In tabular form, data is stored in tables that are linked with each other through primary and foreign keys. | Web mining techniques process a larger amount of data than data mining techniques, and the data is available in heterogeneous form, such as data embedded in HTML and XML and other heterogeneous representations of data over the web. |
| Data is stored in a database managed by a database management system (DBMS), and the DBMS authorizes users to access the data. That is why data is private and secured. | Data is stored in the form of web pages and web files in the public domain. Data is public and not secured. Registration can be imposed on web resources, which then require authentication and complicate the mining process. |
| Traditional data mining has levels of explicit structure and representation. | Web mining tasks are performed on unstructured or semi-structured data from web pages. |
| The user needs access rights to read the data because data is private. | The user rarely requires access rights to read the data because data is public on web resources. |
| Data is stored in a database and has restricted access for users. Users cannot share their common interests because there is no interconnectivity between different databases. Different databases may have different schema definitions and cannot share data. | Over the web, data is globally available, and within an organization data is available over the intranet, which acts as a platform for sharing information among different users. With the help of hyperlinks, users can share common interests with each other. |

Table 1: Comparison of complications in data mining and web mining
The present study aims to identify the challenges encountered while performing web mining to
extract information from large web-based data, so that the processed information can be utilized
for a particular application such as product promotion, SSR etc.
II. WEB MINING
Web mining is a process of extracting useful information from web resources. Many organizations
and companies use web mining techniques to extract and share useful information for their business
development. Web mining also raises the question of the security of personal information available
on the internet. Different web mining techniques [5] are used to retrieve information from the web,
based on web content mining, web structure mining and web usage mining. Web mining is an
application of data mining techniques.
Web Mining Process: Web mining is a process of discovering facts from predefined data extracted
from web resources. The information of interest may vary according to the needs of particular
applications. In general, the web mining process [6] can be expressed through the following steps:
Find the resource: In this step the web resource is located and the source document is finalized.
Select type of information: The type of information is finalised so that the selection of
information can be automated when processing web resources.
Figure 1: Web Mining Process (Finding Web Resources → Select Type of Information → Generalize the
Information → Analyse the Information → Extract Required Information)
Generalize the information: The information is further processed to discover general patterns
across websites.
Analyse the information: The extracted information is validated and interpreted to find patterns.
The patterns are processed to represent information.
Extract the required information: The information is extracted after pattern matching and can be
filtered further according to the requirements of a particular application.
Web mining has many applications which help users to extract useful information and make suitable
decisions.
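As an illustration only, the five steps of the web mining process can be sketched in Python as
follows. The in-memory sample pages, the price pattern and all function names are assumptions made
for this sketch; they are not part of the study, and a real system would fetch pages from the web.

```python
import re

# Hypothetical in-memory "web": URL -> page text (stands in for fetched pages).
PAGES = {
    "http://example.org/a": "Laptop price 500 USD. Laptop offer today.",
    "http://example.org/b": "Phone price 300 USD. Contact us.",
    "http://example.org/about": "About our company.",
}


def find_resources(pages, keyword):
    """Step 1: locate the resources that mention the topic of interest."""
    return {url: text for url, text in pages.items() if keyword in text.lower()}


def select_information(text):
    """Step 2: select the type of information (here: product/price mentions)."""
    return re.findall(r"(\w+) price (\d+) USD", text)


def generalize(records):
    """Step 3: bring the raw matches into one common (product, price) pattern."""
    return [(product.lower(), int(price)) for product, price in records]


def analyse(records):
    """Step 4: validate the records (drop non-positive prices)."""
    return [record for record in records if record[1] > 0]


def extract(records, max_price):
    """Step 5: filter the result for the needs of a particular application."""
    return [record for record in records if record[1] <= max_price]


def run(pages, keyword, max_price):
    """Chain the five steps over every located resource, in URL order."""
    results = []
    for _url, text in sorted(find_resources(pages, keyword).items()):
        records = analyse(generalize(select_information(text)))
        results.extend(extract(records, max_price))
    return results


if __name__ == "__main__":
    print(run(PAGES, "price", 400))
```

Each step is kept as a separate function so that a different application can replace a single
step, for example the selection pattern, without touching the others.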
III. ARCHITECTURE OF WEB MINING
Web mining is divided into three basic categories based on the methods used to access information
over the web. Web resources are processed on the basis of content, structure and usage. The
following diagram shows the categories used in the architecture.
Web Content Mining (WCM): It is also known as text mining. Web content mining [7] is the process of
extracting useful and important information from web page contents. Content data may consist of
text, images, audio, video, and structured and unstructured tables. Web content mining is mostly
used in discovery and tracking, clustering of web pages [8] and classification of web pages. It is
used to gather, categorize and provide the best possible information available on the web.
In short, web content mining allows search engines to increase the flow of user clicks to the
websites and web pages that answer users' queries. It provides search engines with results ranked
by relevance and provides specific information to the user. It reduces the irrelevant information
encountered while navigating the web and provides higher-quality results to users.
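A minimal sketch of web content mining on a single page is given below, using only the Python
standard library. It keeps the visible text of a page, skips elements that usually carry no main
content, and counts the most frequent terms. The list of skipped tags, the sample page and the
function names are assumptions of this sketch rather than a method proposed in the paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects page text while ignoring elements that rarely hold main content."""

    SKIP = {"script", "style", "nav", "footer", "aside"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def main_text(html):
    """Return the text of the page outside the skipped elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def top_terms(text, n=3):
    """Return the n most frequent lower-cased words with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower())).most_common(n)


SAMPLE_HTML = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Web mining extracts web data.</p>"
    "<footer>Copyright</footer></body></html>"
)

if __name__ == "__main__":
    text = main_text(SAMPLE_HTML)
    print(text)
    print(top_terms(text, 1))
```

Term counts of this kind are only a starting point; clustering or classification of pages would
operate on such features.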
Figure 2: Web Mining Categorization
Web Structure Mining (WSM): Web structure mining [9] deals with web pages and the hyperlinks
connecting related pages. It analyses the structure of each page contained in a website and
represents the structure of web pages and hyperlinks. With the help of hyperlinks, users share
their interests with each other. Based on structural information, it is further divided into two
categories:
Hyperlinks: A hyperlink is a unit which connects a web page with other web pages. A hyperlink [10]
which connects to a different part of the same page is called an intra-document hyperlink. A
hyperlink which connects two different pages is called an inter-document hyperlink.
Structure of Documents: The information present on the web is available in a structured format and
is arranged in a hierarchical manner. Web mining techniques are used to discover useful knowledge
from the structure of links, or the topology of the hyperlinks, between web pages. This is useful
for determining the most accessed pages. It aims to analyse the way in which different web
documents are linked together.
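The two hyperlink types and the analysis of link structure can be illustrated with a short Python
sketch. It classifies a link as intra-document or inter-document by comparing URLs without their
fragment, and counts incoming links per page in a small link graph. The example URLs, the graph and
the function names are hypothetical, and counting incoming links is a deliberately simple stand-in
for the link analysis algorithms discussed in the literature.

```python
from collections import Counter
from urllib.parse import urldefrag, urljoin


def classify_link(page_url, href):
    """Label a hyperlink found on page_url as intra- or inter-document."""
    target, _fragment = urldefrag(urljoin(page_url, href))
    base, _ = urldefrag(page_url)
    return "intra-document" if target == base else "inter-document"


def in_link_counts(link_graph):
    """Count, for every page, how many other pages link to it.

    link_graph maps a source page to the list of pages it links to.
    Self-links and repeated links from the same source are ignored.
    """
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical three-page site used only for illustration.
LINK_GRAPH = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "c.html"],
}

if __name__ == "__main__":
    print(classify_link("http://example.org/a.html", "#top"))
    print(classify_link("http://example.org/a.html", "b.html"))
    print(in_link_counts(LINK_GRAPH).most_common(1))
```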
Web Usage Mining (WUM): Web usage mining [11] is the application of data mining techniques to find
suitable usage patterns in web usage data in order to understand web-based applications. It is used
to understand users' behaviour and needs with respect to the information available on the website.
It provides data about the referring page, the sequence of pages visited, the information contained
in cookie files and the time the user spent on the site. It is very difficult to track a user
through a site because only bits of information such as IP address, user information and site
clicks are available. Web usage mining is divided into the following types:
Data of Web Server Logs: It contains information about name, IP address, access time and resource
location.
Data of Application Server Logs: It contains information about user activities such as IP address
and request source.
Data of Application Level Logs: It maintains information about the user at application level, such
as the number of hits on a web page, old references and new references.
We also compare the web mining categories based on the availability of data, type of data, methods
and applications.
| Issues | Web Content Mining | Web Structure Mining | Web Usage Mining |
|---|---|---|---|
| Availability of data | Structured / semi-structured | Linked structures | Interactivity |
| Type and form of data | Text / hypertext | Linked | Server log and client log from web browser |
| Methods | Statistical and machine learning methods | Proprietary algorithm | Statistical and machine learning methods |
| Application categories | Applications based on pattern matching | Clustering applications | Marketing and management applications |

Table 2: Categories of Web Mining
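To make the web server log data used in web usage mining concrete, the sketch below parses a few
log lines and derives page hit counts and the sequence of pages visited per IP address. It assumes
that the log follows the widely used Common Log Format; the sample lines, the regular expression
and the function names are assumptions of this sketch, not data from the study.

```python
import re
from collections import Counter, defaultdict

# Assumed Common Log Format: ip ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical log lines used only for illustration.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Feb/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.1 - - [10/Feb/2014:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '192.0.2.2 - - [10/Feb/2014:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.2 - - [10/Feb/2014:10:01:09 +0000] "GET /missing.html HTTP/1.1" 404 0',
]


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def page_hits(lines):
    """Count successful (status 200) requests per resource."""
    hits = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "200":
            hits[entry["path"]] += 1
    return hits


def paths_by_ip(lines):
    """Group requested paths by IP address, in the order they appear in the log."""
    visits = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry:
            visits[entry["ip"]].append(entry["path"])
    return dict(visits)


if __name__ == "__main__":
    print(page_hits(SAMPLE_LOG).most_common())
    print(paths_by_ip(SAMPLE_LOG))
```

Grouping by IP address alone is a rough approximation of a user, which reflects the difficulty of
tracking users noted above.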
IV. APPLICATIONS OF WEB MINING
Web mining techniques are used to extract information from the large database of the web so that
the extracted information can be used for meaningful tasks. As data keeps being added to the web
day by day, a large number of applications have been proposed to make use of the information
extracted through web mining. We introduce a few applications in keeping with the current scope of
the research.
Business Intelligence: Web mining is an application of data mining which helps organizations to
promote their business by reducing the cost of products and to increase profits by selling more
products. Many types of customers and their profiles are available on the internet, which helps
companies to provide services to customers according to their needs. Web mining is a powerful
technique which collects information about customers' activities on a website, helps in the
decision-making process and predicts customers' behaviour. This helps companies to develop new
products and services.
Modification of Website: The structure and data of a website should match customers' preferences.
Different users have different priorities, knowledge and preferences, which makes it difficult to
find a design suitable for all types of users. In this case, web usage mining is used to identify
the types of users accessing the website according to their preferences and knowledge, and to
design the website based on users' priorities.
Improvement in System: In web mining, web traffic analysis traces the navigation path of the user,
which helps to improve the performance and services of websites, for example through caching and
load balancing. The navigation path is also used in the detection of fraud, break-ins etc.
Personalization of Web: It is an attractive application area which helps web-based companies by
allowing them to deliver recommendations, marketing campaigns etc., and to do this automatically in
real time when the user accesses the website.
V. COMPARATIVE ANALYSIS OF ISSUES IN WEB MINING
The present study focuses on identifying the issues related to the different phases of the web
mining process and categorizing them so that they can be explored with relevant suggestions. The
issues fall into the following categories: social issues, technical issues and legal issues.
Social Issues: The social issues cover privacy, optimum use of data and reliability of data.
• Privacy Issues by Web Mining: Web data mining involves the use of personal data on the web. The
security and privacy of users' personal data on the web is an important issue. Privacy is violated
when a user's personal information is obtained and used without the user's knowledge. When a user's
personal data is extracted through web mining in this way, it constitutes a privacy violation.
• Optimum Extraction of Data: A large amount of data is available on the web, and data is
duplicated at different locations, which raises issues of reliability of data. A mechanism is
required to extract the optimum form of data so that relevant information can be gathered.
• Reliability: The World Wide Web is an open global system for sharing information, which raises
issues of reliability of data. The reliability issue covers validity of data, validity of source
and consistency of data. It calls for policies which ensure the validity and consistency of data
that is available at multiple web resources.
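One simple way to approach the duplication mentioned under optimum extraction of data is to
fingerprint the normalized text of each page and keep one page per fingerprint. The sketch below
does this with a SHA-256 hash from the Python standard library. It detects only exact duplicates
after whitespace and case normalization; the sample pages and function names are assumptions of
this sketch and not a mechanism described in the study.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lower-casing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_pages(pages):
    """Keep one URL per distinct content fingerprint.

    pages maps a URL to its extracted text. URLs are visited in sorted
    order, so the alphabetically first URL of each duplicate group is kept.
    """
    kept = {}
    for url in sorted(pages):
        kept.setdefault(fingerprint(pages[url]), url)
    return sorted(kept.values())


# Hypothetical pages: the first two differ only in spacing and case.
SAMPLE_PAGES = {
    "http://a.example/x": "Web  Mining\nSurvey",
    "http://b.example/y": "web mining survey",
    "http://c.example/z": "Data mining",
}

if __name__ == "__main__":
    print(unique_pages(SAMPLE_PAGES))
```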
Technical Issues: The technical issues cover issues related to the implementation of the web mining
process and issues related to the analysis of patterns to discover information.
• Segmentation of Web Page: In web-based applications, a web page contains information in
heterogeneous form along with additional advertisements and commercials. The objective of any web
mining tool is to extract the main contents of the web page and ignore additional information such
as external links, copyright information and advertisement notes. These issues are related to the
segmentation [12] of web pages, which requires state-of-the-art techniques.
• Noisy Information: The information available on the web is noisy. The noise arises for two main
reasons. First, a web page contains many kinds of information, such as the main content of the
page, advertisements, navigation links, privacy policies etc.; only a small portion is useful and
the rest is considered noise. Second, the web has no control over information, which means anyone
can publish any type of information of his/her choice, so a large amount of very poor quality
content ends up on the web.
• Knowledge Synthesis: Knowledge synthesis [13] is one of the burning issues of web mining, in which
we need to specify the hierarchy of information extracted from multiple resources. The ontology of
information should be generalized to cater for heterogeneous information gathered from multiple
resources. The information should be synthesised in such a way that it is presentable and
correlated with the contents.
• Heterogeneous Information: Multiple pages present similar information in different words or
formats on the web. This makes the integration of information from multiple pages a challenging
problem in web mining.
• Integrity: The law of integrity requires that data available over the web be correct and
consistent. A web mining tool collects data from the web, but checking integrity is still a manual
job, as the automation of integrity rules cannot be generalized to heterogeneous data [14].
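The difficulty of integrating heterogeneous information can be illustrated with a small textual
similarity sketch. It compares descriptions taken from different pages by the Jaccard similarity of
their word sets and pairs each description with its closest counterpart above a threshold. The
sample descriptions, the threshold of 0.5 and the function names are assumptions of this sketch; it
is one simple similarity measure, not the method of any cited work.

```python
import re


def tokens(text):
    """Return the set of lower-cased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match_records(left, right, threshold=0.5):
    """Pair each left record with its most similar right record, if similar enough."""
    pairs = []
    for record in left:
        best = max(right, key=lambda other: jaccard(record, other), default=None)
        if best is not None and jaccard(record, best) >= threshold:
            pairs.append((record, best))
    return pairs


# Hypothetical descriptions of the same product as shown on two different sites.
SITE_A = ["Dell Inspiron 15 laptop"]
SITE_B = ["Laptop: Dell Inspiron 15 (black)", "HP Pavilion laptop"]

if __name__ == "__main__":
    print(match_records(SITE_A, SITE_B))
```

A word-set measure ignores word order and synonyms, so records that express the same fact in
entirely different words would still be missed.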
Figure 3: Web Mining Issues Classification
Law Enforcement: The issues covered in the law enforcement category concern law policies,
authentication and crime detection.
• Law Policies for Data Sharing: The increase of sensitive information over the internet results in
criminal activities such as piracy and leakage of sensitive organizational information on the web.
Every organization should have certain law policies regarding the sharing of information over the
internet.
• Authentication: The information over the web is available globally, which allows misuse of
sensitive information in anti-social activities. There must be some mechanism to identify the user
of information so that he/she can be tracked to avoid criminal activities.
• Crime Detection: Information over the web in the form of emails and documents can be processed
using web mining to find evidence and clues.
VII. CONCLUSIONS
Web mining is an application of data mining techniques to discover useful information from the
World Wide Web. Data mining extracts information from databases, whereas web mining discovers data
from the web. Data can be text, images, audio, video, tables etc. Web mining ranks various
websites, which helps organizations to find users' behaviour, needs, preferences etc. so that they
can promote their products properly and gain maximum profit. However, this technique suffers from
various problems, such as poor quality of data due to noisy information provided by a number of
websites. Users' personal data is available on web resources and anyone can use this data without
the user's knowledge, which creates privacy issues. In this paper, we have categorized different
issues into three major categories, namely social, technical and law enforcement issues. Further,
we have conducted a comparative analysis of these issues. It has been observed that most of the
problems related to noisy data are due to the heterogeneous structure of web resources, which can
be sorted out if web resources use a uniform format of information representation such as XML
structures.
REFERENCES
[1] D. Fetterly, M. Manasse, M. Najork, and J. Wiener, "A Large-Scale Study of the Evolution of Web
Pages", in Proceedings of the 12th International World Wide Web Conference, pp. 669–678, 2003.
[2] R. Kosala, "Web Mining Research: A Survey", ACM SIGKDD Explorations, Vol. 2, pp. 1–15, 2000.
[3] R. Kosala, H. Blockeel and F. Neven, "An Overview of Web Mining", in J. Meij (ed.), Dealing
with the Data Flood: Mining Data, Text and Multimedia, pp. 480–497, STT, Rotterdam, 2002.
[4] H. Ino, M. Kudo, and A. Nakamura, "Partitioning of Web Graphs by Community Topology", in
Proceedings of the 14th International Conference on World Wide Web, pp. 661–66, 2005.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[6] R. Kosala, H. Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, Newsletter of
the ACM Special Interest Group on Knowledge Discovery and Data Mining, Vol. 2, pp. 1–15, 2000.
[7] Raymond Kosala, Hendrik Blockeel, "Web Mining Research: A Survey", ACM SIGKDD Explorations
Newsletter, Vol. 2, 2000.
[8] B. Mirkin, "Clustering for Data Mining: A Data Recovery Approach", Chapman & Hall/CRC, April
29, 2005.
[9] P. Ravi Kumar, Singh Ashutoshkumar, "Web Structure Mining Exploring Hyperlinks and Algorithms
for Information Retrieval", American Journal of Applied Sciences, Vol. 7, pp. 840–845, 2010.
[10] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46 (5),
pp. 604–632, 1999.
[11] R. Cooley, "Web Usage Mining: Discovery and Application of Interesting Patterns from Web
Data", PhD thesis, University of Minnesota, 2000.
[12] K. Lerman, L. Getoor, S. Minton, and C. Knoblock, "Using the Structure of Web Sites for
Automatic Segmentation of Tables", in Proceedings of the ACM SIGMOD International Conference on
Management of Data (SIGMOD'04), pp. 119–130, 2004.
[13] R. Cooley, B. Mobasher, and J. Srivastava, "Data Preparation for Mining World Wide Web
Browsing Patterns", Knowledge and Information Systems, pp. 5–32, 1999.
[14] W. W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries
Based on Textual Similarity", in Proceedings of the ACM SIGMOD Conference on Management of Data,
pp. 201–212, 1998.
www.irjcjournals.org