International Journal of Research in Advent Technology, Vol.2, No.6, June 2014
E-ISSN: 2321-9637
Minimizing the Repeated Database Scan Using an
Efficient Frequent Pattern Mining Algorithm in Web
Usage Mining
Devinder Kaur1, Ravneet Kaur2
Student of Master Technology1, Assistant Professor2
Department of Computer Science and Engineering1, 2
Sri Guru Granth Sahib World University
Fatehgarh Sahib, Punjab, India
Abstract— Data mining is the process of discovering new patterns and knowledge from large datasets. Web
mining is the application of data mining techniques to extract useful knowledge and interesting
patterns from the World Wide Web. Web data includes web documents, hyperlinks between documents, and usage logs
of web sites. Web usage data captures the identity and origin of web users along with their surfing behaviour at
a website. The aim of discovering frequent patterns in web log data is to obtain information about the
navigational behaviour of the users. Mining frequent patterns from web log data can help to optimise the
structure of a web site and improve the performance of web servers. In this paper, we first investigate the process
used to find maximal forward references. We then propose a new approach that modifies this process
through a backward scan algorithm, which scans the web log database starting from the longest candidate
level. The modified algorithm is expected to reduce the time and space complexity of frequent pattern mining
in web usage mining.
Index Terms—Data mining, World Wide Web, traversal pattern.
1. INTRODUCTION
Data mining is the process of discovering interesting
knowledge from large amounts of data. Knowledge
discovery in databases is the non-trivial process of
identifying valid, potentially useful and ultimately
understandable patterns in data [1]. Web mining
combines two active research areas: data mining and
the World Wide Web (WWW). It can be broadly
defined as the discovery of useful information from
the WWW, that is, the application of data mining
techniques to find interesting and potentially useful
knowledge in web documents, services and other
web data. Web mining has three types: web content
mining, web structure mining and web usage mining.
It is normally expected that either the hyperlink
structure of the web or the web log data or both have
been used in the mining process [2].
Table 1 Relationship between Content, Structure and Usage mining
Type | Structure | Forms | Object | Collection
Usage | Accessing | Click | Behaviour | Logs
Content | Pages | Text | Index | Pages
Structure | Map | Hyperlinks | Map | Hyperlinks
2. LITERATURE REVIEW
Hemant Kumar Singh and Brijendra Singh [3]
surveyed web content mining, web structure mining
and web usage mining. The aim of their paper is to
review past and current work in each of these three
types of web mining. It also presents comparisons
and a summary of various web data mining methods
with their applications. Jaideep Srivastava et al. [4]
describe the phases of web usage mining, namely
preprocessing, pattern discovery and pattern
analysis, and its applications. They provide a
detailed taxonomy of the work in this area,
including research as well as commercial offerings.
Paweł Weichbroth et al. [5] describe the problem of
mining
access patterns from web logs efficiently. A novel
data structure, called the Web access pattern tree, or
WAP-tree for short, is developed for efficient
mining of access patterns from pieces of logs. The
Web access pattern tree stores highly compressed,
critical information for access pattern mining and
facilitates the development of novel algorithms for
mining access patterns in large sets of log pieces.
Renata Ivancsy and Istvan Vajk [6] explained that
the aim of discovering frequent patterns in web log
data is to obtain information about the navigational
behaviour of the users. The different patterns in
web log mining are page sets, page sequences and
page graphs. These can be used for advertising
purposes, for creating dynamic user profiles, etc.
Yan Li et al. [7] proposed a path completion
algorithm that efficiently appends the lost
information and improves the reliability of access
data for further web usage mining calculations.
Ming-Syan Chen et al. [8] proposed a data mining
algorithm for mining path traversal patterns in a
distributed information-providing environment
where documents or objects are linked together to
facilitate interactive access. Their solution
procedure consists of two steps. First, an algorithm
converts the original sequence of log data into a set
of maximal forward references. Second, another
algorithm determines the frequent traversal
patterns, i.e. large reference sequences, from the
maximal forward references obtained. Jianhan Zhu
et al. [9] applied Markov chains to model user
navigational behaviour. They proposed a method for
constructing a Markov model of a web site based on
past visitor behaviour. The Markov model is then
used to make link predictions that assist new users
in navigating the web site. Wang Tong et al. [10]
offer an improved algorithm based on the original
AprioriAll algorithm. The new algorithm adds the
User-id property at every step of producing the
candidate set and at every step of scanning the
database, and uses it to decide whether an item in
the candidate set should be put into the large set
that will be used to produce the next candidate set.
Hengshan Wang et al. [11] introduced two prevalent
data mining algorithms, FP-growth and PrefixSpan,
into web usage mining (WUM). Maximum Forward
Path (MFP) is also used in the web usage mining
model during sequential pattern mining along with
PrefixSpan so as to reduce the interference of
“false visits” caused by the browser cache and to
improve the mining of frequent traversal paths.
Sandeep Singh Rawat et al. [12] proposed a
custom-built Apriori algorithm, based on the
classical Apriori algorithm, for effective pattern
analysis. Ankit R. Kharwar et al. [13] implemented
the high-level process of web usage mining using
the basic association rule algorithm, Apriori. Their
work finds association rules from server logs, which
are useful in many applications such as web page
caching, marketing and targeted advertising. Mr.
Rahul Mishra and Ms. Abha Choubey [14] proposed
using the FP-growth algorithm to obtain frequent
access patterns from web log data and to provide
valuable information about the user’s interest.
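The conversion of a log sequence into maximal forward references, summarised above for [8], can be illustrated with a minimal Python sketch. It assumes that a session is given as an ordered sequence of page identifiers and that revisiting a page already on the current path counts as a backward move. The function name maximal_forward_references and the sample session are used only for illustration, and the sketch is not presented as the exact procedure of [8].

```python
def maximal_forward_references(pages):
    """Split one user's page sequence into maximal forward references.

    Assumption: a visit to a page that is already on the current path
    is a backward move; any other visit is a forward move.
    """
    path = []               # current forward path
    moving_forward = False  # True while the last move was a forward move
    result = []
    for page in pages:
        if page in path:
            # Backward move: the path built so far is maximal only if
            # the previous move was a forward one.
            if moving_forward:
                result.append(tuple(path))
            # Cut the path back to the revisited page.
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The session may end in the middle of a forward path.
    if moving_forward:
        result.append(tuple(path))
    return result


# Hypothetical session used only for illustration.
session = "ABCDCBEGHGWAOUOV"
for reference in maximal_forward_references(session):
    print("".join(reference))
# prints ABCD, ABEGH, ABEGW, AOU, AOV (one per line)
```

Each backward move closes the current forward path, so the frequent traversal patterns of the second step can then be mined from these references instead of from the raw log.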
3. WEB MINING CHALLENGES
Today the World Wide Web is a popular and
interactive medium for distributing information.
Because the web is huge, diverse, dynamic and
unstructured, web data research faces many
challenges. Information users can encounter the
following challenges when interacting with the web.
1. Finding Relevant Information:- People either
browse or use a search service when they want to
find specific information on the web. Today’s
search tools suffer from low precision, caused by
the irrelevance of many of the search results, which
makes it difficult to find the relevant information.
Another problem is low recall, caused by the
inability to index all the information available on
the web.
2. Creating new knowledge out of the information
available on the web:- This problem is essentially a
sub-problem of the one above. The problem above
is a query-triggered (retrieval-oriented) process,
whereas this one is a data-triggered process that
presumes a collection of web data is already
available and extracts potentially useful knowledge
from it.
3. Personalization of information:- When people
interact with the web, they differ in the contents and
presentations they prefer.
4. Learning about Consumers or individual users:-
This problem is about what customers do and want.
It includes sub-problems such as customizing
information to the intended consumers or even
personalizing it for an individual user, as well as
problems related to web site design, management
and marketing [15].
4. BASIC PROBLEM OF FREQUENT ITEMSET
Association rule mining first finds all sets of items
that have support greater than the minimum support,
and then uses these large itemsets to generate the
desired rules that have confidence greater than the
minimum confidence. An algorithm for finding
association rules, named AIS, was proposed by
R. Agrawal et al. in 1993. There are several
algorithms for frequent itemset mining. All the
variations of the Apriori algorithm have some
advantages and disadvantages.
Algorithms | Storage Structure | Advantages | Disadvantages
Apriori | Array based | Any subset of a frequent item set is also a frequent item set. | Multiple scans have to be done on the database. Its time and memory complexity is very large.
Apriori Tid | Array based | Number of entries may be smaller than the number of transactions in the database. | The time and space complexity is also very large.
FP tree | Tree based | Scans the database only twice. It has interactive rule mining. | It seems to be difficult in incremental. It requires less memory and more execution time.
Custom built Apriori | Array based | Based on old Apriori algorithm. It is efficient and effective pattern analysis. | It requires more memory and more execution time.
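The support and confidence measures named at the start of this section can be made concrete with a small Python sketch. It assumes that each transaction is a set of single-letter items and that support is the fraction of transactions containing an itemset. The function names and the sample transactions are illustrative assumptions; the transactions simply reuse the page strings of the web log example given later in this paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Illustrative transactions, each string is a set of page identifiers.
transactions = ["ABE", "ADH", "ACG", "ACG", "AD", "ACG"]

print(support(transactions, "CG"))          # 0.5
print(confidence(transactions, "C", "G"))   # 1.0
print(confidence(transactions, "A", "C"))   # 0.5
```

A rule such as C => G would be reported only if both values exceed the chosen minimum support and minimum confidence.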
5. PROPOSED APPROACH
The proposed algorithm is based on frequent pattern
mining using web log data. The basic objective of the
algorithm is to obtain the maximum traversal path of
the users from the web log database. It scans the web
log database backward, starting from the longest
candidate level. This is intended to make the pattern
mining process more effective.
Algorithm : Backward Scan.
Input : User traversal pattern.
Output : Maximal forward reference.
Step 1 : Input the data set and the min threshold value.
Step 2 : Calculate the length k of the longest itemset in the data set.
Step 3 : Repeat, starting from the longest itemset length k:
a. Generate the candidates of level k.
b. Calculate the count of each candidate.
c. If the count of any candidate is greater than the min
threshold value,
then print the result.
Else decrease k by one and repeat Step 3.
Step 4 : Exit.
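The steps above can be sketched in Python as follows. This is only one possible reading of the pseudocode under several assumptions: the candidates of level k are taken to be the paths of k pages that start at the root of the site graph, a candidate is counted once for every transaction that contains it as a contiguous run, and the comparison with the threshold is strict. The graph edges are inferred from the candidates listed in the illustrative example, and all function and variable names are hypothetical.

```python
def paths_of_length(graph, root, k):
    """All paths of exactly k pages that start at the root of the site graph."""
    paths = [(root,)]
    for _ in range(k - 1):
        paths = [p + (child,) for p in paths for child in graph.get(p[-1], ())]
    return paths


def contains(transaction, candidate):
    """True if candidate occurs as a contiguous run inside transaction."""
    n = len(candidate)
    return any(
        tuple(transaction[i:i + n]) == candidate
        for i in range(len(transaction) - n + 1)
    )


def backward_scan(graph, root, database, min_threshold):
    """Return (level, {candidate: count}) for the longest level with a hit."""
    longest = max(len(t) for t in database)           # Step 2
    for k in range(longest, 0, -1):                   # Step 3, longest first
        candidates = paths_of_length(graph, root, k)  # Step 3a
        counts = {c: sum(contains(t, c) for t in database) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n > min_threshold}
        if frequent:                                  # Step 3c
            return k, frequent
    return 0, {}                                      # nothing passed the threshold


# Assumed site graph and the web log database of the illustrative example.
graph = {"A": ["B", "C", "D"], "B": ["E"], "C": ["F", "G"], "D": ["H"]}
database = ["ABE", "ADH", "ACG", "ACG", "AD", "ACG"]

print(backward_scan(graph, "A", database, 1))
# (3, {('A', 'C', 'G'): 3})
```

With these assumptions the scan stops at level 3, because ACG is counted three times there, and the shorter levels are never examined.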
Illustrative Example
The steps of the proposed algorithm are explained
using a user traversal pattern. The web site has a
tree-like structure in which pages are represented as
nodes, denoted by letters, and hyperlinks are
represented by arrows.
Min threshold = 1
[Figure: tree of web pages with nodes A, B, C, D, E, F, G and H]
Fig 1 User Traversal Pattern
TID | Database
100 | ABE
200 | ADH
300 | ACG
400 | ACG
500 | AD
600 | ACG
Table 2 web log database
The above database is generated from the user
traversal pattern graph in order to obtain the maximal
forward reference.
Step 1 : In the first step, record each transaction from
the graph with its transaction ID and the nodes,
denoted by letters, that are accessed during the
transaction.
Step 2 : In the next step, determine the length of the
longest candidate level from the graph and generate
the candidates of that length.
length
ABE
ACF
ACG
ADH
Table 3 Candidate length
Step 3 : After the candidates of this level have been
generated, count the number of occurrences of each
candidate in the web log database. The candidate
count must satisfy the minimum threshold value.
length | count
ABE | 1
ACF | 0
ACG | 3
ADH | 1
Table 4 length count
The table above shows the count of each candidate.
The maximum candidate count is 3, for ACG, which is
therefore the maximum traversal path of the users. If
no candidate had satisfied the threshold, level 2
candidates would be generated from the graph.
The proposed algorithm is intended to improve on the
efficiency of traditional Apriori algorithms. The
backward scan algorithm is a modified approach for
frequent pattern mining in web usage mining. Its time
and space requirements are expected to be lower than
those of the earlier algorithms because it minimizes
the repeated database scans needed for frequent
pattern mining in web usage mining. The maximal
forward reference can be obtained easily by using this
algorithm.
6. CONCLUSION
In this paper, data mining and the categories of web
mining have first been discussed. Frequent pattern
mining algorithms have also been analyzed along with
their advantages and disadvantages. An algorithm has
been proposed for frequent pattern mining in web
usage mining that is intended to be more efficient
than traditional Apriori algorithms. The backward
scan algorithm first scans the web log database and
obtains the length of the longest candidate level. It
then counts the occurrences of each candidate. Each
candidate count is checked against the minimum
threshold value, and the maximal forward reference is
obtained from the candidate counts. The new approach
minimizes repeated database scans for frequent
pattern mining in web usage mining, which is
expected to reduce execution time and space.
REFERENCES
[1] Pang-Ning Tan, “Introduction to Data Mining,”
Addison Wesley publishers 2006.
[2] Monika Yadav and Mr. Pradeep Mittal, “Web Mining:
An Introduction”, in proceeding of International
journal of Advanced Research in Computer Science
and Software Engineering, vol.3, issue 3, March 2013.
[3] Hemant Kumar Singh and Brijendra Singh, “ Web
Data Mining Research: A Survey”, in proceeding of
IEEE International Conference on Computational
intelligence and Research (ICCIC), pp.1-10, December
2010.
[4] Jaideep Srivastava, Robert Cooley, Mukund
Deshpande, Pang-Ning Tan, “Web Usage Mining:
Discovery and Applications of Usage Patterns from Web
Data”, in proceeding of ACM SIGKDD, vol. 1, issue 2,
pp. 1-12, January 2000.
[5] Paweł Weichbroth, Mieczysław Owoc, Michał
Pleszkun, "Web User Navigation Patterns Discovery
from WWW Server Log Files", in proceeding of IEEE
Federated Conference on Computer Science and
Information Systems pp. 1171–1176, 2012.
[6] Paweł Weichbroth Mieczysław Owoc Michał
Pleszkun, "Web User Navigation Patterns Discovery
from WWW Server Log Files" , in proceeding of IEEE
Federated Conference on Computer Science and
Information Systems pp. 1171–1176, 2012.
[7] Yan Li, BoQin Feng and Qinjiao Mao, "Research on
Path Completion Technique in Web Usage Mining", in
proceeding of IEEE International Symposium on
Computer Science and Computational Technology
(ISCSCT), Vol. 1, pp. 554-559, December 2008.
[8] Ming Syan Chen, Jong Soo Park, and Philip S. Yu,
“Efficient Data Mining for Path Traversal Pattern”, in
proceeding of IEEE transactions on knowledge and
data engineering vol. 10, no. 2, pp. 209-221, April
1998.
[9] Jianhan Zhu, Jun Hong, and John G. Hughes, “Using
Markov Chain for Link Prediction in Adaptive Web
Sites”, in proceeding of First International Conference
of Computing in an Imperfect World, pp. 60-73, 2002.
[10] Wang Tong and He Pi-lian, “Web Log Mining by an
Improved AprioriAll Algorithm”, in proceeding of
Second World Enformatika Conference, 2005.
[11] Hengshan Wang, Cheng Yang and Hua Zeng,
“Design and Implementation of a Web Usage Mining
Model Based On Fpgrowth and Prefixspan”, in
Communications of the IIMA, vol. 6, no. 2, 2006.
[12] Sandeep Singh Rawat and Lakshmi Rajamani, “ Web
Discovering Potential User Browsing Behaviors Using
Custom built Apriori Algorithm”, in proceeding of
International Journal of Computer Science &
Information Technology, Vol.2, No.4, August 2010.
[13] Ankit R. Kharwar, Viral Kapadia, Nilesh Parajapati,
Premal Patel, “Implementation Apriori Algorithm on
Web Serve Log”, in proceeding of National
Conference on Recent Trends in Engineering &
Technology, May 2011.
[14] Mr. Rahul Mishra and Ms. Abha Choubey, “Discovery
of Frequent Patterns from Web Log Data by Using
FP-Growth Algorithm for Web Usage Mining”, in
proceeding of International Journal of Advanced
Research in Computer Science and Software
Engineering vol. 2, Issue 9, September 2012.
[15] Raymond Kosala and Hendrik Blockeel, “Web
Mining Research: A Survey”, in ACM SIGKDD vol. 2,
pp. 1-15, June 2000.
[16] D. Jayalatchumy and Dr. P. Thambidurai, “Web
Mining Research Issues in Future Directions : A
Survey”, in proceeding of IOSR journal of Computer
Engineering Vol. 14, Issue 3, pp. 20-27, October 2013.