International Journal of Multidisciplinary Research and Development
Online ISSN: 2349-4182 Print ISSN: 2349-5979
www.allsubjectjournal.com
Volume 3; Issue 3; March 2016; Page No. 82-83; (Special Issue)
A study on web page prediction using Markov models and page rank algorithm
1 C Thavamani, 2 Dr. A Rengarajan
1 Research Scholar, Bharathiar University, Coimbatore
2 Professor, Veltech Multi Tech SRS Engineering, Avadi, Chennai
Abstract
The web is a large source of information that can be turned into knowledge. Web mining is the application of data mining
techniques to discover patterns from the web. Web logs contain information about web server requests and responses. The purpose
of this paper is to explore ways to exploit the information in web logs to predict users' web page accesses. The Markov model is
the most commonly used prediction model for identifying patterns based on the sequence of previously accessed pages
because of its high accuracy. To predict the next page access, we apply a Markov model to the web session. If the results
are ambiguous, the Page Rank algorithm is used to decide the desired page.
Keywords: Web mining, prediction model, Markov models, page rank algorithm
Introduction
Web usage mining is the application of data mining techniques
to discover usage patterns from Web data in order to
understand and better serve the needs of Web-based applications.
It consists of three phases: pre-processing, pattern
discovery, and pattern analysis. Web servers, proxies, and
client applications can quite easily capture data about Web
usage. Various attempts have been made to exploit web
page access prediction by preprocessing web server log files
and analyzing web users' navigational patterns. The Markov
model calculates the probability of the page the user
will visit next after visiting a sequence of web pages in the
same session. Markov model implementations have been
hindered by the fact that low-order Markov models do not
use enough history and therefore lack accuracy, whereas
high-order Markov models incur high state-space complexity.
Literature Study
A number of researchers have attempted to improve the precision
or coverage of web page access prediction by combining different
recommendation frameworks. For instance, many papers
combined clustering with association rules (Lai and Yang
2000, Liu et al. 2001) [1]. Lai & Yang (2000) introduced
an approach to customized marketing on the Web using a
combination of clustering and association rules. The authors
collected information about customers using forms, Web
server log files, and cookies. They categorized customers
according to the information collected. Since the k-means
clustering algorithm works only with numerical data, the
authors used the PAM (Partitioning Around Medoids) algorithm to
cluster data using categorical scales. They then applied
association rule techniques to each cluster. They showed
through experiments that applying association rules
to clusters achieves better results than applying them to
non-clustered data for customizing the customers' marketing
preferences.
Markov Models
Markov models [2] have been extensively used for predicting
the action a user will take next given the sequence of actions
he or she has already performed. For this type of problem,
Markov models are represented by three parameters <X; Y; T>,
where X is the set of all possible actions that can be
performed by the user; Y is the set of all possible states for
which the Markov model is built; and T is a |Y| × |X| Transition
Probability Matrix (TPM), where each entry t_ij corresponds to
the probability of performing the action j when the process is
in state i. The state-space of the Markov model depends on the
number of previous actions used in predicting the next action.
The simplest Markov model predicts the next action by only
looking at the last action performed by the user. In this model,
also known as the first-order Markov model, each action that
can be performed by a user corresponds to a state in the
model. A somewhat more complicated model computes the
predictions by looking at the last two actions performed by the
user. This is called the second-order Markov model, and its
states correspond to all possible pairs of actions that can be
performed in sequence. This approach is generalized to the
Kth-order Markov model, which computes the predictions by
looking at the last K actions performed by the user, leading to
a state-space that contains all possible sequences of K actions.
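As a worked illustration of the TPM entries, one common choice (an assumption here, since the text does not state which estimator is used) is the maximum-likelihood ratio of observed counts. The symbol c(i, j) is introduced only for this sketch and denotes the number of times action j was observed immediately after state i in the training sessions; the predicted action for state i is then the one with the largest entry in row i:

```latex
t_{ij} = \frac{c(i, j)}{\sum_{x \in X} c(i, x)},
\qquad
\hat{a}(i) = \arg\max_{j \in X} t_{ij}
```

For a Kth-order model the state i is a sequence of K actions, so the same ratio applies with c(i, j) counted over length-K histories.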
Example: Suppose a user session A = <1,2,3,4,5,6,7>
is the sequence of pages a user has visited. Suppose also
that we use a sliding window of size 5. We apply feature
extraction to A and end up with the following
user sessions of length 5: B = <1,2,3,4,5>, C = <2,3,4,5,6>,
and D = <3,4,5,6,7>. Note that the outcomes (labels) of
sessions A, B, C, and D are 7, 5, 6, and 7, respectively. This way,
we end up with four user sessions: A, B, C, and
D. In general, the number of sessions extracted using a
sliding window of size w from an original session A is
|A| - w + 1. To extract more knowledge from the user sessions,
we use what we call a frequency matrix:
        1           2           ….          N
1       0           Freq(2,1)   Freq(….,1)  0
2       Freq(1,2)   0           Freq(….,2)  Freq(N,2)
….      ….          ….          ….          ….
N       Freq(1,N)   Freq(2,N)   Freq(….,N)  0
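The sliding-window extraction from the example above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name sliding_window_sessions is hypothetical and sessions are assumed to be plain Python lists of page identifiers.

```python
def sliding_window_sessions(session, w):
    """Return every contiguous sub-session of length w, in order."""
    if w <= 0:
        raise ValueError("window size must be positive")
    # A session of length n yields n - w + 1 windows (none if n < w).
    return [session[i:i + w] for i in range(len(session) - w + 1)]


A = [1, 2, 3, 4, 5, 6, 7]
extracted = sliding_window_sessions(A, 5)
for s in extracted:
    # The last page of each sub-session is its outcome (label).
    print(s, "label:", s[-1])
```

With w = 5 and |A| = 7 this yields the three sub-sessions B, C, and D from the example, matching |A| - w + 1 = 3; together with the original session A that gives the four sessions mentioned in the text.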
Sample web sessions for the example:
WS1 : {3,2,1,4,5,6,3}
WS2 : {3,2,1}
WS3 : {4,5,2,1,5,4}
WS4 : {3,5,2,1,4,6,7,5}
WS5 : {1,4,2,5,4,6}
Table 1: First order Transition Probability Matrix
1st Order  S1={1}  S2={2}  S3={3}  S4={4}  S5={5}  S6={6}  S7={7}
1          0       4       0       0       0       0       0
2          0       0       2       1       2       0       0
3          0       0       0       0       0       1       0
4          3       0       0       0       2       0       0
5          1       1       1       2       0       0       1
6          0       0       0       2       1       0       0
7          0       0       0       0       0       1       0
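The entries of Table 1 can be reproduced from the sample sessions WS1 to WS5 by counting consecutive page pairs. In the table the entries are raw transition counts, with the column giving the current state and the row giving the next page. The sketch below is illustrative only: the helper names are hypothetical, and dividing each count by its column total is one assumed way of turning the counts into probabilities.

```python
from collections import Counter

sessions = [
    [3, 2, 1, 4, 5, 6, 3],        # WS1
    [3, 2, 1],                    # WS2
    [4, 5, 2, 1, 5, 4],           # WS3
    [3, 5, 2, 1, 4, 6, 7, 5],     # WS4
    [1, 4, 2, 5, 4, 6],           # WS5
]

# counts[(current, nxt)] = times page `nxt` directly follows page `current`
counts = Counter()
for ws in sessions:
    for current, nxt in zip(ws, ws[1:]):
        counts[(current, nxt)] += 1

pages = range(1, 8)


def next_page_probabilities(current):
    """Normalize the counts for one state into transition probabilities."""
    total = sum(counts[(current, p)] for p in pages)
    return {p: counts[(current, p)] / total
            for p in pages if counts[(current, p)]}


def predict(current):
    """Return every next page tied for the highest count from `current`."""
    best = max(counts[(current, p)] for p in pages)
    return [p for p in pages if best > 0 and counts[(current, p)] == best]


print(predict(2))   # [1]     a single candidate
print(predict(4))   # [5, 6]  a tie, i.e. an ambiguous prediction
```

States 4 and 5 each have two equally frequent successors, which is the kind of ambiguous result that a secondary criterion such as Page Rank is meant to resolve.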
References
1. Faten Khalil, Jiuyong Li, Hua Wang, Integrating
Recommendation Models for Improved Web Page
Prediction Accuracy.
2. Deshpande M, Karypis G, Selective Markov Models for
Predicting Web Page Accesses, ACM Transactions on
Internet Technology, 2004; 4(2):163-184.
3. Bing Liu, Web Data Mining: Exploring Hyperlinks,
Contents, and Usage Data, Springer-Verlag Berlin
Heidelberg 2007.
4. Tasawar Hussain, Dr. Sohail Asghar, Dr. Nayyer Masood,
Web Usage Mining: A Survey on Preprocessing of Web
Log File.
5. Payal Gulati, A Novel Approach for Determining Next
Page Access, 2008, IEEE.
Page Rank Algorithm
The importance of a page is proportional to the sum of the
importance scores of pages linking to it. The justification for
using Page Rank for ranking web pages comes from the
random surfer model. Page Rank models the behavior of a
web surfer who browses the Web. The surfer starts from
a random node on the graph, clicks on hyperlinks forever,
and on each page picks a link uniformly at random to move on
to the next page. The number of times the surfer has visited
each page is counted. The Page Rank of a given page is this
number divided by the total number of pages the surfer has
browsed. Page Rank is a static ranking of web pages in the
sense that a Page Rank value is computed for each page offline
and does not depend on search queries [3]. The Web is
treated as a directed graph G = (V, E), where V is the set of
vertices or nodes, i.e., the set of all pages, and E is the set of
directed edges in the graph, i.e., hyperlinks. In Page Rank
calculation, especially for larger systems, an iterative
calculation method is used.
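The iterative calculation can be sketched as a power iteration over the link graph. This is a minimal sketch under stated assumptions rather than the method used in the paper: the damping factor of 0.85, the even redistribution of rank from pages without out-links, the convergence tolerance, and the three-page example graph are all assumptions introduced for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute Page Rank for a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform ranking
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Each page passes its rank to its targets in equal shares.
                new[q] += damping * rank[p] / len(outs)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical three-page graph: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items()):
    print(page, round(score, 4))
```

The scores sum to 1, so they can be read as the long-run fraction of visits the random surfer makes to each page.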
Proposed System
The proposed system focuses on improving the prediction of
web page access. The process is as follows:
Prediction:
Begin
  For each incoming session
    Use the Markov model to make a prediction
    If the prediction is ambiguous
      Use the Page Rank algorithm to make the prediction
    End If
  End For
End
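The procedure above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. It assumes a first-order Markov model, interprets an "ambiguous result" as a tie between equally frequent next pages (the text does not define the term), and uses hypothetical Page Rank scores that are not computed from any real link graph. The function names are likewise hypothetical.

```python
from collections import Counter


def train_first_order(sessions):
    """Count how often each page directly follows each other page."""
    counts = Counter()
    for ws in sessions:
        for current, nxt in zip(ws, ws[1:]):
            counts[(current, nxt)] += 1
    return counts


def predict_next(session, counts, page_rank):
    """Markov prediction; Page Rank breaks ties among equally likely pages."""
    current = session[-1]
    followers = {nxt: c for (cur, nxt), c in counts.items() if cur == current}
    if not followers:
        return None                      # state never seen in training
    best = max(followers.values())
    candidates = [p for p, c in followers.items() if c == best]
    if len(candidates) == 1:
        return candidates[0]             # unambiguous Markov prediction
    # Ambiguous: choose the tied candidate with the highest Page Rank score.
    return max(candidates, key=lambda p: page_rank.get(p, 0.0))


sessions = [
    [3, 2, 1, 4, 5, 6, 3],
    [3, 2, 1],
    [4, 5, 2, 1, 5, 4],
    [3, 5, 2, 1, 4, 6, 7, 5],
    [1, 4, 2, 5, 4, 6],
]
counts = train_first_order(sessions)

# Hypothetical scores, for illustration only.
hypothetical_rank = {2: 0.15, 4: 0.20, 5: 0.10, 6: 0.25}

print(predict_next([3, 2], counts, hypothetical_rank))   # 1 (no tie)
print(predict_next([1, 4], counts, hypothetical_rank))   # 6 (tie 5 vs 6)
```

Keeping the Page Rank scores as a precomputed input reflects the offline nature of the ranking: only the tie-break lookup happens at prediction time.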
Conclusion and Future Work
The Markov model is the most commonly used prediction model
because of its high accuracy. Low-order Markov models have
lower accuracy but higher coverage, whereas higher-order Markov
models and hidden Markov models are more accurate for
predicting navigational paths. Page Rank algorithms and
Markov models are commonly used for next-page prediction. In
addition, the popularity of pages can be considered in Page Rank
as well. However, the similarity of pages is not yet considered by
the page ranking algorithm, and the popularity factor may depend
on the concept of the page. In future work, we plan to propose a
new hybrid technique that integrates a higher-order Markov
model with Popularity and Similarity based Page Rank (PSPR)
models for next-page prediction, which we expect to be more
promising than previous similar models.