AN INSIGHT INTO WEB MINING, IN PARTICULAR WEBLOG FILES, TO
UNDERSTAND AND PREDICT BEHAVIORAL PATTERNS OF WEB USERS USING AN
INTEGRATED MARKOV MODEL

JOTHIS CHEMBATH, Ph.D. Research Scholar, Department of Computer Science,
Karpagam University, Coimbatore, India.
Dr. S. K. MAHENDRAN, Director, SVS Institute of Computer Applications, Coimbatore,
India.
Abstract:
A web server maintains log files listing every request made to the server. A log file
contains a list of events including the access date and time, username, the user’s IP address,
the user’s requests, browser version, operating system of the accessing user, and the URLs of the
sites and pages that referred the visitor to a particular page. This list of events contains a variety of
information that can be used to identify patterns, trends and knowledge, and so to
better predict the behavior of the users who logged on to the server. The present paper
attempts to identify the major issues and challenges associated with web log files and the
potential of web mining for resolving them, and unveils a novel approach to doing so.
Finally, the study proposes solutions that address these problems using an Integrated Markov
Model.
KEYWORDS: Web usage mining, Weblog, Pattern Discovery, Apriori, Markov Model,
Prefetching, WUM
PREAMBLE:
Web mining can be broadly defined as the discovery and analysis of useful information from
the WWW. Web mining tasks differ in their emphasis and in the way the information is obtained.
Web usage mining is the process of extracting useful information from server logs, that is, of
finding out what users are looking for on the Internet. It is the application of data mining
techniques to discover interesting usage patterns from
Web data in order to understand and better serve the needs of Web-based applications. Usage
data captures the identity or origin of Web users along with their browsing behavior at a Web
site. Some users might be looking at only textual data, whereas others might be interested in
multimedia data.
Figure 1. Source: International Journal of Latest Trends in Engineering and Technology
(IJLTET), Vol. 4 Issue 1 May 2014.
Web Mining on Web Log Files
Web mining is one of the applications of data mining: when data mining is applied to the
World Wide Web, the term data mining is replaced with Web mining. One branch is structure
mining, which produces its results by clustering web pages. Web mining can be used to gather, categorize,
organize and provide the best possible information available on the web to the user who
requests it. The other branch is Web usage mining (WUM), which focuses on analyzing
visit information from logged data in order to extract usage patterns. These patterns can be classified
into three categories: similar user groups, relevant page groups and frequent access paths.
Figure 1a: Web Log Processing for data analysis
READ WEB SERVER LOG FILES
A wealth of information about the activities of visitors is available from web server log
files.
Web server log file entries typically look similar to this:
212.209.212.66 - [13/Jun/2015:00:35:33 -0500] "GET /data-mining.htm HTTP/1.1" 200 11631
"http://internetmarketingengine.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)".
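As an illustration, the sample entry above can be split into named fields with a regular expression. This is a minimal sketch, not part of the original study: the field names (host, ident, time, and so on) and the function parse_log_line are chosen here for illustration, and the fields between the host and the timestamp are matched loosely because the sample shows only a single "-" there.

```python
import re
from datetime import datetime

# Pattern for entries shaped like the sample above (an assumption: real
# servers may be configured to log different fields).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>.*?) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that are needed later as typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('212.209.212.66 - [13/Jun/2015:00:35:33 -0500] '
          '"GET /data-mining.htm HTTP/1.1" 200 11631 '
          '"http://internetmarketingengine.com/" '
          '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"')
entry = parse_log_line(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Lines that do not fit the pattern return None, so malformed entries can be counted or discarded during preprocessing.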
ISSUES AND CHALLENGES OF WEB LOG
Web log data is usually diverse and voluminous. It must be assembled
into a consistent, integrated and comprehensive view so that web users can obtain
knowledge as and when required.
The prospective issues and challenges are:
Information retrieval from the huge volume of data held on a Web server is difficult. Finding
the frequent URLs on the World Wide Web, so as to give substantial results for speeding up
Web access, is therefore an important issue. Increasing the usability of Web sites and the speed
with which they can be accessed is another prospect for research.
Structuring web log data so that the different attributes of Web log files can be analyzed helps to
understand the significance of those attributes for effective Web mining. A review of web
access histories shows that Web log files contain many inconsistent, incorrect and missing
values. Web mining preprocessing algorithms therefore need
to be applied, or new algorithms proposed, for efficient management of Web data. Furthermore,
clustering may be used to group users and Web pages according to their Web request patterns,
which may increase the access speed of Web log files.
Discovering frequent usage patterns based on demographic aspects of users, to better serve the
needs of Web-based applications, is a challenge for the researcher. Since each Web server has its own Web
log files, detailed analysis of different servers’ log files may also contribute
substantial results in this area of research.
Server logs are the place from which we can
understand user behavior and approximate a user’s next move. Web
Usage Mining (WUM) is a kind of data mining method that can be used to discover this behavior
and the user’s access patterns from Web log data. This paper first presents an overview of
the concepts and techniques of WUM used to design Web recommender systems. Finally, the
researcher proposes the following solutions, which address these problems.
PROBLEM STATEMENT OF MARKOV MODEL
The system proposed here aims to make optimal use of the web log files of web servers
in order to help user navigation. The main objective of
our work is to collect the common browsing patterns.
Figure 2: Web Usage Mining Process (stages shown: Web Logs, Data Preprocessing, Pattern Discovery, Pattern Analysis)
Web usage mining is the application of data mining techniques to web logs to
discover knowledge about user behavior and website statistics, which is used to enhance
performance and support website design tasks. The ultimate source for web usage mining consists of
textual log files stored on numerous web servers all around the world. Web usage mining
involves the following steps:
Data Collection: User log data is collected from different sources such as web servers, proxy
servers and the client side.
Data Preprocessing: This is a very important step in web usage mining. It performs data
reduction, user identification and session identification.
Pattern Discovery: Different data mining techniques, such as association rules, sequential
patterns, clustering and classification, are applied to identify the users’ patterns.
Pattern Analysis: Once uninteresting rules have been filtered out, analysis is done using query tools
such as SQL to perform specific pattern analysis.
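The session identification part of the preprocessing step can be sketched as follows. This is an illustrative sketch only: it assumes each record has already been reduced to a (user key, timestamp, URL) triple, that the user key (for example the IP address) identifies a user, and that a gap of more than 30 minutes starts a new session. The threshold and the names build_sessions and SESSION_TIMEOUT are assumptions, not values taken from the study.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the paper


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) records into per-user page sequences."""
    # User identification: collect all visits that share the same user key.
    by_user = defaultdict(list)
    for user_key, timestamp, url in records:
        by_user[user_key].append((timestamp, url))

    # Session identification: split each user's visits at long idle gaps.
    sessions = []
    for user_key, visits in by_user.items():
        visits.sort(key=lambda visit: visit[0])
        current = [visits[0][1]]
        for (prev_time, _), (time, url) in zip(visits, visits[1:]):
            if time - prev_time > timeout:
                sessions.append((user_key, current))
                current = []
            current.append(url)
        sessions.append((user_key, current))
    return sessions


start = datetime(1995, 7, 1, 0, 0, 0)
records = [
    ("1.1.1.1", start, "/a"),
    ("1.1.1.1", start + timedelta(minutes=5), "/b"),
    ("2.2.2.2", start + timedelta(minutes=1), "/a"),
    ("1.1.1.1", start + timedelta(hours=2), "/c"),
]
sessions = build_sessions(records)
for user_key, pages in sessions:
    print(user_key, pages)
```

The resulting page sequences are the input that the pattern discovery step works on.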
OBJECTIVE OF THE STUDY
The present study is a descriptive overview of web mining of weblog files, in
terms of the behavioral patterns of web users, using an integrated Markov model.
RESEARCH METHODOLOGY
In the present study, Markov models are used for modeling user behavior (navigation) on the
Web. They are a reasonable choice as they are compact, simple and based on well-established
theory. Several Markov models have been proposed for modelling user Web data: the first-order Markov
model, the hybrid-order tree-like Markov model [10], prediction by partial match forest [7] and
Kth-order Markov models [9]. Recently, it was shown in [8] on a large data set that it is better to use
variable-order Markov models for this purpose. Other techniques, perhaps the most commonly
used, are based on Hidden Markov Models (HMM). In [11] a hierarchical clustering
approach was proposed for decomposing users’ web sessions into non-overlapping temporal
segments. The experimental study showed that such temporal context can be identified
and used for more accurate prediction of the next user action with Markov models.
Users’ past browsing experience is vital in extracting log information. The researcher
proposes using the Apriori algorithm to group user behavior according to web page
visits and then to predict the users’ next move from the applicable cluster. The prediction is
made using a Markov model together with the associated conditional probabilities. It is possible to
estimate the probability of a web user’s next click by examining the clusters together
with the Markov model. The present focus is to improve the prediction accuracy by combining
the Markov model and clustering techniques into what is called the Integrated Markov model. It does so by clustering
the web sessions into groups according to the web page visits made and then using the features of
the Markov model for doing the prediction using clusters of
This process may involve:
1. Preprocess the Web server log files, grouping the users with the Apriori
algorithm for clustering.
2. Decide on the number of clusters and group the Web usage sessions into clusters.
3. Perform Markov model analysis on the whole data set.
4. For each item in the test data set, find the appropriate cluster the item belongs to.
5. Calculate Markov model accuracy using the cluster data as the training data set.
6. Calculate the total prediction accuracy based on clusters.
7. Compare the Markov model accuracy of the clusters to that of the whole data set.
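Steps 3 to 7 can be illustrated with a small first-order Markov model. The sketch below is a toy example under stated assumptions: the sessions and the cluster labels ("news", "shop") are hypothetical and assigned by hand, the model simply predicts the most frequent successor of the current page, and the function names are chosen for illustration. It is not the authors' implementation.

```python
from collections import Counter, defaultdict


def train_first_order(sessions):
    """Count page-to-page transitions observed in the training sessions."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current_page, next_page in zip(pages, pages[1:]):
            transitions[current_page][next_page] += 1
    return transitions


def predict_next(transitions, current_page):
    """Return the most frequent successor of current_page, or None if unseen."""
    successors = transitions.get(current_page)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


def accuracy(transitions, test_sessions):
    """Fraction of test transitions whose next page is predicted correctly."""
    hits = total = 0
    for pages in test_sessions:
        for current_page, next_page in zip(pages, pages[1:]):
            total += 1
            hits += predict_next(transitions, current_page) == next_page
    return hits / total if total else 0.0


# Hypothetical training and test sessions, already grouped into clusters.
training_clusters = {
    "news": [["/home", "/news", "/sports"], ["/home", "/news", "/sports"]],
    "shop": [["/home", "/shop", "/cart"], ["/home", "/shop", "/cart"],
             ["/home", "/shop", "/checkout"]],
}
test_clusters = {
    "news": [["/home", "/news", "/sports"]],
    "shop": [["/home", "/shop", "/cart"]],
}

# Step 3: Markov model on the whole data set.
whole_training = [s for group in training_clusters.values() for s in group]
whole_test = [s for group in test_clusters.values() for s in group]
whole_accuracy = accuracy(train_first_order(whole_training), whole_test)

# Steps 4 to 6: one model per cluster, evaluated on that cluster's test items.
hits = total = 0
for label, test_sessions in test_clusters.items():
    model = train_first_order(training_clusters[label])
    n = sum(len(pages) - 1 for pages in test_sessions)
    hits += accuracy(model, test_sessions) * n
    total += n
cluster_accuracy = hits / total

# Step 7: compare the two figures.
print(whole_accuracy, cluster_accuracy)
```

In this toy data the whole-data model mispredicts the step after /home for the "news" sessions, because /shop is the more frequent successor overall, while the per-cluster models do not. Real logs need not behave this way.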
Context-awareness of Integrated Markov model (IMM)
In this paper we present techniques for combining different Markov models so that the
resulting model is less complex, has improved prediction accuracy, and retains the strengths of
the All-Kth-Order Markov models. The primary idea of our work is that the complexities
involved with Markov models of different orders are removed without altering the performance of
the overall scheme.
Our experiments on a variety of data sets have shown that the proposed pruning schemes
consistently outperform the All-Kth-Order Markov model and other single-order Markov
models, both in terms of state complexity and in terms of prediction accuracy. Our
algorithms were developed in the context of web-usage data for predicting the user’s next page
visit.
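For reference, the All-Kth-Order baseline mentioned above can be sketched as a set of models of order 1 to K, where prediction uses the highest order that has seen the recent history and falls back to lower orders otherwise. This is a minimal sketch of that baseline only; the pruning schemes are not shown, and the function names and sample sessions are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_all_kth_order(sessions, max_order):
    """Build one transition table per order k = 1..max_order.

    models[k] maps a tuple of the last k pages to a Counter of next pages.
    """
    models = {k: defaultdict(Counter) for k in range(1, max_order + 1)}
    for pages in sessions:
        for k in range(1, max_order + 1):
            for i in range(len(pages) - k):
                state = tuple(pages[i:i + k])
                models[k][state][pages[i + k]] += 1
    return models


def predict_with_fallback(models, history):
    """Predict with the highest-order model that has seen the recent history.

    Returns (predicted_page, order_used), or (None, 0) if no model applies.
    """
    for k in sorted(models, reverse=True):
        if len(history) < k:
            continue
        state = tuple(history[-k:])
        successors = models[k].get(state)
        if successors:
            return successors.most_common(1)[0][0], k
    return None, 0


# Hypothetical sessions over pages A..E.
sessions = [["A", "B", "C"], ["A", "B", "C"], ["D", "B", "E"]]
models = train_all_kth_order(sessions, max_order=2)

print(predict_with_fallback(models, ["D", "B"]))  # second-order state (D, B) was seen
print(predict_with_fallback(models, ["X", "B"]))  # falls back to first-order state (B,)
```

The number of states grows with every additional order, which is the state complexity that pruning is meant to reduce.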
HOW WILL THE PRESENT MARKOV MODEL WORK?
The researcher first preprocesses (prunes) the raw log data into individual user
sessions and treats all sessions as training data to which the model is fitted in order to
judge the users’ next move. The Integrated Markov Model (IMM) is then used to predict and
discover the users’ behavior. First, the raw log of the server is collected and the
preprocessing steps are carried out. In preprocessing, the Web structure topology is needed to
identify the users’ sessions.
Figure 3: Process of the Integrated Markov Model (IMM) (stages shown: Server Log; Pre Processing; Start Session; End Session; Training the Integrated Markov Model; Predictable Integrated Markov Model (IMM); Results of the Prediction of users next move)
After the data unit is processed, we have to construct a suitable model, which includes the
structure and parameters of the model. The precision of the model mainly depends on the state
transition diagram and the associated parameters. Initially we randomly assign each model
parameter (transition probability A) a value between 0 and 1, satisfying a11 + a12 = 1 and
a21 + a22 = 1 (see Figure 2), and likewise the initial state probability distribution π1 and π2
(π1 + π2 = 1). The Baum-Welch algorithm is then used to obtain suitable values for these parameters.
Then we use the parameters λ = (A, B, π) to discover the hidden states as
consumers browse the website through the Apriori algorithm.
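The random initialisation described above can be sketched as follows. The sketch assumes the usual hidden Markov model reading of λ = (A, B, π), in which B holds the observation probabilities of each state. The number of observation symbols (3) is a hypothetical value, and the Baum-Welch re-estimation itself is not shown.

```python
import random


def random_distribution(size, rng):
    """Draw `size` positive weights and normalise them so that they sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def init_parameters(n_states, n_symbols, seed=0):
    """Random starting point lambda = (A, B, pi) for later re-estimation."""
    rng = random.Random(seed)
    # A[i][j]: probability of moving from state i to state j (each row sums to 1).
    A = [random_distribution(n_states, rng) for _ in range(n_states)]
    # B[i][k]: probability of observing symbol k in state i (each row sums to 1).
    B = [random_distribution(n_symbols, rng) for _ in range(n_states)]
    # pi[i]: probability of starting in state i.
    pi = random_distribution(n_states, rng)
    return A, B, pi


A, B, pi = init_parameters(n_states=2, n_symbols=3)
print(A[0][0] + A[0][1], A[1][0] + A[1][1], pi[0] + pi[1])
```

With two states this reproduces the constraints a11 + a12 = 1, a21 + a22 = 1 and π1 + π2 = 1, up to floating-point rounding.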
Using Apriori Algorithm for clustering the item sets for Prediction
The Apriori Algorithm is an influential algorithm for mining frequent item
sets for boolean association rules.
Key Concepts:
Frequent Item sets: The sets of items which have minimum support (denoted by Li for the ith item set).
Apriori Property: Any subset of a frequent item set must be frequent.
Join Operation: To find Lk, a set of candidate k-item sets is generated by joining Lk-1 with itself.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-item set that is not frequent cannot be a subset of a frequent k-item set.
Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in session do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
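The pseudo-code above can be turned into a short runnable sketch. The version below is illustrative only: it treats each session as a set of visited pages, takes min_support as an absolute count, and uses hypothetical page names. It is not tuned for large logs.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # Count the support of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sessions, each reduced to the set of pages visited.
sessions = [
    {"/home", "/news", "/sports"},
    {"/home", "/news"},
    {"/home", "/shop"},
    {"/news", "/sports"},
]
frequent = apriori(sessions, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this example the three-page candidate is discarded in the prune step because /home and /sports appear together in only one session.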
As is common in association rule mining, given a set of item sets (for instance, sets of
pages visited), the algorithm attempts to find subsets which are common to at least a minimum
number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of candidates
are tested against the data. The algorithm terminates when no further successful extensions are
found. When a user is browsing the website, every click is recorded by the server.
Our IMM can immediately predict the next move by keeping in memory the
pages the user has visited and thereby predicting the next possible click.
Figure 4: An illustration of prediction of the users next click
Experimental Evaluation of present study
The first step is to gather log files from Web servers, classify the entries into clusters
and then do the prediction by using the Apriori algorithm with the Integrated Markov model. For
evaluating the performance of the preprocessing algorithms, the web log dataset from NASA
Kennedy Space Center (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) was used. This
dataset contains the log entries collected during the period
from 01-07-1995 to 31-08-1995, which occupy 373 MB of storage space in uncompressed
form. It has a total of 3,461,612 log entries. The performance measures used for
evaluating the preprocessing algorithms are the percentage reduction obtained in the number of
transactions and in the memory used. The experiments also analyze the effect of these
algorithms on the prediction of the next web page. Precision (P) and recall (R) are used to
measure the performance of information retrieval and information extraction systems. Precision
deals with substitution and insertion errors, while recall deals with substitution and deletion
errors. The F-measure is defined as a weighted combination of P and R. The prediction
model used in this stage was proposed by Jalali et al. (2010) and is referred to as LPA (Longest
common sequence-based Prediction Algorithm).
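The three measures can be computed as in the following sketch. It assumes a set-based evaluation in which the pages predicted for a user are compared with the pages actually visited, and it uses the common F-beta form of the weighted combination, where beta = 1 weights P and R equally. The page names and the function name are hypothetical, and the study may have computed the measures differently.

```python
def precision_recall_f(predicted, actual, beta=1.0):
    """Precision, recall and F-measure of predicted pages against visited pages."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    # Precision: share of the predictions that were correct.
    precision = hits / len(predicted) if predicted else 0.0
    # Recall: share of the visited pages that were predicted.
    recall = hits / len(actual) if actual else 0.0
    denominator = beta ** 2 * precision + recall
    f_measure = ((1 + beta ** 2) * precision * recall / denominator
                 if denominator else 0.0)
    return precision, recall, f_measure


predicted_pages = ["/news", "/sports", "/shop"]
visited_pages = ["/news", "/sports", "/weather", "/home"]
p, r, f = precision_recall_f(predicted_pages, visited_pages)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here two of the three predictions are correct and two of the four visited pages are predicted, so P = 2/3, R = 1/2 and F = 4/7.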
RECOMMENDATION OF THE STUDY
When web users extract information from web logs through the Markov model, it will
help them to see further web sites and related information on whatever they are
searching for, and to find the expected web page
quickly. This study should therefore be extended in the near future through applied research
on this Markov model aimed at minimizing web browsing time and improving accuracy. There is wide
scope and opportunity for web mining researchers to improve this model.
Software developers and other computer engineering professionals should therefore apply this
Markov model in the field of data mining to reduce the present complications and challenges
with the help of experimental research.
CONCLUSION
In this study, a usage navigation pattern prediction system is analysed. The system
consists of four stages. The first stage is data collection, where log entries are collected from
the web server, proxy server, client side etc. In the second stage, the data is preprocessed and
duplicate entries are removed. The result is then used by the proposed IMM to make predictions
for potential users. The researcher has observed that this model will improve the overall accuracy of
prediction. Precision and recall are useful measures of performance for information retrieval
and extraction. Precision deals with substitution and insertion errors, while recall deals with
substitution and deletion errors.
REFERENCES
1. Chun-Jung Lin, Fan Wu, Han Chiu. Using Hidden Markov Model to Predict the Surfing
User’s Intention of Cyber Purchase on the Web.
2. Amit Pratap Singh, Dr. R. C. Jain. A Survey on Different Phases of Web Usage Mining
for Anomaly User Behavior Investigation.
3. Priyanka Makkar et al. (IJCSE) International Journal on Computer Science and
Engineering, Vol. 02, No. 04, 2010, 1233-1236.
4. A. O. Alves and F. C. Pereira. Making sense of location context. 2012.
5. R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order markov
models. Journal of Artificial Intelligence Research (JAIR), 22:385–421, 2004.
6. V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of
communities in large networks. Journal of Statistical Mechanics: Theory and Experiment,
10, 2008.
7. J. Borges and M. Levene. Evaluating variable-length markov chain models for analysis
of user web navigation sessions. IEEE Trans. Knowl. Data Eng, 2007.
8. X. Chen and X. Zhang. A popularity-based prediction model for web prefetching.
Computer, 36(6):63–70, 2003.
9. F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlós. Are web users really markovian?
In WWW, pages 609–618, 2012.
10. M. Deshpande and G. Karypis. Selective markov models for predicting web page
accesses. ACM Trans. Internet Techn. (TOIT), 4(2):163–184, 2004.
11. X. Dongshan and S. Junyi. A new markov model for web access prediction. Computing
in Science and Engineering, 4(6):34–39, 2002.
12. J. Kiseleva, H. T. Lam, M. Pechenizkiy, and T. Calders. Discovering temporal hidden
contexts in web sessions for user trail prediction. In Proceedings of the 22nd International
Conference on World Wide Web (Companion Volume, TempWeb@WWW’2013), pages 1067–1074.
13. P. Saravana kumar, R. Iswarya. International Journal of Latest Trends in Engineering and
Technology (IJLTET), Vol. 4 Issue 1 May 2014.