
AN INSIGHT INTO WEB MINING, IN PARTICULAR WEBLOG FILES, TO UNDERSTAND AND PREDICT BEHAVIORAL PATTERNS OF WEB USERS USING INTEGRATED MARKOV MODEL

JOTHIS CHEMBATH, Ph.D. Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, India.
Dr. S. K. MAHENDRAN, Director, SVS Institute of Computer Applications, Coimbatore, India.

Abstract: A web server maintains log files that list every request made to the server. A log file contains a list of events, including the access date and time, username, the user's IP address, the user's requests, browser version, operating system of the accessing user, and the URLs of the sites and pages that referred the visitor to a particular page. This list of events holds a variety of information that can be used to identify patterns and trends and so to better predict the behavior of the users who logged on to the server. This paper identifies the major issues and challenges associated with web log files, examines the potential of web mining for resolving them, and outlines a novel approach to doing so. Finally, the study proposes solutions that address these problems using an Integrated Markov Model.

KEYWORDS: Web usage mining, Web log, Pattern Discovery, Apriori, Markov Model, Prefetching, WUM

PREAMBLE: Web mining can be broadly defined as the discovery and analysis of useful information from the WWW. Web mining tasks are distinguished by their emphasis and by the way the information is obtained. Web usage mining is the process of extracting useful information from server logs, that is, of finding out what users are looking for on the Internet. More precisely, Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications.
Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Some users might be looking only at textual data, whereas others might be interested in multimedia data.

Source: Figure 1, International Journal of Latest Trends in Engineering and Technology (IJLTET), Vol. 4, Issue 1, May 2014.

Web Mining on Web Log Files

Web mining is one of the applications of data mining: when data mining is applied to the World Wide Web, it is called Web mining. Web mining can be used to gather, categorize, organize and provide the best available information on the web to the user who requests it. One approach is the clustering of web pages through structure mining. The other is Web usage mining (WUM), which analyzes visit information from logged data in order to extract usage patterns. These patterns can be classified into three categories: similar user groups, relevant page groups and frequently accessed paths.

Figure 1a: Web Log Processing for data analysis

READ WEB SERVER LOG FILES

A wealth of information about the activities of visitors is available from web server log files. Web server log file entries typically look similar to this:

212.209.212.66 - [13/Jun/2015:00:35:33 -0500] "GET /data-mining.htm HTTP/1.1" 200 11631 "http://internetmarketingengine.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

ISSUES AND CHALLENGES OF WEB LOG

Web log data is usually diverse and voluminous. It must be assembled into a consistent, integrated and comprehensive view so that web users can obtain knowledge as and when required. The prospective issues and challenges are as follows. Information retrieval from the huge volume of data available on a Web server is difficult. Finding the frequently requested URLs on the World Wide Web, so that Web access can be sped up, is therefore an important issue.
Increasing the usability of Web sites and the speed at which they can be accessed is another avenue of research. Structuring web log data so that its attributes can be analyzed helps to establish which attributes matter for effective Web mining. A review of web access histories shows that Web log files contain many inconsistent, incorrect and missing values. Existing preprocessing algorithms therefore need to be applied, or new ones proposed, for efficient management of Web data. Furthermore, clustering may be used to group users and Web pages according to their Web request patterns, which may speed up access to Web log data. Discovering frequent usage patterns based on demographic aspects of users, so as to better serve the needs of Web-based applications, remains a challenge for researchers. Since each Web server keeps its own Web log files, detailed analysis of the log files of different servers may also contribute substantial results in this area of research.

Server logs are where user behavior can be observed and a user's next move estimated. Web Usage Mining (WUM) is a data mining method that can be used to discover this behavior and the user's access patterns from Web log data. This paper first presents an overview of the concepts and techniques of WUM used to design Web recommender systems. It then proposes the following solutions to address these problems.

PROBLEM STATEMENT OF MARKOV MODEL

The proposed system aims to make the best use of the web log files of web servers in order to help user navigation. The main objective of our work is to collect common browsing patterns.
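As a concrete starting point for collecting browsing patterns, the sample log entry shown earlier can be split into named fields. The following is a minimal sketch, not the authors' preprocessing code: the field names, the `LOG_PATTERN` regular expression and the `parse_log_line` helper are assumptions introduced here, and the pattern is written to tolerate one or two dash fields between the host and the timestamp because the sample entry shows only one.

```python
import re
from datetime import datetime

# Hypothetical pattern for entries shaped like the sample in this paper.
# The second dash field (authenticated user) is optional.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+)\s+(?P<ident>\S+)\s+(?:(?P<user>\S+)\s+)?'
    r'\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)'
    r'(?:\s+"(?P<referrer>[^"]*)"\s+"(?P<agent>[^"]*)")?'
)

SAMPLE = (
    '212.209.212.66 - [13/Jun/2015:00:35:33 -0500] '
    '"GET /data-mining.htm HTTP/1.1" 200 11631 '
    '"http://internetmarketingengine.com/" '
    '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


if __name__ == "__main__":
    parsed = parse_log_line(SAMPLE)
    print(parsed["host"], parsed["path"], parsed["status"], parsed["size"])
```

Lines that do not match return `None`, so malformed entries can be counted and dropped during data cleaning rather than stopping the run.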
Figure 2: Web Usage Mining Process (web logs, data preprocessing, pattern discovery, pattern analysis)

Web usage mining is the application of data mining techniques to web log repositories to discover knowledge about user behavior, together with website statistics that are used to improve performance and support website design tasks. The ultimate source for web usage mining is the textual log files stored on numerous web servers all around the world. Web usage mining consists of the following steps:

Data Collection: User log data is collected from different sources such as web servers, proxy servers and the client side.
Data Preprocessing: This is a very important step in web usage mining. It performs data reduction, user identification and session identification.
Pattern Discovery: Different data mining techniques, such as association rules, sequential patterns, clustering and classification, are applied to identify users' patterns.
Pattern Analysis: Once uninteresting rules have been filtered out, analysis is done using query tools such as SQL to perform specific pattern analysis.

OBJECTIVE OF THE STUDY

The present study is descriptive in nature. It gives an overview of mining web log files to understand the behavioral patterns of web users using an integrated Markov model.

RESEARCH METHODOLOGY

For modeling user behavior (navigation) on the Web, Markov models are a reasonable choice because they are compact, simple and based on well-established theory. Several Markov models have been proposed for modeling user Web data: the first-order Markov model, the hybrid-order tree-like Markov model [10], prediction by partial match forest [7] and kth-order Markov models [9]. Recently, it was shown in [8] on a large data set that variable-order Markov models are better suited to this purpose. Other techniques, perhaps the most commonly used, are based on Hidden Markov Models (HMM).
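To make the simplest of these models concrete, the sketch below estimates a first-order Markov model from page sessions and uses it to predict the next page. It is an illustration under stated assumptions rather than the model evaluated in this paper: the function names `train_first_order` and `predict_next` and the example sessions are hypothetical.

```python
from collections import Counter, defaultdict


def train_first_order(sessions):
    """Estimate P(next page | current page) from a list of page sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count every consecutive (current, next) pair in the session.
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    model = {}
    for page, followers in counts.items():
        total = sum(followers.values())
        model[page] = {nxt: n / total for nxt, n in followers.items()}
    return model


def predict_next(model, page):
    """Return the most probable next page, or None for an unseen page."""
    followers = model.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical sessions, invented for illustration only.
SESSIONS = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/faq"],
    ["/", "/about"],
    ["/products", "/cart"],
]

if __name__ == "__main__":
    model = train_first_order(SESSIONS)
    print(model["/"])
    print(predict_next(model, "/products"))
```

A kth-order model follows the same pattern with the last k pages as the dictionary key, which is why the number of states grows quickly with k.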
In [11], a hierarchical clustering approach was proposed for decomposing users' web sessions into non-overlapping temporal segments. The experimental study showed that such temporal context can be identified and used for more accurate prediction of the user's next action with Markov models.

Users' past browsing experience is vital in extracting log information. The researcher proposes using the Apriori algorithm to group user behavior according to web page visits and then predicting the user's next move from the applicable cluster. The prediction is made with a Markov model and its associated conditional probabilities: by examining the clusters together with the Markov model, the probability of a web user's next click can be computed. The present focus is to improve prediction accuracy by combining the Markov model with clustering techniques, which is what is meant by the Integrated Markov Model. Web sessions are clustered into groups according to the web pages visited, and the Markov model is then used for prediction within those clusters. This process may involve:

1. Preprocessing the Web server log files and grouping the users using the Apriori algorithm for clustering.
2. Deciding on the number of clusters and grouping the Web usage sessions into clusters.
3. Performing Markov model analysis on the whole data set.
4. For each item in the test data set, finding the cluster the item belongs to.
5. Calculating the Markov model accuracy using the cluster data as the training data set.
6. Calculating the total prediction accuracy based on clusters.
7. Comparing the Markov model accuracy of the clusters to that of the whole data set.

Context-awareness of Integrated Markov model (IMM)

In this paper we present techniques for combining different Markov models so that the resulting model is less complex, has improved prediction accuracy, and retains the strengths of the All-Kth-Order Markov models.
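Steps 3 to 7 of the process listed above can be sketched as a comparison between one Markov model trained on all sessions and one model per cluster. This is a toy illustration, not the authors' implementation: it assumes each session already carries a cluster label (the paper obtains the grouping with Apriori), and the names `train`, `score`, `compare` and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def train(sessions):
    """Count first-order page transitions over the given sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return counts


def score(model, session):
    """Return (correct predictions, predictions attempted) for one session."""
    hits = total = 0
    for current, nxt in zip(session, session[1:]):
        total += 1
        followers = model.get(current)
        if followers and followers.most_common(1)[0][0] == nxt:
            hits += 1
    return hits, total


def compare(train_data, test_data):
    """Return (whole-data-set accuracy, cluster-based accuracy)."""
    global_model = train(s for _, s in train_data)
    grouped = defaultdict(list)
    for label, session in train_data:
        grouped[label].append(session)
    cluster_models = {label: train(ss) for label, ss in grouped.items()}

    global_hits = cluster_hits = total = 0
    for label, session in test_data:
        g_hits, n = score(global_model, session)
        # Fall back to the global model for a cluster unseen in training.
        c_hits, _ = score(cluster_models.get(label, global_model), session)
        global_hits += g_hits
        cluster_hits += c_hits
        total += n
    if total == 0:
        return 0.0, 0.0
    return global_hits / total, cluster_hits / total


# Hypothetical labelled sessions, invented for illustration only.
TRAIN = [
    ("shopper", ["/home", "/products", "/cart"]),
    ("shopper", ["/home", "/products", "/cart"]),
    ("reader", ["/home", "/blog", "/blog/post"]),
    ("reader", ["/home", "/blog", "/blog/post"]),
    ("reader", ["/home", "/blog", "/about"]),
]
TEST = [
    ("shopper", ["/home", "/products", "/cart"]),
    ("reader", ["/home", "/blog", "/blog/post"]),
]

if __name__ == "__main__":
    print(compare(TRAIN, TEST))  # (0.75, 1.0) on this toy data
```

On this toy data the per-cluster models avoid the one error the global model makes at `/home`. That result comes from how the example was constructed and says nothing about accuracy on real logs.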
The primary idea of our work is that the complexity involved in combining Markov models of different orders can be eliminated without degrading the performance of the overall scheme. Our experiments on a variety of data sets have shown that the proposed pruning schemes consistently outperform the All-Kth-Order Markov model and other single-order Markov models, both in terms of state complexity and prediction accuracy. Our algorithms were developed in the context of web usage data for predicting the user's next page visit.

HOW WILL THE PRESENT MARKOV MODEL WORK?

The raw log data is first preprocessed (pruned) into individual user sessions, and all sessions are treated as training data to which the model is fitted in order to judge the user's next move. The Integrated Markov Model (IMM) is then used to predict and discover the user's behavior. First, the raw server log is collected and the preprocessing steps are carried out. Preprocessing requires the Web structure topology in order to recover the user sessions.

Figure 3: Process of the Integrated Markov Model (IMM) (server log, preprocessing, start session and end session, training the Integrated Markov Model, predictable Integrated Markov Model, results of the prediction of the user's next move)

After the data is processed, a suitable model has to be constructed, which includes the structure and the parameters of the model. The precision of the model depends mainly on the state transition diagram and the associated parameters. Initially, the model parameters (transition probabilities A) are assigned random values between 0 and 1 that satisfy a11 + a12 = 1 and a21 + a22 = 1 (see Figure 2), together with an initial state probability distribution π1 and π2 (π1 + π2 = 1). The Baum-Welch algorithm is used to obtain suitable values for these parameters. The parameters λ = (A, B, π) are then used to discover the hidden states as consumers browse the website, in combination with the Apriori algorithm.
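The random initialization and the hidden-state discovery described above can be illustrated as follows. This is a hedged sketch: Baum-Welch training is omitted, and the Viterbi algorithm is used here as a standard way to recover the most likely hidden state sequence from λ = (A, B, π), although the paper does not name the decoding method. The helper names, the numeric parameters and the observation sequence are hypothetical.

```python
import random


def random_distribution(n, rng):
    """Return n random positive weights normalized to sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def init_hmm(n_states, n_symbols, seed=0):
    """Randomly initialize A, B and pi so that every row sums to 1."""
    rng = random.Random(seed)
    A = [random_distribution(n_states, rng) for _ in range(n_states)]
    B = [random_distribution(n_symbols, rng) for _ in range(n_states)]
    pi = random_distribution(n_states, rng)
    return A, B, pi


def viterbi(observations, A, B, pi):
    """Return the most likely hidden state sequence for the observations."""
    n = len(pi)
    # delta[i]: probability of the best path so far that ends in state i.
    delta = [pi[i] * B[i][observations[0]] for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        step, new_delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][obs])
        backpointers.append(step)
        delta = new_delta
    # Trace the best final state back to the start.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    A, B, pi = init_hmm(n_states=2, n_symbols=3, seed=1)
    print([round(sum(row), 6) for row in A], round(sum(pi), 6))
    # Hand-picked parameters, for illustration only.
    A = [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.9, 0.1], [0.2, 0.8]]
    pi = [0.6, 0.4]
    print(viterbi([0, 0, 1, 1], A, B, pi))
```

For long sessions the products of probabilities become very small, so a practical version would work with log probabilities instead.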
Using Apriori Algorithm for clustering the item sets for Prediction

The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules.

Key Concepts:
Frequent item sets: The sets of items that have minimum support (denoted by Li for the i-th item set).
Apriori property: Any subset of a frequent item set must be frequent.
Join operation: To find Lk, a set of candidate k-item sets is generated by joining Lk-1 with itself.

The Apriori Algorithm
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: Any (k-1)-item set that is not frequent cannot be a subset of a frequent k-item set.

Pseudo-code:
Ck: candidate item set of size k
Lk: frequent item set of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in session do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

As is common in association rule mining, given a collection of item sets (for instance, sets of pages visited), the algorithm attempts to find subsets that are common to at least a minimum number C of the item sets. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

When a user browses the website, every click is recorded by the server. The IMM can immediately predict the next move by keeping in memory the pages the user has preferred to visit and predicting the next likely click from them.

Figure 4: An illustration of prediction of the user's next click
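The Apriori pseudo-code given in this section can be turned into a short runnable sketch. The version below follows the join and prune steps as described, treating each session as a set of visited pages; the function name `apriori`, the absolute `min_support` count and the example sessions are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent item set: support count} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]

    def count_support(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # the candidate is contained in the transaction
                    counts[c] += 1
        return counts

    # L1: frequent single items.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {c: n for c, n in count_support(singles).items() if n >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        candidates = set()
        # Join step: merge pairs of frequent k-item sets into (k+1)-item sets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-item subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(union, k)):
                candidates.add(union)
        current = {c: n for c, n in count_support(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sessions as sets of visited pages, invented for illustration.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/cart"},
    {"/products", "/cart"},
    {"/home", "/products", "/cart"},
]

if __name__ == "__main__":
    for itemset, support in sorted(apriori(SESSIONS, 3).items(), key=lambda kv: len(kv[0])):
        print(sorted(itemset), support)
```

With a minimum support of 3, every single page and every pair of pages is frequent in this example, while the three-page set appears in only two sessions and is rejected, at which point the loop terminates.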
Experimental Evaluation of present study

The first step is to gather log files from Web servers, classify the sessions into clusters and then make predictions using the Apriori algorithm together with the Integrated Markov Model. To evaluate the performance of the preprocessing algorithms, the web log data set from the NASA Kennedy Space Center (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) was used. The data set contains log entries collected from 01-07-1995 to 31-08-1995, occupies 373 MB of storage in uncompressed form, and has a total of 3,461,612 log entries. The performance of the preprocessing algorithms is measured by the percentage reduction obtained in the number of transactions and in the memory used. The experiments also analyze the effect of these algorithms on the prediction of the next web page.

Precision (P) and recall (R) have been used to measure the performance of information retrieval and information extraction systems. Precision deals with substitution and insertion errors, while recall deals with substitution and deletion errors. The F-measure is defined as a weighted combination of P and R.
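These measures can be written out explicitly. The sketch below assumes the usual error-count formulation, P = C / (C + S + I) and R = C / (C + S + D), where C is the number of correct predictions and S, I and D are substitution, insertion and deletion errors, and the weighted F-measure F = (1 + β²)PR / (β²P + R). The function name and the example counts are hypothetical and are not results from this study.

```python
def precision_recall_f(correct, substitutions, insertions, deletions, beta=1.0):
    """Return (precision, recall, F-measure) from error counts."""
    p_denominator = correct + substitutions + insertions
    r_denominator = correct + substitutions + deletions
    precision = correct / p_denominator if p_denominator else 0.0
    recall = correct / r_denominator if r_denominator else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # beta > 1 weights recall more heavily, beta < 1 weights precision.
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Invented counts, for illustration only.
    p, r, f = precision_recall_f(correct=8, substitutions=1, insertions=1, deletions=3)
    print(round(p, 3), round(r, 3), round(f, 3))
```

With β = 1 the F-measure reduces to the harmonic mean of P and R.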
The prediction model used in this stage was proposed by Jalali et al. (2010) and is referred to as LPA (Longest common sequence-based Prediction Algorithm).

RECOMMENDATION OF THE STUDY

When information is extracted from web logs through a Markov model, web users can be shown further web sites and information related to whatever they are searching for, and can reach the expected web page more quickly. This study should therefore be extended in the near future with applied research on the Markov model aimed at minimizing web browsing time and improving accuracy. There is wide scope for web mining researchers to enhance the model. Software developers and other computer engineering professionals should therefore apply this Markov model in data mining, supported by experimental research, to reduce the present complications and challenges.

CONCLUSION

In this study, a usage navigation pattern prediction system is analyzed. The system consists of four stages. The first stage is data collection, where log entries are collected from the web server, proxy server, client side, etc. In the second stage, the data is preprocessed and duplicate entries are removed. The result is then used by the proposed IMM to make predictions about potential users. The researcher has observed that this model will improve the overall accuracy of prediction. Precision and recall are useful measures of performance for information retrieval and extraction: precision deals with substitution and insertion errors, while recall deals with substitution and deletion errors.

REFERENCES

1. Chun-Jung Lin, Fan Wu, Han Chiu. Using Hidden Markov Model to Predict the Surfing User's Intention of Cyber Purchase on the Web.
2. Amit Pratap Singh, R. C. Jain. A Survey on Different Phases of Web Usage Mining for Anomaly User Behavior Investigation.
3. Priyanka Makkar et al. (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 04, 2010, 1233-1236.
4. A. O. Alves and F. C. Pereira. Making sense of location context. 2012.
5. R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order markov models. Journal of Artificial Intelligence Research (JAIR), 22:385–421, 2004.
6. V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, 2008.
7. J. Borges and M. Levene. Evaluating variable-length markov chain models for analysis of user web navigation sessions. IEEE Trans. Knowl. Data Eng., 2007.
8. X. Chen and X. Zhang. A popularity-based prediction model for web prefetching. Computer, 36(6):63–70, 2003.
9. F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlós. Are web users really markovian? In WWW, pages 609–618, 2012.
10. M. Deshpande and G. Karypis. Selective markov models for predicting web page accesses. ACM Trans. Internet Techn. (TOIT), 4(2):163–184, 2004.
11. X. Dongshan and S. Junyi. A new markov model for web access prediction. Computing in Science and Engineering, 4(6):34–39, 2002.
12. J. Kiseleva, H. T. Lam, M. Pechenizkiy, and T. Calders. Discovering temporal hidden contexts in web sessions for user trail prediction. In Proceedings of the 22nd International Conference on World Wide Web (Companion Volume, TempWeb@WWW'2013), pages 1067–1074.
13. P. Saravana kumar, R. Iswarya. International Journal of Latest Trends in Engineering and Technology (IJLTET), Vol. 4, Issue 1, May 2014.
