EXTRACTION OF INFORMATION FROM WEB SERVER LOGS USING NESTED CLUSTERING TECHNIQUE

LAL SINGH TIWANA
M.Tech Student, Dept. of Computer Engineering, Punjabi University, Patiala

NAVJOT KAUR
Asst. Prof., Dept. of Computer Engineering, Punjabi University, Patiala

ABSTRACT: In the modern era, it is hard to imagine human life without the internet. One can find information about almost anything on the internet, and the quantity of information available is enormous. Almost all organizations use the internet for a variety of tasks, e.g. e-commerce. The main problem that a common user faces is finding the relevant information in this huge amount of available information. The data is present in unstructured or semi-structured form, so traditional data mining techniques alone are not very useful for obtaining the required knowledge. For unstructured data, web usage mining techniques are quite useful for analyzing the information. Web usage mining is the process of extracting useful knowledge from server logs. Recommendation and prediction from the extracted information is one of the most useful applications of web usage mining. This paper continues the line of research on web access log analysis to analyze usage patterns and the features of user behavior. In this paper a recommendation system is proposed that recommends to users the highly visited sites of the category they select. The system works in two modes: offline mode and online mode. In offline mode two tasks are performed: preprocessing of the log file and discovery of patterns using two-level clustering. In online mode, the recommendation engine recommends the highly visited sites from the selected category.

Keywords: Web usage mining, two-level clustering, recommendation system.

1. INTRODUCTION

Web usage mining is also called web log mining.
It is a web mining technique based on the discovery and analysis of web usage patterns from web logs. Web server logs, proxy server logs, web browser logs, etc. are all considered web logs. Web logs allow website administrators to identify the users of their websites, their location, their browsing patterns, etc., because the logs store information such as the visitor's IP address, referring website, timestamp, browser used and platform used. The information extracted from these web logs helps website administrators serve the needs of their visitors effectively and efficiently.

Web usage mining focuses on two different points: how the website administrators want their websites to be used, and how the users actually use these websites. The deviation of the actual use from the expected use can then be reduced by reorganizing and personalizing the websites according to the actual needs of the users.

Recommendation and prediction from the extracted information is one of the most useful applications of web usage mining. Users are often unable to find exactly the information they need, and a recommendation system helps them in such situations. In this paper a recommendation system is proposed that recommends to users the highly visited sites of the category they select.

The web log file used for the experiments is the log file of the Cyberoam server of an educational institute, Punjabi University, Patiala. The log file contains the records of the users and has the following fields: Time, User Name, User Group, Domain, URL, Category and IP Address.

2. RELATED WORK

Hongzhou Sha and Tingwen Liu (2013) propose a method named EPLogCleaner that can filter out plenty of irrelevant items based on the common prefix of their URLs. EPLogCleaner consists of three stages. The first stage filters out files with suffixes such as .jpg, .png and .gif, i.e. multimedia files.
In the second stage, untraceable requests made at night without human operation are filtered out, and in the third stage requests generated automatically by the computer in the daytime are filtered out.

Mehrdad Jalali (2008) proposes a system for online prediction in web usage mining. The system uses a novel approach in which users' browsing patterns are classified with the LCS algorithm in order to predict their future behaviour. The proposed system improves the accuracy of the classification.

Nayana Mariya Varghese (2012) presents a method based on fuzzy logic. In the proposed method the fuzzy C-means algorithm is used for clustering in the web usage mining process. The Fuzzy Cluster Chase algorithm is used for cluster optimization, i.e. to obtain lower inter-cluster similarity.

Theint Theint Aye (2011) mainly focuses on the data preprocessing stage, the first phase of web usage mining, and presents field extraction and data cleaning algorithms. The field extraction algorithm separates the fields of a single line of the log file. The data cleaning algorithm eliminates unnecessary or inconsistent items from the analyzed data.

3. METHODOLOGY

The proposed system works in two modes: offline mode and online mode. In offline mode two tasks are performed: preprocessing of the log file and discovery of patterns using two-level clustering. In online mode, the recommendation engine recommends the highly visited sites from the selected category. The proposed framework is shown in figure 1.

Fig. 1 Proposed framework: Web Log File -> Preprocessing (log file cleaning, robots cleaning, identifying multimedia requests) -> Pattern Discovery (first-level clustering, second-level clustering) -> Recommendation Engine

3.1 Data Pretreatment

In the proposed work, pretreatment includes log file cleaning, robots cleaning and identification of multimedia requests.
Log file cleaning: In this step, if the 'category' field of a log entry contains 'IPAddress', the entry is deleted from the log file.

Robots cleaning: In the proposed model, web robots' requests are identified by the suffix "robots.txt" in the URL field. These entries are deleted from the log file and the total number of deleted robot requests is counted.

Identifying multimedia requests: Requests whose URL field has the suffix '.gif', '.png' or '.jpg' are requests for multimedia files. All these requests are copied to a separate variable to keep a record of the multimedia files.

The algorithm for performing preprocessing is as follows:

Input: Log Table (LT)
Output: Summarized Log Table (SLT)
'*' = access pages that consist of embedded objects (i.e. .jpg, .gif, .png, robots.txt)

Begin
  1. Read records from the log table.
  2. Set countA = 0, countB = 0.
  3. For each record perform the following steps:
     a. Read fields (Category, URL_Link).
     b. If Category = 'IPAddress' then delete the whole row from the log table.
     c. Else if suffix.URL_Link = *robots.txt then delete the whole row from the log table and increase countA by 1.
     d. Else if suffix.URL_Link = {*.jpg, *.png, *.gif} then copy the whole row into another table, remove suffix.URL_Link from the log table and increase countB by 1.
     e. Else go to the next record.
End.

3.2 Pattern detection phase

In the proposed model we use a two-level clustering technique to discover the different existing patterns.

First level clustering: In the proposed model, the first level of clustering is done on the category field of the log file. The steps to perform this level of clustering are as follows:

  1. Find the unique categories in the 'category' field of the log table.
  2. Select the first unique category and compare it with the 'category' field of each record. Place the whole rows of the records in which a match occurs in a separate cluster.
  3. Repeat the above step for each unique category.
  4. Select the first cluster formed in step 2.
  5. Find the unique websites in the 'domain' field of that cluster.
  6. Calculate the frequency of each unique website.
  7. Repeat steps 4 to 6 for each cluster.

Second level clustering: At this level, clustering is performed within each category, based on the frequency with which users request a particular website. Within each category three clusters are formed: highly visited, medium visited and low visited websites. To perform this level of clustering within each category we use two threshold values, T1 and T2, and a step value V.

The algorithm for performing second level clustering is as follows:

Input: clusters for the unique categories formed at the first level.
Output: three further clusters within each cluster.

Begin
  a. Take the cluster of the first category.
  b. Calculate
       V  = (max - min) / 3
       T1 = min + round(V)
       T2 = T1 + round(V)
     where 'max' and 'min' are the maximum and minimum frequencies of the unique websites in that category.
  c. If min <= freq <= T1 then place the website into the low visited cluster.
     Else if T1 < freq <= T2 then place the website into the medium visited cluster.
     Else if T2 < freq <= max then place the website into the highly visited cluster.
  d. Repeat step c for each unique website.
End.

3.3 Recommendation Engine

The main objective of this engine is to recommend to users a list of highly visited sites. The engine works at the server end. The user is given the option to choose a category from a drop-down list. This list contains the unique categories found in the log data during the first level of clustering. When the user selects a particular category, the recommendation engine selects the top visited sites from the three clusters formed under that category and recommends them to the user.
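The preprocessing, two-level clustering and recommendation steps of this section can be sketched in Python as below. This is only an illustrative sketch and not the implementation used for the experiments. The dictionary keys `category`, `domain` and `url`, the function names and the case-insensitive suffix match are assumptions made for the example. Multimedia rows are copied but kept in the cleaned table, which is one reading of the preprocessing algorithm that agrees with the record counts reported in Section 4. Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding used in the original system.

```python
from collections import Counter, defaultdict

MEDIA_SUFFIXES = (".jpg", ".png", ".gif")


def preprocess(records):
    """Return (cleaned rows, copied multimedia rows, robot request count)."""
    kept, multimedia, robots = [], [], 0
    for rec in records:
        url = rec["url"].lower()  # assumption: suffixes are matched case-insensitively
        if rec["category"] == "IPAddress":
            continue  # log file cleaning: drop the unwanted entry
        if url.endswith("robots.txt"):
            robots += 1  # robots cleaning: drop the entry and count it
            continue
        if url.endswith(MEDIA_SUFFIXES):
            multimedia.append(rec)  # keep a separate record of multimedia requests
        kept.append(rec)
    return kept, multimedia, robots


def first_level(records):
    """First level: one cluster per category, holding per-website frequencies."""
    clusters = defaultdict(Counter)
    for rec in records:
        clusters[rec["category"]][rec["domain"]] += 1
    return clusters


def second_level(freqs):
    """Second level: split one category's websites into low/medium/high."""
    lo, hi = min(freqs.values()), max(freqs.values())
    v = (hi - lo) / 3
    t1 = lo + round(v)
    t2 = t1 + round(v)
    levels = {"high": [], "medium": [], "low": []}
    for site, freq in freqs.items():
        if freq <= t1:
            levels["low"].append(site)
        elif freq <= t2:
            levels["medium"].append(site)
        else:
            levels["high"].append(site)
    return levels


def recommend(clusters, category, top_n=10):
    """Online mode: top_n sites of a category, most visited cluster first."""
    freqs = clusters.get(category)
    if not freqs:
        return []  # unknown category: nothing to recommend
    levels = second_level(freqs)
    ranked = []
    for level in ("high", "medium", "low"):
        ranked += sorted(levels[level], key=lambda site: -freqs[site])
    return ranked[:top_n]
```

In offline mode one would call `preprocess` and `first_level` once on the log table. In online mode `recommend(clusters, selected_category)` would then be called for the category the user picks from the drop-down list.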
4. EXPERIMENTAL RESULTS

In order to evaluate the proposed system, experiments were carried out on the log file of the Cyberoam server of Punjabi University, Patiala. The log file initially consists of 4982 entries. After processing the log file in the proposed system we get the following information:

Initial records   Unwanted requests   Robot requests   Multimedia requests
4982              192                 79               1368

Table 4.1 Information obtained after preprocessing

After preprocessing, the number of requests remaining is 4711, i.e. the initial 4982 records minus the 192 unwanted requests and the 79 robot requests. The 1368 multimedia requests are recorded separately and are not subtracted from this total.

After preprocessing, two-level clustering is performed in the pattern discovery phase. In the first level, the unique categories are found. The log file used contains 31 unique categories, which are shown in figure 4.1.

Fig 4.1 Unique categories

Then, within each category, the second level of clustering is done based on the frequency of the unique websites, as shown in figure 4.2.

Fig 4.2 Second level clustering

Figure 4.2 shows the three clusters formed under the Information Technology category of our log file. The recommendation engine then collects the top 10 websites from the three clusters formed by the second level of clustering under the category selected by the user and recommends them to the user. Figure 4.3 shows the sites recommended to the user from the Information Technology category by the recommendation engine.

Fig 4.3 Recommendation engine

5. CONCLUSION

This online recommendation system suggests highly visited sites to users based on the patterns discovered from the history of all users, rather than on the history of a single user. The proposed system allows users to select a category from the available list and provides them with the list of the top 10 websites under that category. The second level of clustering also helps the server administration to easily analyse the usage of different websites.

6. FUTURE SCOPE

Future scope includes considering the time spent by a user on a particular website, along with the frequency of websites, to rank the websites in the clusters.

REFERENCES

1. Hongzhou Sha, Tingwen Liu (2013), "EPLogCleaner: Improving Data Quality of Enterprise Proxy Logs for Efficient Web Usage Mining", Information Technology and Quantitative Management, ITQM 2013.
2. Mehrdad Jalali (2008), "A Web Usage Mining Approach Based on LCS Algorithm in Online Predicting Recommendation Systems", 12th International Conference Information Visualisation.
3. Nayana Mariya Varghese (2012), "Cluster Optimization for Enhanced Web Usage Mining using Fuzzy Logic", 2012 World Congress on Information and Communication Technologies.
4. Theint Theint Aye (2011), "Web Log Cleaning for Mining of Web Usage Patterns", IEEE 2011.
5. Sneha Y.S, Madhura Prakash (2011), "An Online Recommendation System Based On Web Usage Mining and Semantic Web Using LCS Algorithm", IEEE 2011.
6. Hiral Y. Modi, Meera Narvekar (2015), "Enhancement of online web recommendation system using a hybrid clustering and pattern matching approach", International Conference on Nascent Technologies in the Engineering Field (ICNTE-2015).
