EXTRACTION OF INFORMATION FROM WEB SERVER LOGS USING
NESTED CLUSTERING TECHNIQUE
LAL SINGH TIWANA
M.Tech Student, Dept. of Computer Engineering,
Punjabi University, Patiala
NAVJOT KAUR
Asst. Prof., Dept. of Computer Engineering,
Punjabi University, Patiala
ABSTRACT: In the modern era, it is nearly impossible to imagine
human life without the internet. One can find knowledge about almost
everything on the internet, and the quantity of information available
is enormous. Almost all organizations use the internet for a variety
of tasks, e.g. e-commerce. The main problem that the common user
faces is finding the relevant information in the huge amount of
information available. Data is present in unstructured or
semi-structured form, so traditional data mining techniques alone are
not of much use in obtaining the required knowledge. For such data,
web usage mining techniques are quite useful for analyzing the
information. Web usage mining is the process of extracting useful
knowledge from server logs. Recommendation and prediction from the
extracted information is one of the most useful applications of web
usage mining. This paper continues the line of research on web access
log analysis to analyze usage patterns and the features of users'
behavior. In this paper a recommendation system is proposed that
recommends to users the highly visited sites of the category they
select. The system works in two modes: offline mode and online mode.
In offline mode two tasks are performed, i.e. preprocessing of the
log file and discovery of patterns using two-level clustering. In
online mode, the recommendation engine recommends the highly visited
sites from the selected category.
Keywords: Web usage mining, two level
clustering, recommendation system.
1. INTRODUCTION
Web usage mining is also called web log mining. It is a web mining
technique based on the discovery and analysis of web usage patterns
from web logs. Web server logs, proxy server logs, web browser logs,
etc. are considered web logs. Web logs allow website administrators
to identify the users of their websites, their location, their
browsing patterns, etc., i.e. the logs store information such as the
visitor's IP address, referring website, timestamp, browser used and
platform used. The interesting information generated from these web
logs helps website administrators serve the needs of the users
visiting their websites effectively and efficiently. Web usage mining
focuses on two different points: how website administrators want
their websites to be used and how users actually use these websites.
The deviation of the actual use from the expected use can then be
reduced by reorganizing and personalizing the websites according to
the actual needs of the users.
Recommendation and prediction from the extracted information is one
of the most useful applications of web usage mining. Users are often
unable to find exactly the information they need. A recommendation
system helps users in such situations. In this paper a recommendation
system is proposed that recommends to users the highly visited sites
of the category they select.
The web log file used for the experiments with our proposed system is
the Cyberoam server log of an educational institute, Punjabi
University, Patiala. The log file contains the records of the users
and has the following fields: Time, User Name, User Group, Domain,
URL, Category and IP Address.
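The seven fields listed above can be held in a small record type. The
following is only a sketch: the paper does not describe the on-disk
layout of the Cyberoam log, so the comma-separated format, the field
order, the names LogRecord and read_log, and the sample rows are all
assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """One log row with the seven fields named in the text (hypothetical type)."""
    time: str
    user_name: str
    user_group: str
    domain: str
    url: str
    category: str
    ip_address: str


def read_log(text: str) -> list[LogRecord]:
    """Parse comma-separated log text (an assumed layout) into LogRecord objects.

    Rows that do not have exactly seven fields are skipped.
    """
    reader = csv.reader(io.StringIO(text))
    return [LogRecord(*row) for row in reader if len(row) == 7]


# Made-up sample rows, for illustration only.
SAMPLE = (
    "10:01:22,user1,students,example.edu,"
    "http://example.edu/index.html,Education,10.0.0.5\n"
    "10:01:25,user2,staff,example.com,"
    "http://example.com/logo.png,Information Technology,10.0.0.9\n"
)

records = read_log(SAMPLE)
print(records[0].category)  # Education
```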
2. RELATED WORK
Hongzhou Sha and Tingwen Liu (2013) propose a method named
EPLogCleaner that filters out many irrelevant items based on the
common prefix of their URLs. EPLogCleaner consists of three stages.
The first stage filters out files with suffixes such as .jpg, .png
and .gif, i.e. multimedia files. In the second stage, untraceable
requests made without human operation at night are filtered out, and
in the third stage, requests automatically generated by the computer
in the daytime are filtered out.
Mehrdad Jalali (2008) proposes a system for online prediction in web
usage mining. In this system a novel approach is used in which users'
browsing patterns are classified, using the LCS algorithm, to predict
their future behaviour. The proposed system improves the accuracy of
the classification.
Nayana Mariya Varghese (2012) presents a method based on fuzzy logic.
In the proposed method the fuzzy C-means algorithm is used for
clustering in the web usage mining process. To obtain lower
inter-cluster similarity, i.e. cluster optimization, the Fuzzy
Cluster Chase algorithm is used.
Theint Theint Aye (2011) mainly focuses on the data preprocessing
stage of the first phase of web usage mining. The work applies field
extraction and data cleaning algorithms. The field extraction
algorithm separates the fields from each single line of the log file.
The data cleaning algorithm eliminates unnecessary or inconsistent
items in the analyzed data.
3. METHODOLOGY
The proposed system works in two modes: offline mode and online mode.
In offline mode two tasks are performed, i.e. preprocessing of the
log file and discovery of patterns using two-level clustering. In
online mode, the recommendation engine recommends the highly visited
sites from the selected category. The proposed framework is shown in
figure 1.
[Figure: Web Log File -> Pre Processing (Log File Cleaning, Robots
Cleaning, Identifying Multimedia Requests) -> Pattern Discovery
(First Level Clustering, Second Level Clustering) -> Recommendation
Engine]
Fig.1 Proposed Framework
3.1 Data Pretreatment
In the proposed work, pretreatment includes log file cleaning, robot
request removal and identification of multimedia files.
Log file cleaning: In this step, if the 'category' field of a log
entry contains the value 'IPAddress', that entry is deleted from the
log file.
Robots cleaning: In the proposed model, web robots' requests are
identified by the suffix "robots.txt" in the URL field. These entries
are deleted from the log file, and the total number of robot requests
deleted is counted.
Identifying multimedia requests: Requests that contain the suffix
'.gif', '.png' or '.jpg' in their URL field are multimedia files. All
these requests are copied to a separate variable to keep a record of
the multimedia files.
The algorithm for performing preprocessing is as follows:
Input: Log Table (LT)
Output: Summarized Log Table (SLT)
'*' = access pages consisting of embedded objects (i.e. .jpg, .gif,
.png, robots.txt)
Begin
  Read records from the log table.
  Set countA = 0, countB = 0
  For each record perform the following steps:
    Read fields (Category, URL_Link)
    If Category = 'IPAddress'
      Then delete the whole row from the log table.
    Else if suffix.URL_Link = *robots.txt
      Then delete the whole row from the log table
      and increase countA by 1.
    Else if suffix.URL_Link = {*.jpg, *.png, *.gif}
      Then copy the whole row into another table,
      remove suffix.URL_Link from the log table
      and increase countB by 1.
    End if
    Go to the next record.
End.
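The preprocessing steps above can be sketched in Python as follows.
This is a minimal illustration, not the authors' implementation: the
function name preprocess, the dictionary keys 'category' and 'url',
the case-insensitive suffix match and the sample rows are assumptions.
The sketch also assumes that multimedia rows are copied to a separate
list but still kept in the cleaned table, a reading that matches the
reported request counts (4982 - 192 - 79 = 4711).

```python
MULTIMEDIA_SUFFIXES = (".jpg", ".png", ".gif")


def preprocess(log_table):
    """Apply the three preprocessing checks to a list of row dicts.

    Each row needs a 'category' and a 'url' key (assumed names).
    Returns (cleaned, multimedia, count_a, count_b).
    """
    cleaned, multimedia = [], []
    count_a = 0  # robot requests deleted
    count_b = 0  # multimedia requests recorded
    for row in log_table:
        url = row["url"].lower()  # assumption: suffix match ignores case
        if row["category"] == "IPAddress":
            continue  # log file cleaning: drop the row
        if url.endswith("robots.txt"):
            count_a += 1
            continue  # robots cleaning: drop the row and count it
        if url.endswith(MULTIMEDIA_SUFFIXES):
            multimedia.append(row)  # keep a separate record of the request
            count_b += 1
        # Assumption: multimedia rows stay in the cleaned table.
        cleaned.append(row)
    return cleaned, multimedia, count_a, count_b


# Made-up sample rows, for illustration only.
sample = [
    {"category": "IPAddress", "url": "http://10.0.0.1/"},
    {"category": "Search Engines", "url": "http://example.com/robots.txt"},
    {"category": "Education", "url": "http://example.edu/logo.PNG"},
    {"category": "Education", "url": "http://example.edu/index.html"},
]
cleaned, multimedia, count_a, count_b = preprocess(sample)
print(len(cleaned), len(multimedia), count_a, count_b)  # 2 1 1 1
```

If multimedia rows should instead be dropped from the cleaned table,
the final append can be moved into an else branch of the multimedia
check.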
3.2 Pattern detection phase
In the proposed model we use a two-level clustering technique to
discover the different existing patterns.
First level clustering: In the proposed model, the first level of
clustering is done on the 'category' field of the log file. The steps
to perform this level of clustering are as follows:
1. Find the unique categories in the 'category' field of the log
   table.
2. Select the 1st unique category.
   a. Compare it with each record of the 'category' field.
   b. For the records in which a match occurs, place the whole
      corresponding rows in a separate cluster.
3. Repeat the above step for each unique category.
4. Select the first cluster formed in step 2.
   a. Find the unique websites in the 'domain' field of that cluster.
   b. Calculate the frequency of each unique website.
5. Repeat the above step for each cluster.
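The first-level steps above amount to grouping rows by category and
counting how often each domain occurs within a group. A minimal Python
sketch, assuming each row is a dictionary with 'category' and 'domain'
keys (the function name and the sample rows are hypothetical):

```python
from collections import Counter, defaultdict


def first_level_clusters(log_table):
    """Group rows by category, then count visits per domain in each group.

    Returns a dict mapping category -> Counter of domain frequencies.
    """
    clusters = defaultdict(list)
    for row in log_table:
        clusters[row["category"]].append(row)  # one cluster per category
    return {
        category: Counter(row["domain"] for row in rows)
        for category, rows in clusters.items()
    }


# Made-up sample rows, for illustration only.
rows = [
    {"category": "Education", "domain": "example.edu"},
    {"category": "Education", "domain": "example.edu"},
    {"category": "Education", "domain": "example.org"},
    {"category": "News", "domain": "example.com"},
]
freqs = first_level_clusters(rows)
print(freqs["Education"]["example.edu"])  # 2
```

A single pass with a dictionary gives the same clusters as comparing
each unique category against every record, without rescanning the
table once per category.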
Second level of clustering: At this level clustering is performed
within each category, based on the frequency with which users request
a particular web site. Within each category three clusters are
formed, i.e. highly visited, medium visited and low visited web
sites.
To perform this level of clustering within each category we use two
threshold values, T1 and T2, and an auxiliary value V.
Algorithm for performing second level clustering:
Input: clusters for unique categories formed at the first level.
Output: three further clusters formed within each input cluster.
Begin
  a. Take the cluster of the 1st category.
  b. Calculate
       V = (max - min)/3
       T1 = min + round(V)
       T2 = T1 + round(V)
     where 'max' and 'min' are the maximum and minimum frequencies of
     unique websites in that category.
  c. For a unique website with frequency freq:
       if min <= freq <= T1
         then place that web site into the low visited cluster
       else if T1 < freq <= T2
         then place that web site into the medium visited cluster
       else if T2 < freq <= max
         then place that web site into the highly visited cluster
       end if
  d. Repeat step c for each unique website.
End.
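The thresholding above can be illustrated with a short Python sketch
for a single category. The function name, the cluster labels and the
sample frequencies are assumptions chosen for illustration. Note that
Python's built-in round() rounds halves to the nearest even integer,
which may differ from the rounding used by the authors.

```python
def second_level_clusters(freqs):
    """Split one category's sites into low / medium / high visited clusters.

    freqs maps website -> visit frequency and must be non-empty.
    Returns (clusters, t1, t2).
    """
    lo, hi = min(freqs.values()), max(freqs.values())
    v = (hi - lo) / 3
    t1 = lo + round(v)  # upper bound of the low visited cluster
    t2 = t1 + round(v)  # upper bound of the medium visited cluster
    clusters = {"low": [], "medium": [], "high": []}
    for site, freq in freqs.items():
        if freq <= t1:
            clusters["low"].append(site)
        elif freq <= t2:
            clusters["medium"].append(site)
        else:
            clusters["high"].append(site)
    return clusters, t1, t2


# Made-up frequencies: min = 1, max = 31, so V = 10, T1 = 11, T2 = 21.
sample = {"a.com": 1, "b.com": 11, "c.com": 12,
          "d.com": 21, "e.com": 22, "f.com": 31}
clusters, t1, t2 = second_level_clusters(sample)
print(t1, t2)  # 11 21
print(clusters["high"])  # ['e.com', 'f.com']
```

With these sample values the boundary sites b.com (frequency 11) and
d.com (frequency 21) fall into the low and medium clusters
respectively, because both upper bounds are inclusive.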
3.3 Recommendation Engine
The main objective of this engine is to recommend to users a list of
highly visited sites. The engine works at the server end. The user is
given the option to choose a category from a drop-down list. This
list contains the unique categories found from the log data in the
first level of clustering. When the user selects a particular
category, the recommendation engine selects the top visited sites
from the three clusters formed under that category and recommends
them to the user.
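One possible reading of this selection step is sketched below in
Python. It assumes the second-level result is stored as a nested
dictionary (category -> cluster label -> {website: frequency}) and
that sites are taken from the highly visited cluster first, then the
medium and low visited clusters, in descending order of frequency. The
function name, the data layout and the sample values are hypothetical;
the default of 10 follows the top-10 list reported in the experiments.

```python
def recommend(clustered, category, n=10):
    """Return up to n sites for a category, most visited first.

    clustered maps category -> {'high' | 'medium' | 'low'} -> {site: freq}.
    An unknown category yields an empty list.
    """
    levels = clustered.get(category, {})
    ranked = []
    for level in ("high", "medium", "low"):
        sites = levels.get(level, {})
        # Within a cluster, order sites by descending frequency.
        ranked.extend(sorted(sites, key=sites.get, reverse=True))
    return ranked[:n]


# Made-up clusters, for illustration only.
clustered = {
    "Information Technology": {
        "high": {"b.com": 22, "a.com": 31},
        "medium": {"c.com": 12},
        "low": {"d.com": 1},
    }
}
print(recommend(clustered, "Information Technology", n=3))
# ['a.com', 'b.com', 'c.com']
```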
4. EXPERIMENTAL RESULTS
In order to evaluate the proposed system, experiments were carried
out on the log file of the Cyberoam server of Punjabi University,
Patiala.
The log file initially consists of 4982 entries. After processing the
log file in the proposed system we get the following information:
Initial records   Unwanted requests   Robot requests   Multimedia requests
4982              192                 79               1368
Table 4.1 Information obtained after preprocessing
After preprocessing, the number of requests remaining is 4711, i.e.
the 4982 initial records minus the 192 unwanted requests and the 79
robot requests; the multimedia requests are not subtracted from this
count.
After preprocessing, two-level clustering is performed in the pattern
discovery phase. At the first level the unique categories are found.
In the log file used, 31 unique categories are found, which are shown
in figure 4.1.
Fig 4.1 Unique categories
Then within each category the second-level clustering is done based
on the frequency of unique web sites, as shown in figure 4.2.
Fig 4.2 Second level clustering
Figure 4.2 shows the three clusters formed under the Information
Technology category from our log file.
The recommendation engine then collects the top 10 websites from the
three clusters formed by the second-level clustering of the category
selected by the user and recommends them to the user. Figure 4.3
shows the sites recommended to the user from the Information
Technology category by the recommendation engine.
Fig 4.3 Recommendation engine
5. CONCLUSION
This online recommendation system suggests highly visited sites to
users based on patterns discovered from the history of all users
rather than from the history of a single user. The proposed system
allows users to select a category from the available list and
provides them with a list of the top 10 web sites under that
category. The proposed system also helps server administrators to
easily analyse the usage of different websites from the second level
of clustering.
6. FUTURE SCOPE
Future work includes considering the time spent by a user on a
particular web site, along with the visit frequency of web sites, to
rank the web sites in the clusters.
REFERENCES
1. Hongzhou Sha, Tingwen Liu (2013), “EPLogCleaner: Improving Data
Quality of Enterprise Proxy Logs for Efficient Web Usage Mining”,
Information Technology and Quantitative Management, ITQM 2013.
2. Mehrdad Jalali (2008), “A Web Usage Mining Approach Based on LCS
Algorithm in Online Predicting Recommendation Systems”, 12th
International Conference Information Visualisation.
3. Nayana Mariya Varghese (2012), “Cluster Optimization for Enhanced
Web Usage Mining using Fuzzy Logic”, 2012 World Congress on
Information and Communication Technologies.
4. Theint Theint Aye (2011), “Web Log Cleaning for Mining of Web
Usage Patterns”, IEEE 2011.
5. Sneha Y.S, Madhura Prakash (2011), “An Online Recommendation
System Based On Web Usage Mining and Semantic Web Using LCS
Algorithm”, IEEE 2011.
6. Hiral Y. Modi, Meera Narvekar (2015), “Enhancement of online web
recommendation system using a hybrid clustering and pattern matching
approach”, International Conference on Nascent Technologies in the
Engineering Field (ICNTE-2015).