A Session Identification Algorithm Based on Frame Page and Page Threshold
Fang Yuankang1,2, Huang Zhiqiu1
1. Information Science and Technology School, Nanjing University of Aeronautics and Astronautics, Nanjing, China
2. Computer Department, Chizhou College, Chizhou, China
E-mail: [email protected]
Abstract--Session identification is an important step in the data preprocessing stage of web log mining. To overcome the defects of traditional session identification, an improved session identification algorithm is proposed. After specific users are identified, the large number of frame pages is filtered out. A reasonable access time threshold is then set for each page according to the content of the page and the overall site structure, and the user session sets are identified with this threshold. Finally, the algorithm is compared experimentally with the traditional methods of session identification, and the results show that it is more reasonable and more effective.
II. IMPROVED SESSION IDENTIFICATION
The improved session identification has two steps. First, after the data have been cleaned, the frame pages are filtered. Second, the content of each page is combined with the overall site structure, that is, the importance of each page is taken into account, and a session set that is close to the actual sessions is constructed.
A. Data processing before the session identification
The aim of data filtering is to remove irrelevant log entries: failed requests, requests issued by automatic search agents or spiders, and requests for image and video files. User identification means identifying each specific user; the IP address, the agent and the temporary information are used together to mark a single user. After the users are identified, the main task is to filter the large number of frame pages.
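The filtering and user identification steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `LogEntry` fields, the list of media extensions, the substrings used to spot spiders, and the choice of treating only 2xx status codes as successful are all assumptions made for the example, and only the IP address and the agent are used as the user key.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed for illustration: extensions treated as image/video requests.
MEDIA_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".avi", ".wmv", ".swf")
# Assumed for illustration: substrings that mark automatic agents.
ROBOT_MARKERS = ("bot", "spider", "crawler")


@dataclass(frozen=True)
class LogEntry:
    ip: str
    agent: str
    time: float   # request time in seconds
    url: str
    status: int   # HTTP status code


def is_relevant(entry: LogEntry) -> bool:
    """Keep successful page requests that do not come from robots."""
    if not 200 <= entry.status < 300:
        return False                      # failed request
    if any(m in entry.agent.lower() for m in ROBOT_MARKERS):
        return False                      # automatic agent or spider
    path = entry.url.lower().split("?", 1)[0]
    return not path.endswith(MEDIA_EXTENSIONS)


def identify_users(entries):
    """Group relevant entries by (IP, agent), sorted by time per user."""
    users = defaultdict(list)
    for e in entries:
        if is_relevant(e):
            users[(e.ip, e.agent)].append(e)
    for requests in users.values():
        requests.sort(key=lambda e: e.time)
    return dict(users)
```

Two requests from the same IP address but with different agents end up under different keys, which matches the idea of combining several fields to mark one user.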
The HTML specification supports multi-window pages through the "Frame" tag. Each page loaded into a window corresponds to a URL. Note that a subframe page may itself be a frame page that contains its own sub-windows. For example, if A is a frame page and B and C are subframes of A, then B can be both a subframe page (of A) and a frame page (for its own subframes). A frame page and its subframe pages always appear in a session at the same time. In mining tests on the Chizhou College Web server logs, the many frame and subframe pages made the identified sessions deviate from the true ones. The aim of Web mining is to discover unknown access patterns of the users, whereas the relationship between a frame page and its subframe pages is already known, so the impact of subframes on session identification should be eliminated. Because a frame page and its subframe pages together form one multi-window page, they should be considered as a whole: a user's request for them is a single request for the multi-window page. Treating them as a whole removes their influence on session identification and improves the interestingness of the Web mining results.
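Treating a frame page and its subframes as one request can be sketched as follows. The sketch assumes that a mapping from each subframe URL to its parent frame URL (called `frame_of` here, a hypothetical name) is available from the site structure; how that mapping is obtained is not described in the text. The page D is also hypothetical and only illustrates a nested frame.

```python
def top_frame(url, frame_of):
    """Follow subframe -> frame links up to the outermost frame page."""
    seen = set()
    while url in frame_of and url not in seen:
        seen.add(url)          # guard against cyclic mappings
        url = frame_of[url]
    return url


def merge_frames(urls, frame_of):
    """Replace each subframe by its outermost frame page and collapse
    consecutive repeats, so a multi-window page counts as one request."""
    merged = []
    for url in urls:
        page = top_frame(url, frame_of)
        if not merged or merged[-1] != page:
            merged.append(page)
    return merged


# B and C are subframes of A; B is itself a frame page with subframe D.
frame_of = {"B": "A", "C": "A", "D": "B"}
print(merge_frames(["A", "B", "C", "D", "X", "A", "C"], frame_of))
# prints ['A', 'X', 'A']
```

Collapsing only consecutive repeats keeps a later, separate visit to the same multi-window page as its own request.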
Keywords--Web mining; data preprocessing; threshold; frame page; session identification
I. INTRODUCTION
Web log mining [1] discovers the patterns of users' access to web pages by mining web logs. In the process, knowledge such as the users' fields of interest, their degree of interest and their visiting habits can be extracted. This is useful strategic information for site designers and managers, who can use it to optimize the site structure, develop personalized services and manage users. The most important and time-consuming step in web log mining is session identification, which is part of web log preprocessing. A user session is the set of page accesses made by one user, and it may cover more than one web server. The aim of session identification is to divide each user's page accesses into separate sessions. There are two main methods of session identification: the timeout method [2] and the maximal forward reference method [3].
Traditional session identification has two defects. On the one hand, records that belong to the same session may be divided into different sessions, and records that belong to different sessions may be merged into the same session. On the other hand, the number of effective pages counted is larger than the actual number because frame pages are included, which decreases the efficiency of session identification. As a result, the session set generated by traditional session identification differs considerably from the actual sessions, and these errors seriously affect the authenticity of the result. In this paper, an improved session identification algorithm based on frame pages and the importance of pages is proposed. Experiments show that the algorithm makes the identified sessions more authentic and raises the efficiency of session identification.
B. Constructing the threshold G of page access time according to the importance of the page
When constructing the threshold G of page access time, it is necessary to consider both the importance of the page and the effect that frame pages have on G.
978-1-4244-5539-3/10/$26.00 ©2010 IEEE
Definition 1: The linking content ratio (RLCR) of a page is the ratio of the number of links of the page to the size of its content. If the size of the page is SDS, RLCR is calculated as follows:
$$R_{LCR} = (L_I + L_O) / S_{DS} \qquad (1)$$
The calculating steps are as follows:
First, from the log file statistics, obtain the access time set St: identify the users, classify the log file by user, and compute the access time t of each visit.
Then, from St and the set of influence factors E, obtain the page access time threshold set by applying formula (4) to each page.
Finally, reclassify each user's session log file set according to the threshold set.
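The three steps can be sketched as follows. Several details are not fixed by the text and are assumptions of this sketch: t is taken as the mean observed viewing time of a page, the coefficient D of formula (4) is set to 1.0 as a placeholder, pages without a known RLCR are given RLCR = 0, and a new session starts when the time spent on a page exceeds that page's threshold G. The function names are hypothetical.

```python
import math
from collections import defaultdict


def influence_factor(r_lcr):
    """Formula (3): E = 1 - exp(-sqrt(sqrt(RLCR)))."""
    return 1.0 - math.exp(-math.sqrt(math.sqrt(r_lcr)))


def page_thresholds(visits, r_lcr, d=1.0):
    """Steps 1-2: build St (mean viewing time per page), apply formula (4).

    visits: one list per user of (time_in_seconds, url), sorted by time.
    r_lcr:  mapping url -> RLCR value of the page.
    d:      coefficient D of formula (4); 1.0 is an assumed placeholder.
    """
    durations = defaultdict(list)
    for requests in visits:
        for (t0, url), (t1, _) in zip(requests, requests[1:]):
            durations[url].append(t1 - t0)
    return {
        url: d * (sum(ts) / len(ts)) * (1.0 + influence_factor(r_lcr.get(url, 0.0)))
        for url, ts in durations.items()
    }


def split_sessions(requests, thresholds, default=600.0):
    """Step 3: cut a user's request list where the time spent on a page
    exceeds that page's threshold G (default used for unseen pages)."""
    sessions, current = [], []
    for i, (t, url) in enumerate(requests):
        current.append(url)
        if i + 1 < len(requests):
            gap = requests[i + 1][0] - t
            if gap > thresholds.get(url, default):
                sessions.append(current)
                current = []
    if current:
        sessions.append(current)
    return sessions
```

For example, with thresholds of 60 s for pages a and c and 120 s for page b, the requests `[(0, "a"), (30, "b"), (500, "c"), (520, "a")]` are cut after b, because the user stayed 470 s on that page.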
Here the link-in count, recorded as LI, is the number of pages that link to the page; the link-out count, recorded as LO, is the number of links contained in the page.
For example, if the size of a page, SDS, is 3 Kbyte, its LI is 2 and its LO is 4, then RLCR is 2. Usually the link-in is more important than the link-out, so a weighting adjustment is needed. In this paper, the weighting ratio of link-in to link-out is assumed to be the golden ratio. The calculation formula of RLCR is adjusted to the following:
$$R_{LCR} = 2(0.618 L_I + 0.382 L_O) / S_{DS} \qquad (2)$$
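As a worked check of formulas (1) and (2), the example values from the text (SDS = 3 Kbyte, LI = 2, LO = 4) can be evaluated directly. The function names are hypothetical and chosen only for this sketch.

```python
def rlcr_plain(li, lo, sds):
    """Formula (1): (LI + LO) / SDS."""
    return (li + lo) / sds


def rlcr_weighted(li, lo, sds):
    """Formula (2): 2 * (0.618 * LI + 0.382 * LO) / SDS."""
    return 2 * (0.618 * li + 0.382 * lo) / sds


# Example from the text: SDS = 3 Kbyte, LI = 2, LO = 4.
print(rlcr_plain(2, 4, 3))               # 2.0
print(round(rlcr_weighted(2, 4, 3), 4))  # 1.8427
```

Because the two weights sum to 1, the factor 2 in formula (2) makes both formulas agree when LI equals LO; here the weighted value is lower because the page has more link-outs than link-ins.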
III. EXPERIMENTAL RESULTS
The experimental data come from the Chizhou College Web site (211.86.192.12): server log data from March 29, 2009 to July 4. The experiment compares the results of four session identification methods: session identification based on citation (method 1), session identification based on a fixed time threshold of ten minutes (method 2), session identification based on the page access time threshold G (method 3), and session identification based on the page access time threshold and session reconstruction (method 4).
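For reference, the fixed-threshold baseline (method 2) can be sketched as follows. The sketch assumes that the ten-minute threshold is applied to the gap between consecutive requests of one user; other variants of the timeout method limit the total session duration instead, and the text does not say which one was used.

```python
FIXED_THRESHOLD = 600.0  # ten minutes, in seconds


def split_fixed(times, threshold=FIXED_THRESHOLD):
    """Split one user's sorted request times into sessions whenever the
    gap between consecutive requests exceeds the fixed threshold."""
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


print(split_fixed([0, 120, 700, 1500, 1560]))
# prints [[0, 120, 700], [1500, 1560]]
```

The same threshold is used for every page, which is the property that methods 3 and 4 replace with a per-page value.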
The currently popular evaluation standard [5] is applied in this paper: the degree to which sessions are completely reconstructed by an algorithm h. Two indexes, precision and recall, are usually taken to measure the degree of reconstruction. Let R be the set of real sessions and Rh the set of sessions constructed by h. Precision is the ratio of the number of completely reconstructed sessions to the total number of sessions obtained through construction: precision(h) = |Rh ∩ R| / |Rh|. Recall is the ratio of the number of completely reconstructed sessions to the number of real sessions: recall(h) = |Rh ∩ R| / |R|. The data of the experiment are shown in Table 1; the comparison of all algorithms takes the method based on citation as the benchmark.
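The two indexes can be computed as in the following sketch. It assumes that a session is represented as a sequence of page identifiers and that a constructed session counts as completely reconstructed only when it is identical to a real session; in practice a session would also carry a user identifier so that equal page sequences of different users stay distinct.

```python
def precision_recall(constructed, real):
    """Return (precision, recall) in percent.

    Sessions are compared as tuples, so only exact matches count as
    completely reconstructed sessions.
    """
    constructed_set = {tuple(s) for s in constructed}
    real_set = {tuple(s) for s in real}
    hits = len(constructed_set & real_set)
    return 100.0 * hits / len(constructed_set), 100.0 * hits / len(real_set)


real = [["a", "b"], ["c"], ["d", "e"]]
found = [["a", "b"], ["c", "d"], ["e"], ["c"]]
precision, recall = precision_recall(found, real)
print(round(precision, 3), round(recall, 3))
```

Here two of the four constructed sessions match real sessions, so precision is 50% and recall is two thirds.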
To apply RLCR to the adjustment of the threshold G, the value of RLCR needs to be mapped into (0,1), and many mappings are available. For example, the ratio of a page's RLCR value to the maximum of all RLCR values maps into (0,1). However, this mapping tends to be influenced by isolated points: when the RLCR value of some page is very large, it compresses the values of all other pages. The following mapping is applied in this paper, with the influence factor of RLCR on G denoted E.
Definition 2: E is the influence factor of the page's RLCR on the access time threshold G, with its calculation formula [4] being E = 1 - exp(-RLCR). With this formula, however, E is effectively always 1 when RLCR > 20. In other words, all values over 20 become 1 when they are mapped into (0,1), which means it is impossible to distinguish them within (0,1). The formula is therefore updated to formula (3), which improves the accuracy of the mapping considerably: the upper limit at which RLCR values can still be distinguished reaches nearly 500.
$$E = 1 - \exp\left(-\sqrt{\sqrt{R_{LCR}}}\right) \qquad (3)$$
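The effect of the update can be checked numerically by evaluating both mappings for a few RLCR values. The function names are hypothetical; the formulas are the one quoted from [4] and formula (3).

```python
import math


def e_original(r_lcr):
    """Mapping from [4]: E = 1 - exp(-RLCR)."""
    return 1.0 - math.exp(-r_lcr)


def e_updated(r_lcr):
    """Formula (3): E = 1 - exp(-sqrt(sqrt(RLCR)))."""
    return 1.0 - math.exp(-math.sqrt(math.sqrt(r_lcr)))


for r in (1, 20, 100, 500):
    print(r, round(e_original(r), 6), round(e_updated(r), 6))
```

At six decimal places the original mapping already prints 1.0 for RLCR = 20, while the updated mapping still yields clearly different values for 20, 100 and 500.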
Taking the above adjustment into account, the threshold is calculated as follows:
$$G = D \, t \, (1 + E) \qquad (4)$$
C. Constructing the user session set according to the page access time threshold
To set the access time threshold of each page, the actual statistical access time t of the page must be obtained first; the threshold is then adjusted by combining it with the influence factor E of the page's RLCR. The access times are recorded as St = (t1, t2, ..., tn), and the influence factors are recorded as SE = (E1, E2, ..., En).

TABLE 1 - OUTCOME OF COMPARISON ON ALL METHODS OF SESSION IDENTIFICATION
Precision = |Rh ∩ R| / |Rh| and recall = |Rh ∩ R| / |R|, where Rh is the session set of the method in each row.
Method of session construction (set)                                  Efficient pages  Sessions  Intersection with R  Precision (%)  Recall (%)
Based on citation (R)                                                 161026           27130     27130                100            100
Based on fixed time threshold (T)                                     161026           46150     13004                28.178         47.932
Based on page access time threshold (A)                               161026           58956     20307                34.444         74.851
Based on page access time threshold and session reconstruction (FA)   93980            57705     21305                36.921         78.529
TABLE 2 - STATISTICS ON PAGE NUMBERS BASED ON METHOD 4
                  Total page number   G > 10 min   G < 5 min   G < 3 min   G < 1 min
Efficient pages   30136               5965         21107       12083       8075
From Table 1 and Table 2, the following conclusions can be drawn.
From Table 1, in terms of both precision and recall, method 3 and method 4 are similar, and both are higher than method 2. The latter two methods therefore perform better than method 2. In terms of algorithm efficiency, method 4 reduces the number of efficient pages by filtering the frame pages and thus improves the efficiency of session identification, so identification through method 4 is more efficient than through any of the others. From Table 2, even after the page threshold is adjusted it may still be too large or too small, so when G > 10 min it is set to 10 min, and when G < 60 s it is set to 60 s. Compared with the traditional threshold of 10 min, the improved values are significantly smaller (for example, pages with G < 5 min make up 21107/30136 = 70.039% of the total number), which reflects the users' browsing habits better and makes the classification of sessions more accurate. Compared with session identification based on a fixed session length, the improved method can identify long sessions: the maximum session length in the fixed method is bounded, usually set to 25.5 min, whereas in practice there are longer sessions, which the improved method can identify.
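The limiting of the threshold and the share quoted from Table 2 can be reproduced with a short sketch. The function name is hypothetical; the bounds of 60 s and 10 min are the ones given in the text.

```python
def clamp_threshold(g, low=60.0, high=600.0):
    """Limit a page threshold G (seconds) to the range [60 s, 10 min]."""
    return max(low, min(high, g))


print([clamp_threshold(g) for g in (12.0, 240.0, 1800.0)])  # [60.0, 240.0, 600.0]
print(round(100 * 21107 / 30136, 3))                        # 70.039
```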
IV. CONCLUSION
From the analysis of session identification, it is concluded that the existence of frame pages influences the authenticity and efficiency of session identification. The traditional session identification method adopts a fixed time threshold, which may make the session set deviate from the real sessions. Based on the existing data preprocessing technology, the paper proposes an improved method that first eliminates the impact of frame pages on the mining results and then uses the page access time threshold to identify the sessions. Tests on the experimental data and comparison with the traditional session identification methods show that the improved session identification improves not only the authenticity but also the efficiency of session identification.
ACKNOWLEDGMENT
The authors would like to thank the anonymous referees for their useful suggestions for improving this paper. This project is supported by the National High-Tech Research and Development Plan of China under Grant No. 2009AA010307.
REFERENCES
[1] Han Jia-Wei, Meng Xiao-Feng, Ang Jing. Research on Web Mining [J]. Journal of Computer Research & Development, 2001, 38(4): 405-414.
[2] Zhang H. Y., Liang W. A. An intelligent algorithm of data pre-processing in web usage mining [Z]. Intelligent Control and Automation, WCICA 2004, Fifth World Congress, 4: 3119-3123.
[3] Fang Y., Wang L. J., Ge Y. Study on data preprocessing algorithm in web log mining [C]. Machine Learning and Cybernetics, 2003 International Conference, 2003, 1: 28-32.
[4] Yin Xianliang, Zhang Wei. An Improved Session Identification Method in Web Mining [J]. Huazhong University of Science and Technology Journal (Natural Science Edition), 2006, 7: 33-35.
[5] Fu Y., Sandhu K., Shih M. A Generalization-Based Approach to Clustering of Web Usage Session [C]. Proc. 1999 KDD Workshop Web Mining, LNCS 1863. s.l.: Springer-Verlag, 2000: 21-28.