A Session Identification Algorithm Based on Frame Page and Page Threshold

Fang Yuankang1,2, Huang Zhiqiu1
1. Information Science and Technology School, Nanjing University of Aeronautics and Astronautics, Nanjing, China
2. Computer Department, Chizhou College, Chizhou, China
E-mail: [email protected]

Abstract--Session identification is an important step in the data preprocessing of web log mining. To address the defects of traditional session identification, an improved session identification algorithm is proposed. After specific users are identified, the large number of frame pages is filtered out. A reasonable access time threshold is then set for each page according to the content of the page and the overall structure of the site, and the users' session sets are identified with this threshold. Finally, the algorithm is compared experimentally with traditional session identification methods, and the results show that it is more reasonable and more effective.

II. IMPROVED SESSION IDENTIFICATION

Session identification proceeds in two steps. First, after the data has been cleaned, the frame pages are filtered. Second, the content of each page and the overall site structure are combined, taking the importance of the frame pages into account, to construct a session set that is closer to the actual sessions.

A. Data processing before the session identification

The aim of data filtering is to remove irrelevant log entries: failed requests, requests issued by automatic search agents or spiders, and requests for video and image files. User identification means identifying each specific user; the IP address, the user agent, and temporary information are used together to mark a unique user. After the users are identified, the main work is to filter a large number of frame pages. The HTML specification supports multi-window pages through the "Frame" tag. The page loaded in each window corresponds to a URL.
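The data filtering and user identification steps described in Section II.A can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LogEntry` record, the suffix and robot-marker lists, and the rule that only 2xx status codes count as successful are all assumptions made for the example, and users are keyed by IP address and agent only, because the "temporary information" mentioned in the text is not specified.

```python
from dataclasses import dataclass


# Hypothetical, simplified log record; a real server log has more fields.
@dataclass(frozen=True)
class LogEntry:
    ip: str
    agent: str
    url: str
    status: int
    timestamp: float  # seconds since an arbitrary epoch


# Assumed filter lists, chosen only for illustration.
MEDIA_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".avi", ".mpg", ".wmv")
ROBOT_MARKERS = ("bot", "spider", "crawler")


def is_relevant(entry: LogEntry) -> bool:
    """Keep only successful page requests made by ordinary browsers."""
    if not 200 <= entry.status < 300:
        return False  # failed request (assumption: only 2xx is a success)
    if any(marker in entry.agent.lower() for marker in ROBOT_MARKERS):
        return False  # automatic agent or spider
    if entry.url.lower().endswith(MEDIA_SUFFIXES):
        return False  # image or video file
    return True


def group_by_user(entries):
    """Group the cleaned entries by (IP, agent), each group sorted by time."""
    users = {}
    for entry in entries:
        if is_relevant(entry):
            users.setdefault((entry.ip, entry.agent), []).append(entry)
    for requests in users.values():
        requests.sort(key=lambda e: e.timestamp)
    return users
```

Each value of the returned dictionary is one user's time-ordered request list, which is the input that frame-page filtering and session identification then work on.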
Note that a subframe page may itself be a frame page that contains its own sub-windows. For example, suppose A is a frame page and B and C are subframes of A; B can then be both a subframe page (of A) and a frame page (for its own sub-windows). A frame page and its subframe pages always appear in a session at the same time. In mining tests on the Chizhou College Web server logs, the large number of frame and subframe pages made session identification deviate from the true sessions. The aim of Web mining is to discover unknown patterns in user behavior, whereas the relationship between frame and subframe pages is already known, so the impact of subframes on session identification should be eliminated. Because frame and subframe pages together make up one multi-window page, they should be treated as a whole: a user's request for them is a request for the multi-window page. Treating them this way removes their influence on session identification and improves the interestingness of the Web mining results.

Keywords: Web mining; data preprocessing; threshold; frame page; session identification

I. INTRODUCTION

Web log mining [1] discovers the patterns in which users access web pages by mining web logs. In the process, the designer's domain knowledge, the users' degree of interest, and the users' visiting habits can be extracted. This information helps optimize the site structure and develop personalized services and user control, and it gives site designers and managers useful strategic information. The most important and most time-consuming step in mining web logs is session identification during web log preprocessing. A user session is a set of page accesses that can span more than one web server. The aim of session identification is to divide each user's page accesses into separate sessions.
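The frame-page filtering described in Section II.A (treating a frame page and its subframes as one multi-window request) could look roughly like the sketch below. The `parent_of` mapping from each subframe URL to its frame URL is an assumption: the text says the frame relationship is known from the site structure but does not say how it is stored. Merging adjacent repeats is also an illustrative choice and would merge a genuine reload of the same page as well.

```python
def top_frame(url, parent_of):
    """Follow subframe -> frame links up to the outermost frame page."""
    seen = set()
    while url in parent_of and url not in seen:
        seen.add(url)  # guard against a cyclic mapping
        url = parent_of[url]
    return url


def collapse_frames(urls, parent_of):
    """Map each request to its outermost frame and merge adjacent repeats."""
    collapsed = []
    for url in urls:
        page = top_frame(url, parent_of)
        if not collapsed or collapsed[-1] != page:
            collapsed.append(page)
    return collapsed


# Example with the A/B/C relationship from the text, plus a hypothetical
# page D loaded in a sub-window of B and an unrelated page X.
parent_of = {"B": "A", "C": "A", "D": "B"}
print(collapse_frames(["A", "B", "C", "D", "X", "A", "B"], parent_of))
# prints ['A', 'X', 'A']
```

Following the chain in `top_frame` covers the case where B is both a subframe page and a frame page.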
There are two main methods of session identification: the timeout method [2] and maximal forward references [3]. Traditional session identification has two defects. On the one hand, records in the same session may be divided into different sessions, or records from different sessions may be merged into the same session. On the other hand, the number of effective frame pages counted exceeds the actual number, which decreases the efficiency of session identification. As a result, the session set generated by traditional session identification differs considerably from the actual sessions, and these errors seriously affect the authenticity of the identified sessions. This paper proposes an improved session identification algorithm based on frame pages and the importance of pages. Experiments show that this algorithm makes the identified sessions more authentic and raises the efficiency of session identification.

978-1-4244-5539-3/10/$26.00 ©2010 IEEE

B. Construct the threshold G of page access time according to importance of the page

When constructing the page access time threshold G, it is necessary to consider the importance of the page and the effect that frame pages have on G.

Definition 1: The linking content ratio (RLCR) is the ratio of the number of links of a page to the amount of page content. With the size of the page denoted SDS, RLCR is calculated as follows:

$$\mathrm{RLCR} = \frac{LI + LO}{SDS} \qquad (1)$$

The steps for calculating the threshold set are as follows. First, from the log file statistics, obtain the set of access times $S_t$: classify the log file by user, that is, identify the users and measure the access time t of each visited page. Then adjust $S_t$ by the set of influence factors $S_E$ to obtain the page access time threshold set $S_G$; that is, combine $S_t$ and $S_E$ according to formula (4) to calculate $S_G$. Finally, according to $S_G$, divide each user's log records into sessions again.
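The last of the calculating steps, dividing a user's records into sessions once a threshold is known for each page, might be sketched as below. The text does not spell out exactly how the threshold is applied, so this assumes a new session starts when the time spent on the previous page exceeds that page's threshold. The function name, the `(url, timestamp)` request format, and the 600-second default are illustrative choices.

```python
def split_sessions(requests, threshold_of, default_threshold=600.0):
    """Split one user's time-ordered (url, timestamp) requests into sessions.

    A new session starts when the gap after the previous page is larger
    than that page's threshold in seconds (assumed interpretation).
    """
    sessions = []
    current = []
    for url, timestamp in requests:
        if current:
            prev_url, prev_time = current[-1]
            limit = threshold_of.get(prev_url, default_threshold)
            if timestamp - prev_time > limit:
                sessions.append(current)
                current = []
        current.append((url, timestamp))
    if current:
        sessions.append(current)
    return sessions
```

With an empty `threshold_of` mapping, every page falls back to the same default, so the function reduces to a fixed ten-minute timeout; with per-page values it applies a different limit to each page.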
In formula (1), the link-in count LI is the number of pages that link to the page, and the link-out count LO is the number of links that the page contains. For example, if the size of a page, SDS, is 3 Kbyte, its LI is 2 and its LO is 4, then RLCR = (2 + 4)/3 = 2. Usually the link-in is more important than the link-out, so the two need to be weighted. In this paper, the weighting ratio of link-in to link-out is assumed to be the golden ratio. The calculation formula of RLCR is adjusted to:

$$\mathrm{RLCR} = \frac{2\,(0.618\,LI + 0.382\,LO)}{SDS} \qquad (2)$$

III. EXPERIMENTAL RESULTS

The experimental data comes from the Chizhou College Web site (211.86.192.12): server log data from March 29, 2009 to July 4. The experiment compares the results of four session identification methods: session identification based on citation (method 1), session identification based on a fixed time threshold of ten minutes (method 2), session identification based on the page access time threshold G (method 3), and session identification based on the page access time threshold and session reconstruction (method 4).

A currently popular evaluation standard [5] is applied in this paper: the degree to which sessions are completely reconstructed by an algorithm h. Two indexes, precision and recall, are usually taken to measure the degree of reconstruction. Let $R$ be the set of real sessions and $R_h$ the set of sessions constructed by h. Precision is the ratio of the number of completely reconstructed sessions to the total number of sessions obtained by the construction: $\mathrm{precision}(h) = |R_h \cap R| / |R_h|$. Recall is the ratio of the number of completely reconstructed sessions to the number of real sessions: $\mathrm{recall}(h) = |R_h \cap R| / |R|$. The experimental data is shown in Table 1; the comparison of all algorithms takes the method based on citation as the benchmark.
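The two evaluation indexes can be computed with a few lines of set arithmetic. The sketch below assumes each session is represented as a sequence whose first element identifies the user, so that two different sessions never compare equal; this representation is an illustrative choice and is not taken from the paper.

```python
def precision_recall(constructed, real):
    """Return (precision, recall) of constructed sessions against real ones.

    A constructed session counts as completely reconstructed only if it is
    identical to one of the real sessions.
    """
    constructed_set = {tuple(session) for session in constructed}
    real_set = {tuple(session) for session in real}
    matched = len(constructed_set & real_set)
    precision = matched / len(constructed_set) if constructed_set else 0.0
    recall = matched / len(real_set) if real_set else 0.0
    return precision, recall


# Toy data: two of four constructed sessions match the three real ones.
real = [("u1", "a", "b"), ("u1", "c"), ("u2", "a")]
constructed = [("u1", "a", "b"), ("u1", "c", "d"), ("u2", "a"), ("u2", "x")]
precision, recall = precision_recall(constructed, real)
# precision is 2/4 and recall is 2/3
```

When the benchmark set is compared with itself, both values are 1, which corresponds to the 100% entries a benchmark method receives.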
To apply RLCR to the adjustment of the threshold G, its value needs to be mapped into (0,1), and many mappings are available. For example, the ratio of each RLCR value to the maximum of all RLCR values lies in (0,1). However, this mapping tends to be influenced by isolated points: when the RLCR value of some page is very large, it distorts the mapped values of all other pages. The following mapping method is applied in this paper.

Definition 2: E is the influence factor of the page's RLCR on the access time threshold G, with the calculation formula [4] $E = 1 - \exp(-\mathrm{RLCR})$.

However, when this is evaluated numerically, E is always 1 once RLCR > 20. In other words, all values over 20 are mapped to 1, so they cannot be distinguished within (0,1). For this reason the formula is updated to formula (3), which improves the accuracy of the mapping considerably, since the upper limit of distinguishable RLCR values reaches nearly 500:

$$E = 1 - \exp\!\left(-\sqrt{\sqrt{\mathrm{RLCR}}}\right) \qquad (3)$$

Taking the above adjustment into account, the calculation formula is as follows:

$$G = D\,t\,(1 + E) \qquad (4)$$

C. Construct the user session set according to the page time threshold

To set the access time threshold of each page, the actual statistical access time t of the page must be obtained first and then adjusted by the influence factor E of the page's RLCR. The access times are recorded as $S_t = (t_1, t_2, \ldots, t_n)$ and the influence factors as $S_E = (E_1, E_2, \ldots, E_n)$.
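Formulas (1) to (4) can be chained into a small calculation, sketched below under stated assumptions. The coefficient D in formula (4) is not defined in the text, so it appears here as a parameter `d` that defaults to 1; the page size is taken in Kbyte as in the worked example for formula (1); all function names are illustrative.

```python
import math

LINK_IN_WEIGHT = 0.618   # golden-ratio weights from formula (2)
LINK_OUT_WEIGHT = 0.382


def rlcr(link_in, link_out, size_kb):
    """Unweighted linking content ratio, formula (1)."""
    return (link_in + link_out) / size_kb


def weighted_rlcr(link_in, link_out, size_kb):
    """Weighted linking content ratio, formula (2)."""
    return 2 * (LINK_IN_WEIGHT * link_in + LINK_OUT_WEIGHT * link_out) / size_kb


def influence_factor(rlcr_value):
    """Map RLCR into [0, 1) with formula (3)."""
    return 1.0 - math.exp(-math.sqrt(math.sqrt(rlcr_value)))


def access_threshold(t, rlcr_value, d=1.0):
    """Formula (4): G = D * t * (1 + E); d is an assumed coefficient."""
    return d * t * (1.0 + influence_factor(rlcr_value))


def threshold_set(access_times, rlcr_values, d=1.0):
    """Combine S_t and the per-page RLCR values into the threshold set S_G."""
    return [access_threshold(t, r, d) for t, r in zip(access_times, rlcr_values)]


# Worked example from the text: SDS = 3 Kbyte, LI = 2, LO = 4.
print(rlcr(2, 4, 3))  # prints 2.0
```

Since E stays below 1, formula (4) with D = 1 keeps each threshold between t and 2t; a different D would rescale that range.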
TABLE 1 - OUTCOME OF COMPARISON ON ALL METHODS OF SESSION IDENTIFICATION

Method of session construction (session set X)                         Efficient pages   Sessions |X|   Intersection |X∩R|   Precision |X∩R|/|X| (%)   Recall |X∩R|/|R| (%)
Based on citation (R)                                                  161026            27130          27130                100                       100
Based on fixed time threshold (T)                                      161026            46150          13004                28.178                    47.932
Based on page access time threshold (A)                                161026            58956          20307                34.444                    74.851
Based on page access time threshold and session reconstruction (FA)    93980             57705          21305                36.921                    78.529

TABLE 2 - STATISTICS ON PAGE NUMBERS BASED ON METHOD 4

Threshold range       Efficient pages
Total page number     30136
G > 10 min            5965
G < 5 min             21107
G < 3 min             12083
G < 1 min             8075

From Table 1 and Table 2, the following conclusions can be drawn.

From Table 1, in terms of both precision and recall, method 3 and method 4 are similar, and both are higher than method 2. Therefore the latter two methods perform better than method 2. In terms of algorithm efficiency, method 4 reduces the number of efficient pages by filtering the frame pages (93980 versus 161026), so its identification efficiency is higher than that of any of the other methods.

From Table 2, although the page threshold is adjusted, it may still be too large or too small, so when G > 10 min it is set to 10 min, and when G < 60 s it is set to 60 s. Compared with the traditional threshold of 10 min, the improved values are significantly smaller (for example, pages with G < 5 min make up 21107/30136 = 70.039% of the total), which reflects the users' browsing habits better and so gives a more accurate division of sessions. Compared with identification based on a fixed session length, the improved method can identify long sessions. The maximum session length under fixed identification is capped, usually at 25.5 min, whereas in practice there are longer sessions, and the improved method can identify them.
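The clamping rule for thresholds and the percentage quoted from Table 2 can be checked with a short sketch. The constants come from the text (60 s and 10 min); the function names and the list-of-seconds representation of the threshold set are assumptions made for illustration.

```python
MIN_THRESHOLD_S = 60.0    # lower bound from the text: 60 s
MAX_THRESHOLD_S = 600.0   # upper bound from the text: 10 min


def clamp_threshold(g_seconds):
    """Limit a page threshold to the range [60 s, 10 min]."""
    return min(max(g_seconds, MIN_THRESHOLD_S), MAX_THRESHOLD_S)


def share_below(thresholds, limit_seconds):
    """Percentage of pages whose threshold is strictly below the limit."""
    if not thresholds:
        return 0.0
    below = sum(1 for g in thresholds if g < limit_seconds)
    return 100.0 * below / len(thresholds)


# Share of pages with G < 5 min, using the counts reported in Table 2.
print(round(100.0 * 21107 / 30136, 3))  # prints 70.039
```

Applying `clamp_threshold` before session division keeps a single extreme access time from producing an unusably small or large threshold.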
IV. CONCLUSION

The analysis of session identification above leads to the conclusion that the existence of frame pages affects the authenticity and efficiency of session identification. The traditional session identification method adopts a fixed time threshold, which may make the session set deviate from the real sessions. Building on existing data preprocessing technology, this paper proposes an improved method that first eliminates the impact of frame pages on the mining results and then uses the page access time threshold to identify sessions. Tests on experimental data and a comparison with the traditional session identification methods show that the improved session identification improves both the authenticity and the efficiency of session identification.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their useful suggestions for improving this paper. This project is supported by the National High-Tech Research and Development Plan of China under Grant No. 2009AA010307.

REFERENCE

[1] Han Jia-Wei, Meng Xiao-Feng, Ang Jing. Research on Web Mining [J]. Journal of Computer Research & Development, 2001, 38(4): 405-414.
[2] Zhang H. Y., Liang W. A. An intelligent algorithm of data pre-processing in web usage mining [Z]. Intelligent Control and Automation, WCICA 2004, Fifth World Congress, 4: 3119-3123.
[3] Fang Y., Wang L. J., Ge Y. Study on data preprocessing algorithm in web log mining [C]. Machine Learning and Cybernetics, 2003 International Conference, 2003, 1: 28-32.
[4] Yin Xianliang, Zhang Wei. An Improved Session Identification Method in Web Mining [J]. Huazhong University of Science and Technology Journal (natural science edition), 2006, 7: 33-35.
[5] Fu Y., Sandhu K., Shih M. A Generalization-Based Approach to Clustering of Web Usage Session [C]. Proc 1999 KDD Workshop Web Mining, LNCS 1863. s.l.: Springer-Verlag, 2000: 21-28.