
Research on Application of Web Usage Mining in E-Government Personalized Information System

Liu Honglu, Tian Zhihong
Transport College, Beijing Jiaotong University, Beijing, P.R. China, 100044

Abstract: After analyzing users' personalized demands on e-government systems, this paper puts forward a personalized information service framework based on web usage mining, comprising a data preparation module, a data mining module and a real-time recommendation module. It then studies two key algorithms: frequent access path mining and the BIRCH clustering algorithm. The system provides different services to different users, which improves the quality of service and lets every user visit the site in his or her own way. Finally, a Web Log Mining Experimental System is designed and developed, which implements the fundamental procedures and algorithms. The test results indicate that the proposed system design is workable.

Keywords: E-government, Personalized information service, Web mining, Web usage mining

Introduction

According to the "China Informatization Development Report 2005", there are more than 10,000 national government portals. About 93% of ministries and 73% of local governments (province, prefecture, county) have their own websites. Structurally, these websites are split into many fragments along the lines of government agencies, and information resources are distributed across different functional departments, forming so-called "isolated information islands". Both the quantity and the structure of government information resources make them difficult for users to handle. Huge quantities of irrelevant information or a complicated website structure confuse users, a situation known as being "lost in information". It is therefore urgent to introduce personalized information services based on users' interests into e-government systems.
To solve this problem, this paper proposes a framework for personalized information service based on web log mining and explains the implementation of its key technologies.

1 Web Usage Mining Technology

1.1 Concept of Web Usage Mining

Web usage mining is the process of extracting "interesting" patterns from web data. Web data include web server access logs, proxy server logs, browser logs, user registration information, and user sessions (or transaction data). This paper mainly uses web logs as the data source, so the term "web log mining" is used in place of "web usage mining" throughout.

1.2 The Process of Web Log Mining

The process of web log mining is as follows:

1) Data preprocessing. Data preprocessing is the first stage of web log mining. It converts the raw data into the data abstractions necessary for pattern discovery, and includes data cleaning, user recognition, session recognition, path supplement, transaction recognition and so on. Web log data preprocessing has a direct impact on the correctness of the models and pattern rules discovered in the next stage.

2) Pattern discovery. In this stage, various methods are used to find models and pattern rules of users' access behavior. Common techniques include clustering, classification, association rules and sequential patterns.

3) Pattern analysis. In most cases, web usage mining finds all models and rules. Pattern analysis is used to extract the truly interesting patterns from among them. Common analysis methods include visualization and database queries.

1.3 Related Technology

1) Sequential patterns. Sequential pattern discovery attempts to find inter-session patterns in which the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Frequent access path mining for a single user is one example of sequential patterns.
It discovers frequent access paths from a time-ordered transaction set.

2) Clustering. Clustering is a technique for grouping together unlabeled items that have similar characteristics. In the web usage domain, there are two kinds of interesting clusters to discover: usage clusters and page clusters.

2 E-government Personalized Information Service System Framework

2.1 Personalized Information Service

Personalized information service is an information service tailored to the individual user. Based on familiarity with the user's interests and web behavior, it recommends only information that matches the user's personal characteristics. It should be organized according to the user's knowledge structure, psychological orientation, information needs and behavior patterns, so that it gives users sufficient incentive to retrieve and access information effectively, and promotes the effective use of information and the knowledge innovation that users build on it.

2.2 Personalized Information Service Model

We put forward the following service model: using mining technology, the system analyzes web users' logs or other records to find users' visiting habits and interest patterns, matches them against the information on the website, and finally recommends to users the information that may interest them.

Figure 1: Personalized Information Service Model (web log data -> extraction of "interesting" patterns -> real-time information recommendation -> users)

2.3 System Requirements Analysis

As a typical information service system whose main function is to provide information, the e-government personalized information service system has the following requirements:

1) Recommendation of users' frequent access paths. The pattern mining engine provided by the system finds users' frequent access paths and presents them as hyperlinks in the pages shown to the users.
In other words, the system automatically identifies and stores each user's frequently visited pages; when the user next visits the site, hyperlinks to those pages appear on the home page, so the user can go to them directly.

2) Usage cluster recommendation. In addition to recommending users' frequent access paths, the system also recommends information based on usage clusters. That is, it recommends to a user the information visited by other members of the same cluster. Because the user and the other members of the cluster have similar interests, information that the others visited may also interest this user.

2.4 System Framework Design

To meet the requirements above, web log mining and the personalized information service model are integrated into the following personalized information service system framework:

Figure 2: E-government Personalized Information Service System Architecture (components shown: site structure, web pages and web logs; data preprocessing module; session (transaction) files; mining module with web log mining algorithms producing usage clusters, page clusters and the frequent access path set; real-time intelligent recommendation module; recommended pages; current session; web server; user)

1) Data preprocessing module: corresponds to the data preprocessing stage for log data.

2) Data mining module: the difficulty in this module is choosing a suitable algorithm for each task. This paper applies a frequent access path mining algorithm and the BIRCH usage clustering algorithm to meet the requirements of personalized service.

3) Real-time intelligent recommendation module: this is the only online processing module, and it acts as the adapter between the users and the system.
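As an illustration of the data preprocessing module, session recognition might be sketched as follows. This is only a minimal sketch: the paper does not specify how sessions are delimited, so the 30-minute inactivity timeout, the (user, timestamp, page) record layout and the function name `split_sessions` are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: a session ends after 30 minutes without a request from the same user.
SESSION_TIMEOUT = 30 * 60  # seconds


def split_sessions(records, timeout=SESSION_TIMEOUT):
    """Group cleaned (user_id, unix_seconds, page) records into per-user sessions.

    Returns a dict mapping each user_id to a list of sessions,
    where a session is a time-ordered list of pages.
    """
    by_user = defaultdict(list)
    for user, ts, page in records:
        by_user[user].append((ts, page))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # time order within one user
        user_sessions, current, last_ts = [], [], None
        for ts, page in hits:
            # A long gap since the previous request starts a new session.
            if last_ts is not None and ts - last_ts > timeout:
                user_sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        if current:
            user_sessions.append(current)
        sessions[user] = user_sessions
    return sessions
```

The resulting per-user page lists play the role of the session (transaction) files that the mining module takes as input.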
3 Web Log Mining Algorithms

3.1 Frequent Access Path Mining Algorithm

A user's frequent access path is a sequence of pages browsed by the user within a certain period of time; it best reflects the user's interests in that period. Mining users' frequent access paths is therefore of great significance for discovering current interests and providing personalized service.

The input of the frequent access path mining algorithm is the result set of transaction recognition, the MFP set. The output is the set of the user's frequent access paths and their corresponding supports. Based on these results the system can build user interest models. The relevant definitions and concepts are as follows.

Definition 1. Let $P = \{\chi_1, \chi_2, \ldots, \chi_n\}$ be a page sequence. $P$ is called a frequent page sequence or frequent access pattern if it meets the condition

$$\frac{|\{T \mid P \subset T\}|}{|WTS|} \times 100\% \ge S_{\min}$$

where $T$ is a web transaction, $S_{\min}$ ($0 < S_{\min} < 1$) is a minimal support threshold specified by the user, and $WTS$ is the web transaction set.

Definition 2. A page sequence of length $n$ is also called an $n$-sequence.

Definition 3 (Candidate path). If two time-ordered subsequences $\{\chi_j, \ldots, \chi_{j+k-2}\}$ and $\{\chi_{j+1}, \ldots, \chi_{j+k-1}\}$ are both elements of $FP_{k-1}$, in other words their supports are not smaller than the support of $P_{k-1,m}$, then $\{\chi_j, \ldots, \chi_{j+k-1}\}$ is called a candidate path of $FP_k$.

To mine users' frequent access paths of length $k$, construct $FP_k$. The main idea of the algorithm is based on the concept of the candidate path set: find the candidate paths of length $k$ from the MFP set, then calculate their support over all users' sessions. The $M$ paths with the largest support form the set $FP_{k,m}$.
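The candidate-path idea can be illustrated with a short Python sketch. It is a simplified reading of the definitions above rather than the authors' implementation: support is kept as an absolute occurrence count, the threshold is an assumed parameter `min_count` (instead of the percentage $S_{\min}$ or the top-$M$ selection), and the function names are hypothetical.

```python
from collections import Counter


def count_paths(sessions, k):
    """Count every contiguous page path of length k over all sessions."""
    counts = Counter()
    for pages in sessions:
        for j in range(len(pages) - k + 1):
            counts[tuple(pages[j:j + k])] += 1
    return counts


def frequent_paths(sessions, k, min_count):
    """Level-wise mining of frequent access paths of length k.

    A path of length n is counted only if both of its length n-1
    sub-paths were frequent in the previous pass (the candidate path rule).
    min_count is an assumed absolute support threshold.
    """
    frequent = {p: c for p, c in count_paths(sessions, 1).items() if c >= min_count}
    for n in range(2, k + 1):
        counts = Counter()
        for pages in sessions:
            for j in range(len(pages) - n + 1):
                path = tuple(pages[j:j + n])
                if path[:-1] in frequent and path[1:] in frequent:
                    counts[path] += 1
        frequent = {p: c for p, c in counts.items() if c >= min_count}
    return frequent


# Toy sessions invented for illustration; they are not the paper's test log.
sessions = [list("abej"), list("abe"), list("abej"), list("abe")]
print(frequent_paths(sessions, k=3, min_count=2))
# {('a', 'b', 'e'): 4, ('b', 'e', 'j'): 2}
```

Raising `min_count` to 3 in this toy example drops page j at the first pass, so the path b, e, j is never generated as a candidate and only a, b, e remains.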
Construct $FP_k$ algorithm:

Input: set of MFPs: $F_i$
Output: set of frequent access paths: $FP_k$ ($k > 1$)

For every $F_i$ {
    For each $\{\chi_1, \chi_2, \ldots, \chi_m\}$ in $F_i$ {
        if ($k \le m$) {
            For ($j = 1$; $j \le m - k + 1$; $j$++) {
                if $\{\chi_j, \ldots, \chi_{j+k-1}\}$ is in $FP_k$
                    support of $\{\chi_j, \ldots, \chi_{j+k-1}\}$ + 1
                else if support of $\{\chi_j, \ldots, \chi_{j+k-2}\} \ge s_{k-1}$
                        and support of $\{\chi_{j+1}, \ldots, \chi_{j+k-1}\} \ge s_{k-1}$
                    insert $\{\chi_j, \ldots, \chi_{j+k-1}\}$ into $FP_k$;
}}}}

Before calling the algorithm above, calculate the support of every single page in the sessions, that is, of every path of length 1. Then call the algorithm repeatedly for lengths 2 through $k$; each pass reuses the supports computed in the previous pass.

3.2 BIRCH Algorithm

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) integrates a variety of clustering techniques. A single scan of the data objects produces a basic clustering, and additional scans significantly improve the clustering quality.

Before BIRCH is applied, sessions must be converted to numerical form. Let $|X|$ denote the number of pages on the website. Each session can then be expressed as an $|X|$-dimensional vector whose components are 1 or 0: 1 means that the page was visited in this session, and 0 means that it was not.

Table 1: Visit Matrix

session   X1   X2   X3   X4
1         1    1    0    1
2         0    0    1    1
3         1    0    1    0

BIRCH can be applied to dynamic clustering. It needs two parameters: the maximum number of clusters $K$ and the maximum cluster radius $r$. Each cluster is represented by a pair $(C_i, R_i)$ giving the center and the radius of the cluster respectively. BIRCH ensures that each cluster is tight as long as the number of clusters does not exceed $K$. When the number of clusters would exceed $K$, the threshold $r$ must be enlarged. The steps of the algorithm are as follows:

1) Temporarily determine the cluster centers: randomly select several session vectors such that the distance between any two of them is greater than $2r$; set $j = 1$.

2) Read in the $j$-th session record and calculate its distance $d$ to each cluster center.
Assume that the distance to $C_i$ is the shortest.

3) Tentatively assign the $j$-th session to the $i$-th cluster and calculate the radius $R_i'$ of the enlarged $i$-th cluster. If $R_i' \le r$, the $i$-th cluster remains tight, so session $j$ is assigned to the $i$-th cluster and the radius of the cluster is updated to $R_i'$. If $R_i' > r$, the $i$-th cluster would no longer be tight. In that case, if the number of clusters is smaller than $K$, let session $j$ form a new cluster with center $C = j$ and radius $R = 0$; otherwise enlarge the threshold $r$ and go back to step 1) to recompute.

4) Set $j = j + 1$. If $j$ is not greater than the total number of sessions, go to step 2); otherwise the algorithm ends.

4 Experimental System: WLMES

4.1 System Design

The system was developed on the Windows Server 2003 operating system and the Tomcat server platform, using the Java programming language. The system interface is as follows:

Figure 3: System Interface

4.2 Experimental Analysis

A web server log file was selected as the test data. It contains more than 80 records from May 5, 2006. Figure 4 shows part of the original web log.

Figure 4: Original Web Log

The web log data come from a simple website whose topology is as follows (each letter represents a page):

Figure 5: Topology of the site

To improve accuracy and reduce the amount of mining, the first session of the first user was used as the data source, and the frequent access path mining algorithm was run with a path length of three. The results are as follows.

When k = 3, the set of frequent access paths and their supports is {bej=2, abe=4}.

That is, in the first session of the first user, for paths of length three, the path from a to b to e was visited 4 times, and the path from b to e to j was visited 2 times. Based on these results, the user is more interested in the content along a->b->e, and this path can be recommended to the user as soon as the user enters the site.
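The clustering steps of Section 3.2 can be illustrated with a short Python sketch applied to the visit matrix of Table 1. It follows the simplified threshold procedure described there, not a full CF-tree BIRCH implementation. Several details are assumptions because the paper leaves them open: the cluster center is taken as the mean of the member vectors, the radius as the largest Euclidean distance from the center to a member, the first session seeds the first cluster (instead of random seeds more than $2r$ apart), and the threshold is enlarged by an assumed factor `growth`. The function names are hypothetical.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length vectors (assumed cluster center)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def radius(vectors):
    """Largest distance from the centroid to any member (assumed radius)."""
    c = centroid(vectors)
    return max(math.dist(c, v) for v in vectors)


def cluster_sessions(vectors, max_clusters, r, growth=1.5):
    """Threshold clustering of session vectors; returns (clusters, final r)."""
    if r <= 0:
        raise ValueError("r must be positive")
    while True:
        clusters = [[vectors[0]]]  # assumption: seed with the first session
        restart = False
        for v in vectors[1:]:
            # Step 2: find the cluster whose center is nearest to the session.
            nearest = min(clusters, key=lambda m: math.dist(centroid(m), v))
            # Step 3: accept the session only if the cluster stays tight.
            if radius(nearest + [v]) <= r:
                nearest.append(v)
            elif len(clusters) < max_clusters:
                clusters.append([v])
            else:
                r *= growth  # enlarge the threshold and start over
                restart = True
                break
        if not restart:
            return clusters, r


visit_matrix = [[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0]]  # rows of Table 1
clusters, final_r = cluster_sessions(visit_matrix, max_clusters=2, r=1.0)
```

With these assumed definitions and parameters, sessions 1 and 2 of Table 1 end up in one cluster and session 3 forms its own. A different radius definition or seeding rule could group the sessions differently.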
Conclusion

Web mining is a good choice for e-government personalized service systems, and it has broad prospects for future development. However, web usage mining still faces open problems both in theory and in practice. Future work will focus on two research directions: accurately identifying users in a proxy environment, and determining session boundaries.

References

[1] Sun Huamei. Research on Web Usage Mining Method and the Theory. Doctoral Dissertation of Harbin Industry University. March 2005
[2] "China Informatization Development Report 2005". State Council Informatization Office. July 26, 2005
[3] Li Jianxiang. Applications of BIRCH Algorithm in Design of Adaptive Web. Beijing Business University (Natural Science). June 2003, Volume 21, No. 2
[4] Liu Hong-lu, etc. "Introduction to E-government". Posts and Telecom Press. 2005
[5] "2004 China Internet Information Resources Report". State Council Information Office. April 14, 2005
[6] Tao Huanhua, Jiang Lingyan. Web-based Data Mining Behavior Analysis and Research. Fujian Computer. 2004, No. 3
[7] Jin Fengrong. Study of Web Usage Mining and Discovery of Browse Interest. Master's Degree Thesis of Beijing Science and Technology University. February 2004
[8] Zhang Sulan. Research and Implementation of Web Users Visited Mining Related Technology. Master's Degree Thesis of Beijing Science and Technology University. May 2004
[9] Xie Chunli. Analysis and Research of Web-based Data Mining Behavior. Master's Degree Thesis of Suzhou University. April 2003
[10] Han Jiawei. Research on Web Mining. Computer Research and Development. 2001
[11] Yang Yiling, etc. A Simple Web Log Mining System. Shanghai Jiaotong University Journal. July 2000, Vol. 34, No. 7
