Research on Application of Web Usage Mining in E-Government
Personalized Information System
Liu Honglu, Tian Zhihong
Beijing Jiaotong University, Transport College, Beijing, P.R. China 100044
Abstract: After analyzing users' personalized demands on e-government systems, this paper puts
forward a personalized information service framework based on web usage mining, which includes a data
preparation module, a data mining module, and a real-time recommendation module. It then studies two
key algorithms: frequent access path mining and the BIRCH clustering algorithm. The system provides
different services for different users, which improves the quality of service and lets every user visit the
site in his or her own way.
Finally, a Web Log Mining Experimental System is designed and developed. The system implements
the fundamental procedures and algorithms, and the test results indicate that the system design put
forward in this paper is workable.
Keywords: E-Government, Personalized information service, Web mining, Web usage mining
Introduction
According to the "China Informatization Development Report 2005", there are more than 10,000
government portals nationwide. About 93% of ministries have their own websites, and 73% of local
governments (province, prefecture, county) have their own websites. Structurally, these websites are
fragmented along agency lines, and information resources are scattered across different functional
departments, forming so-called "Information Detached Islands". Both the quantity and the structure of
government information resources make them difficult for users to handle. Huge quantities of irrelevant
information and complicated website structures leave users confused, a condition known as
"Lost in Information".
Therefore, there is an urgent need to introduce personalized information services based on users'
interests into e-government systems. To solve this problem, this paper proposes a framework for
personalized information service based on web log mining and explains the implementation of its key
technologies.
1. Web Usage Mining Technology
1.1 Concept of Web Usage Mining
Web usage mining is the process of extracting interesting patterns from web data. The web data
include web server access logs, proxy server logs, browser logs, user registration information, and user
sessions (or transaction data). Because this paper mainly uses web logs as the data source, it uses the
term web log mining in place of web usage mining.
1.2 The Process of Web Log Mining
The process of web log mining is as follows:
1) Data preprocessing
Data preprocessing is the first stage of web log mining. It converts the raw data into the data
abstractions necessary for pattern discovery, and includes data cleaning, user identification, session
identification, path completion, transaction identification, and so on. Web log data preprocessing has a
direct impact on the correctness of the models and pattern rules discovered in the following stage.
2) Pattern discovery
In this stage, various methods are used to find models and pattern rules of users' access behavior.
Common techniques include clustering, classification, association rules, and sequential patterns.
3) Pattern analysis
In most cases, pattern discovery returns all the models and rules it can find. Pattern analysis is used to
extract the really interesting patterns from these models and rules. Common analysis methods include
visualization and database queries.
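The data cleaning and session identification steps of the preprocessing stage can be sketched as follows. This is only an illustration, not the preprocessing used in this paper: the record layout, the list of ignored file suffixes, and the 30-minute session timeout are assumptions, and the sample records are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (client IP, timestamp, requested URL).
RECORDS = [
    ("10.0.0.1", "2006-05-05 09:00:00", "/a.html"),
    ("10.0.0.1", "2006-05-05 09:00:01", "/logo.gif"),
    ("10.0.0.1", "2006-05-05 09:02:00", "/b.html"),
    ("10.0.0.2", "2006-05-05 09:03:00", "/a.html"),
    ("10.0.0.1", "2006-05-05 10:00:00", "/e.html"),
]

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed cleaning rule
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed session boundary threshold


def preprocess(records):
    """Return {user: [session, ...]}, each session being a list of page URLs."""
    sessions = {}
    last_seen = {}
    # Users are identified by IP address only (a simplifying assumption).
    for ip, stamp, url in sorted(records, key=lambda r: (r[0], r[1])):
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue  # data cleaning: drop embedded resources
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        user_sessions = sessions.setdefault(ip, [])
        # Session identification: a long gap starts a new session.
        if ip not in last_seen or when - last_seen[ip] > SESSION_TIMEOUT:
            user_sessions.append([])
        user_sessions[-1].append(url)
        last_seen[ip] = when
    return sessions


result = preprocess(RECORDS)
```

Identifying users by IP address alone is the weakest part of this sketch, since several users behind one proxy would be merged into a single user.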
1.3 Related Technology
1) Sequential patterns
Sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of
items is followed by another item in a time-ordered set of sessions or episodes. Mining the frequent
access paths of a single user is one example of sequential pattern discovery: it discovers frequent access
paths from a time-ordered transaction set.
2) Clustering
Clustering is a technique for grouping together unlabeled items that have similar characteristics. In the
web usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page
clusters.
2. E-government Personalized Information Service System Framework
2.1 Personalized Information Service
Personalized information service is an information service tailored to the individual user. Based on
familiarity with users' interests and web behavior, it recommends to each user only the information that
matches his or her personal characteristics. It should be organized according to the user's knowledge
structure, psychological orientation, information needs, and behavior patterns, so that it sufficiently
stimulates user needs, promotes effective retrieval of and access to information, and promotes the
effective use of information for knowledge-based innovation.
2.2 Personalized Information Service Model
We put forward the following service model: using mining technology, the system analyzes web users'
logs or other records to find their visiting habits and interest patterns, matches these against the
information on the website, and finally recommends to users the information that may interest them.
Figure 1: Personalized Information Service Model (web log data -> extraction of interesting patterns -> real-time information recommendation -> users)
2.3 System Requirements Analysis
As a typical information service system whose main function is to provide information, the
e-government personalized information service system has the following requirements:
1) Recommendation of users' frequent access paths
The pattern mining engine provided by the system finds users' frequent access paths and presents them
as hyperlinks in the pages shown to the users. In other words, the system automatically identifies and
stores each user's frequently visited pages; when the user next visits the site, hyperlinks to those pages
appear on the home page, so the user can go to the pages directly.
2) Recommendation based on usage clusters
In addition to recommending users' frequent access paths, the system also recommends information
based on usage clusters. That is, it recommends to the user information visited by other members of his
or her cluster. Because the user and the other members of the cluster have similar interests, information
that they visited may also interest this user.
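The usage-cluster requirement can be illustrated with a short sketch. It assumes that the clusters have already been computed and that the set of pages each user has visited is available; the function name, the data layout, and the ranking by number of visiting cluster members are illustrative assumptions rather than part of the system described here.

```python
from collections import Counter


def recommend_from_cluster(user, visited, clusters, limit=3):
    """Recommend pages visited by other members of the user's cluster.

    visited:  {user: set of pages that user has visited}
    clusters: list of sets of users (assumed to come from the mining module)
    """
    members = next((c for c in clusters if user in c), set())
    own_pages = visited.get(user, set())
    counts = Counter()
    for other in members:
        if other != user:
            # Only pages the target user has not seen yet are candidates.
            counts.update(visited.get(other, set()) - own_pages)
    # Pages seen by more cluster members first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:limit]]


# Hypothetical data for illustration only.
visited = {"u1": {"a", "b"}, "u2": {"a", "b", "e"}, "u3": {"b", "e", "j"}, "u4": {"k"}}
clusters = [{"u1", "u2", "u3"}, {"u4"}]
suggestions = recommend_from_cluster("u1", visited, clusters)
```

Here user u1 is offered page e before page j because two members of the cluster visited e and only one visited j.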
2.4 System Framework Design
To meet the personalization requirements above, web log mining is integrated with the personalized
information service model, and the personalized information service system framework is designed as
follows:
Figure 2: E-government Personalized Information Service System Architecture (labels recoverable from the figure: site structure, web pages, and web logs; data preprocessing module producing session (transaction) files; mining module with web log mining algorithms producing usage clusters, page clusters, and the frequent access path set; real-time intelligent recommendation module taking the current session from the web server and returning recommended pages to the user)
1) Data preprocessing module: corresponds to the data preprocessing stage applied to the log data.
2) Data mining module: the difficult question in this module is how to handle different problems with
different algorithms. This paper applies the frequent access path mining algorithm and the BIRCH usage
clustering algorithm to meet the requirements of personalized service.
3) Real-time intelligent recommendation module: this is the only online processing module, and it acts
as the adapter between users and the system.
3. Web Log Mining Algorithms
3.1 Frequent Access Path Mining Algorithm
A user's frequent access path is a page sequence that the user browses frequently within a certain
period of time, and it best reflects the user's interests in that period. Mining users' frequent access paths
is therefore important for finding their current interests and providing personalized service. The input of
the frequent access path mining algorithm is the result set of transaction identification, the MFP set. The
output is the set of the user's frequent access paths with their corresponding supports. Based on these
results, the system can build user interest models. The relevant definitions and concepts are as follows.
Definition 1 Let $P = \{\chi_1, \chi_2, \ldots, \chi_n\}$ be a page sequence. $P$ is called a frequent page sequence
or frequent access pattern if $P$ meets the condition:
$$\frac{|\{T \mid P \subset T\}|}{|WTS|} \times 100\% \ge S_{\min}$$
where $T$ is a web transaction, $S_{\min}$ ($0 < S_{\min} < 1$) is a minimal support threshold specified by the
user, and $WTS$ is a web transaction set.
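As a small worked illustration of the support condition in Definition 1, the sketch below computes the fraction of transactions that contain a given page sequence. It assumes that "P is contained in T" means that the pages of P appear in T as a contiguous run, which is an interpretation rather than something the definition states, and the transactions are hypothetical.

```python
def contains_path(transaction, path):
    """True if `path` occurs in `transaction` as a contiguous run of pages."""
    k = len(path)
    return any(
        tuple(transaction[i:i + k]) == tuple(path)
        for i in range(len(transaction) - k + 1)
    )


def support(path, transactions):
    """Fraction of transactions containing `path` (a value between 0 and 1)."""
    hits = sum(1 for t in transactions if contains_path(t, path))
    return hits / len(transactions)


def is_frequent(path, transactions, s_min):
    """Check the condition of Definition 1 for the threshold s_min."""
    return support(path, transactions) >= s_min


# Hypothetical web transaction set; each letter stands for one page.
WTS = ["abej", "abe", "acf", "bej"]
s_abe = support("abe", WTS)
s_be = support("be", WTS)
```

With a threshold of 0.5, both "abe" (in 2 of 4 transactions) and "be" (in 3 of 4) count as frequent in this toy set, while "acf" does not.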
Definition 2 A page sequence of length n is also called an n-sequence.
Definition 3 Candidate path: if two time-ordered subsequences $\{\chi_j, \ldots, \chi_{j+k-2}\}$ and
$\{\chi_{j+1}, \ldots, \chi_{j+k-1}\}$ are both elements of $FP_{k-1}$, in other words, their supports are not smaller than
the support of $P_{k-1.m}$, then $\{\chi_j, \ldots, \chi_{j+k-1}\}$ is called a candidate path of $FP_k$.
To mine users' frequent access paths of length $k$, the set $FP_k$ is constructed. The main idea of the
algorithm rests on the concept of the candidate path set: find the candidate paths of length $k$ in the MFP
set, then calculate their supports over all users' sessions. The $M$ paths with the largest supports form the
set $FP_{k.m}$.
Construct FPk Algorithm:
Input: set of MFPs: Fi
Output: set of frequent access paths: FPk (k > 1)
For every Fi {
    For each {χ_1, χ_2, …, χ_m} in Fi {
        if (k ≤ m) {
            For (j = 1; j ≤ m-k+1; j++) {
                if {χ_j, …, χ_(j+k-1)} is in FPk
                    support of {χ_j, …, χ_(j+k-1)} += 1
                else if support of {χ_j, …, χ_(j+k-2)} ≥ s_(k-1)
                        and support of {χ_(j+1), …, χ_(j+k-1)} ≥ s_(k-1)
                    insert {χ_j, …, χ_(j+k-1)} into FPk with support 1;
}}}}
Before calling the algorithm above, calculate the support of every page in the sessions, that is, of every
path of length 1. Then call the algorithm repeatedly for lengths 2 through k; each round can reuse the
supports computed in the previous round.
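The level-wise procedure described above (supports for length 1 first, then lengths 2 through k, each round reusing the previous one) might be sketched in Python as follows. This is a hedged reading of the algorithm, not the authors' implementation: support is taken to be a raw occurrence count, the M best paths per level are kept with ties resolved by order of first appearance, and the function names and sample paths are hypothetical.

```python
from collections import Counter


def windows(path, k):
    """All contiguous sub-paths of length k, in order of occurrence."""
    return [tuple(path[j:j + k]) for j in range(len(path) - k + 1)]


def frequent_paths(mfps, k_max, top_m):
    """Return {k: Counter of the top_m paths of length k with their supports}."""
    levels = {}
    # Length 1: support of every single page.
    counts = Counter(w for path in mfps for w in windows(path, 1))
    levels[1] = Counter(dict(counts.most_common(top_m)))
    for k in range(2, k_max + 1):
        previous = levels[k - 1]
        counts = Counter()
        for path in mfps:
            for w in windows(path, k):
                # Candidate test: both (k-1)-sub-paths survived the previous level.
                if w[:-1] in previous and w[1:] in previous:
                    counts[w] += 1
        levels[k] = Counter(dict(counts.most_common(top_m)))
    return levels


# Hypothetical MFP set; each letter stands for one page.
mfps = ["abe", "abej", "abe", "acf"]
levels = frequent_paths(mfps, k_max=3, top_m=3)
```

In this toy input only the path a, b, e survives to length 3, because j, c, and f are already dropped at length 1 and so no longer path containing them can become a candidate.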
3.2 BIRCH Algorithm
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) integrates a variety of clustering
techniques. A single scan of the data objects produces a basic clustering, and additional scans
significantly improve the quality of the clustering.
Before the BIRCH algorithm is applied, the sessions must be converted to numerical form. Let |X|
denote the number of pages on the website. Each session can then be expressed as an |X|-dimensional
vector in which each component is 1 or 0: 1 means the page was visited in the session, and 0 means the
page was not visited.
Table 1: Visit Matrix
session  X1  X2  X3  X4
1        1   1   0   1
2        0   0   1   1
3        1   0   1   0
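The conversion of sessions into 0/1 vectors can be sketched in a few lines. The session contents below are simply read back from the rows of Table 1; the function name and the representation of a session as a list of page names are assumptions made for illustration.

```python
def visit_matrix(sessions, pages):
    """One row per session, one column per page: 1 if visited, else 0."""
    return [[1 if page in session else 0 for page in pages] for session in sessions]


pages = ["X1", "X2", "X3", "X4"]
sessions = [
    ["X1", "X2", "X4"],  # session 1
    ["X3", "X4"],        # session 2
    ["X1", "X3"],        # session 3
]
matrix = visit_matrix(sessions, pages)
```

Note that this representation discards both the order of visits and repeat visits within a session, which is acceptable for clustering by interest but not for path mining.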
The BIRCH algorithm can be applied to dynamic clustering. It needs two parameters: the maximum
number of clusters K and the maximum cluster radius r. Each cluster is represented by a pair (Ci, Ri),
denoting the center and the radius of the cluster respectively. The algorithm ensures that every cluster is
tight as long as the number of clusters does not exceed K; when the number of clusters would exceed K,
the threshold r must be enlarged.
Algorithm steps are as follows:
1) Temporarily determine the cluster centers: randomly select several session vectors such that the
distance between any two of them is more than 2r; set j = 1;
2) Read the j-th session record and calculate its distance d to each cluster center. Assume the distance to
Ci is the shortest;
3) Tentatively assign the j-th session to the i-th cluster and calculate the new radius Ri' of the i-th cluster.
If Ri' ≤ r, the i-th cluster remains tight, so session j is assigned to the i-th cluster and the cluster radius is
updated to Ri'.
If Ri' > r, the i-th cluster would not remain tight. If the number of clusters is now smaller than K, let
session j be a new cluster with center Ci = j and radius Ri = 0; otherwise enlarge the threshold r and go
back to step 1) to run the calculation again.
4) j = j + 1; if j ≤ the total number of sessions, go to step 2); otherwise end.
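Steps 1) to 4) can be sketched as below. This follows the threshold-based procedure listed above and is not a CF-tree implementation of BIRCH. Several details are assumptions because the text leaves them open: the center is the mean of the member vectors, the radius is the root-mean-square distance of the members to the center, seeds are taken in input order rather than at random so that runs are repeatable, and the threshold grows by a factor of 1.5 on each restart.

```python
import math


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def radius(vectors):
    """Root-mean-square distance of the members to their centroid (assumed)."""
    c = centroid(vectors)
    return math.sqrt(sum(distance(v, c) ** 2 for v in vectors) / len(vectors))


def cluster_sessions(vectors, max_clusters, r, growth=1.5):
    """Return (clusters, final r); each cluster is a list of session vectors."""
    if r <= 0 or growth <= 1:
        raise ValueError("r must be positive and growth must exceed 1")
    while True:
        # Step 1: seed clusters with vectors lying more than 2r apart.
        seeds = []
        for i, v in enumerate(vectors):
            far = all(distance(v, vectors[s]) > 2 * r for s in seeds)
            if len(seeds) < max_clusters and far:
                seeds.append(i)
        clusters = [[vectors[s]] for s in seeds]
        restart = False
        for j, v in enumerate(vectors):
            if j in seeds:
                continue
            # Step 2: find the cluster whose center is nearest to session j.
            nearest = min(clusters, key=lambda m: distance(v, centroid(m)))
            # Step 3: accept the session only if the cluster stays tight.
            if radius(nearest + [v]) <= r:
                nearest.append(v)
            elif len(clusters) < max_clusters:
                clusters.append([v])
            else:
                restart = True  # too many clusters: enlarge r and start over
                break
        if not restart:
            return clusters, r
        r *= growth


# Hypothetical session vectors over four pages.
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
clusters, final_r = cluster_sessions(vectors, max_clusters=2, r=0.6)
loose_clusters, grown_r = cluster_sessions(vectors, max_clusters=2, r=0.3)
```

Starting from r = 0.3 the threshold is too strict for two clusters, so the sketch enlarges it twice before it reaches the same grouping as the run that starts from r = 0.6.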
4. Experimental System: WLMES
4.1 System Design
The system is developed on the Windows Server 2003 operating system and the Tomcat server
platform, using the Java programming language. The system interface is as follows:
Figure 3: System Interface
4.2 Experimental Analysis
A web server log file was selected as the test data. It contains more than 80 records from May 5, 2006;
the following figure shows part of the original web log.
Figure 4: Original Web Log
The web log data come from a simple website whose topology is as follows (each letter represents a
page):
Figure 5: Topology of the site
To improve accuracy and reduce the amount of mining, the first session of the first user is used as the
mining data source, and the frequent access path mining algorithm is run with a path length of three. The
results are as follows:
When k = 3, the set of frequent access paths and their corresponding supports is:
{bej=2, abe=4}
The conclusion is:
In the first session of the first user, for paths of length three, the visit counts are {bej=2, abe=4}. In
other words, the path from a to b and then e was visited 4 times, and the path from b to e and then j was
visited 2 times.
Based on these results, the user is more interested in the content along a->b->e, which can be
recommended to the user as soon as he or she enters the site.
Conclusion
Web mining is a good choice for e-government personalized service systems, and it has broad prospects
for future development. However, both in theory and in practice, some problems in web usage mining
remain. Our future research will focus on accurately identifying users in a proxy environment and on
determining session boundaries.
References
[1] Sun Huamei. Research on Web Usage Mining Method And the Theory. Doctoral Dissertation of
Harbin Industry University. March 2005
[2] "China Informatization Development Report 2005". State Council Informatization Office. July 26,
2005
[3] Li Jianxiang. Applications of BIRCH Algorithm in Design of Adaptive Web. Beijing Business
University (Natural Science). June 2003, Vol. 21, No. 2
[4] Liu Hong-lu et al. "Introduction to E-government". Posts And Telecom Press. 2005
[5] "2004 China Internet Information Resources Report". State Council Information Office. April 14,
2005
[6] Tao Huanhua, Jiang Lingyan. Web-based Data Mining Behavior Analysis and Research. Fujian
Computer. 2004, No. 3
[7] Jin Fengrong. Study of Web Usage Mining and Discovery of Browse Interest. Master's Degree
Thesis of Beijing Science and Technology University. February 2004
[8] Zhang Sulan. Research and Implementation of Web Users Visited Mining Related Technology.
Master's Degree Thesis of Beijing Science and Technology University. May 2004
[9] Xie Chunli. Analysis and Research of Web-based Data Mining Behavior. Master's Degree Thesis of
Suzhou University. April 2003
[10] Han Jiawei. Research on Web Mining. Computer Research and Development. 2001
[11] Yang Yiling et al. A simple Web Log Mining System. Shanghai Jiaotong University Journal. July
2000, Vol. 34, No. 7