Survey
Prasanna K. Desikan
E-mail: [email protected]
CSCI 8701
ID# 1916156
Date: 02/07/2002

Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns

This paper identifies a model of user behavior that separates web page references into those made for navigational purposes and those made for information content purposes. It presents a general model for identifying transactions for data mining from WWW log data. The contributions of the paper are: a definition of a generic transaction identification module, a definition of a user browsing behavior model, the development of specific transaction identification modules, and an evaluation of the different transaction identification modules.

The user browsing behavior model assumes that a given user treats a page either as a navigation page, used to find links, or as a content page, used for the actual information it holds. The paper also assumes that the page references in a server log can be readily sorted by user identification. Each reference can then be classified as a navigation reference or a content reference.

The paper first introduces a general model for transaction identification based on either dividing a large transaction into multiple smaller ones or merging small transactions into fewer large ones. The discussion then moves on to three specific transaction identification modules. The first, the reference length module, is based on the assumption that the amount of time a user spends on a page correlates with whether the page should be classified as a navigation page or a content page for that user. In the maximal forward reference module, each transaction is defined to be the set of pages in the path from the first page in the log for a user up to the page before a backward reference is made. The time window module simply divides the log for a user into time intervals no larger than a specified parameter. It assumes that meaningful transactions have an overall average length associated with them.

Next, the WEBMINER system is explained in brief. It has two main parts. The first part is the domain-dependent process of transforming the Web data into a suitable "transaction" form. The second part is the largely domain-independent application of generic data mining techniques.

Test server log data was created, and three different types of web sites were modeled for evaluation: a sparsely connected graph, a densely connected graph, and a graph with a medium amount of connectivity. The experimental evaluation was done using both the created data and real data. In both cases the reference length module performed better, though the maximal forward reference module did fairly well on the sparsely connected graphs of the created data and when the association rule algorithm was run with navigation-content transactions on the real data.

An important area of research is the development of methods for clustering log entries into user transactions using criteria such as the time differential among entries, the time spent on a page relative to the page size, and user profile information collected during registration.
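
The three transaction identification modules summarized above can be illustrated with a short Python sketch. This is not the paper's implementation. The function names, the (timestamp in seconds, page) log format, and the sample log are hypothetical. The reference length cutoff is simply passed in as a parameter, and the last page of a log is assumed to be a content page because its viewing time cannot be computed from the log. Each function takes the time-sorted log of a single user.

```python
def time_window_transactions(log, window):
    """Split one user's log so that no transaction spans more than `window` seconds."""
    transactions, current, start = [], [], None
    for t, page in log:
        # Start a new transaction once the window measured from its first entry is exceeded.
        if current and t - start > window:
            transactions.append(current)
            current = []
        if not current:
            start = t
        current.append(page)
    if current:
        transactions.append(current)
    return transactions


def reference_length_transactions(log, cutoff):
    """Close a transaction at every page viewed for at least `cutoff` seconds.

    Pages viewed for less than `cutoff` are treated as navigation pages.
    Assumption: the last page in the log counts as a content page.
    """
    transactions, current = [], []
    for i, (t, page) in enumerate(log):
        current.append(page)
        is_last = i == len(log) - 1
        if is_last or log[i + 1][0] - t >= cutoff:
            transactions.append(current)  # navigation pages followed by one content page
            current = []
    return transactions


def maximal_forward_transactions(pages):
    """Emit the current forward path each time a backward reference ends a forward run."""
    transactions, path, moving_forward = [], [], False
    for page in pages:
        if page in path:
            # Backward reference: the path built so far is a maximal forward reference.
            if moving_forward:
                transactions.append(list(path))
            del path[path.index(page) + 1:]  # back up to the revisited page
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        transactions.append(list(path))
    return transactions


# Hypothetical single-user log: (seconds since first request, page).
sample_log = [(0, "A"), (5, "B"), (12, "C"), (200, "B"),
              (210, "D"), (500, "A"), (505, "E")]

print(time_window_transactions(sample_log, window=300))
# [['A', 'B', 'C', 'B', 'D'], ['A', 'E']]
print(reference_length_transactions(sample_log, cutoff=60))
# [['A', 'B', 'C'], ['B', 'D'], ['A', 'E']]
print(maximal_forward_transactions([page for _, page in sample_log]))
# [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'E']]
```

On this sample log the reference length and maximal forward reference sketches produce different transactions from the same references: the first groups pages by how long the user stayed on them, while the second groups them by the path followed before the user backed up.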