
gR2002, Vienna, September 13th

PAOLO GIUDICI
Faculty of Economics, University of Pavia
Research carried out within the laboratory: Statistical models for data mining (SMDM)
[email protected]

A small sample of web clickstream data (from a logfile)

C, "10908", 10908
V, 1108
V, 1017
C, "10909", 10909
V, 1113
V, 1009
V, 1034
C, "10910", 10910
V, 1026
V, 1017

Analysis of web clickstream data

1. In data matrix form (Giudici and Castelo, 2001; Blanc and Giudici, 2001):
- Association measures
- Association models (graphical association models)
2. In transactional data form (in this talk):
- Association and sequence rules
- Statistical models for sequences

Association measures and models

Based on data arranged in contingency table form. For instance:
- Odds ratios
- Graphical loglinear models
- Recursive logistic regression models
For a review, see Giudici, Applied data mining, Wiley, 2003.

Association and sequence rules

Implemented in the main data mining software packages.
Based on transactional databases. Such databases arise for instance in:
- Market basket analysis (order does not matter)
- Web clickstream analysis (order matters)
Aim: search for itemsets (groups of events) that occur simultaneously with a high frequency.

Formally:

• $A_1, \dots, A_p$: $p$ binary random variables.
• Itemset: a logical expression such as $A = (A_{j_1} = 1, \dots, A_{j_k} = 1)$, $k < p$.
• Association rule: a logical relationship between two itemsets, e.g. if $A$, then $B$, written $A \to B$.
  Example: A = (Milk, Coffee) → B = (Bread, Biscuits)
• Sequence rule: the relationship is determined by a temporal order.
  Example: A = (Home, Register) → B = (P_info)

Interestingness of a rule

• Support: $\mathrm{support}\{A \to B\} = \dfrac{N_{A \to B}}{N}$
• Confidence: $\mathrm{confidence}\{A \to B\} = \dfrac{N_{A \to B}}{N_A} = \dfrac{\mathrm{support}\{A \to B\}}{\mathrm{support}\{A\}}$
• Lift: $\mathrm{lift}\{A \to B\} = \dfrac{\mathrm{confidence}\{A \to B\}}{\mathrm{support}\{B\}}$

Here $N$ is the total number of transactions, $N_A$ the number containing $A$, and $N_{A \to B}$ the number in which the rule holds.

Apriori search algorithm (Agrawal et al., 1995): based on the support.
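The three interestingness measures above can be illustrated with a minimal Python sketch. The toy baskets and the function names (`support`, `confidence`, `lift`) are hypothetical and not from the talk, and order is ignored, as in market basket analysis.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support{A -> B} / support{A}: share of A-transactions that also contain B."""
    a, b = frozenset(a), frozenset(b)
    return support(a | b, transactions) / support(a, transactions)


def lift(a: Iterable[str], b: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """confidence{A -> B} / support{B}: values above 1 suggest positive association."""
    return confidence(a, b, transactions) / support(b, transactions)


# Hypothetical toy data, only for illustration (not the data set of the talk).
baskets = [
    frozenset({"milk", "coffee", "bread", "biscuits"}),
    frozenset({"milk", "coffee", "bread"}),
    frozenset({"milk", "bread", "biscuits"}),
    frozenset({"coffee"}),
]
A = {"milk", "coffee"}
B = {"bread", "biscuits"}

print("support   ", support(A | B, baskets))
print("confidence", confidence(A, B, baskets))
print("lift      ", lift(A, B, baskets))
```

An Apriori-style search would use `support` alone to prune candidate itemsets before confidence and lift are computed on the survivors.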
Application to real data

Data set from a logfile of an e-commerce site, kindly supplied by SAS.
Contains the user id (C_VALUE), the time of connection (C_TIME) and the page viewed (C_CALLER).
Number of clicks: 21889; number of visitors (sessions): 1240.

Exploratory step (data selected from a cluster of visitors, N. 3)

Cluster       N. obs  CLICKS  LENGTH  start  %PURCH
1             8802    8       6 min   h. 18  0.034
2             2859    22      17 min  h. 15  0.241
3             1240    18      59 min  h. 13  0.194
4             9251    8       6 min   h. 10  0.039
Overall mean          10      10 min  h. 14  0.072

Remark

Data could have been transformed from transactional to data matrix format. Doing so, information on the order of the visited pages would have been lost.
Data matrix format for the considered data: [table not preserved]

Application of the apriori algorithm

Most frequent indirect sequences of order 2

Most frequent indirect sequences of any order

Proposal: direct sequences

• Only "subsequent" visits are considered
• We have inserted two fictitious (deterministic) pages: (start_session; end_session)

Most frequent direct sequences of order 2

Towards a global model: graphical representation of direct association rules

Link analysis representation

Global models for web mining

Sequence rules are an instance of a local model (or pattern, see Hand et al., 2001) of data mining. A local model draws statistical conclusions on parts of the dataset, rather than on the whole.
Link analysis is an example of a global descriptive model.
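As a concrete instance of such a local pattern, the direct sequences of order 2 proposed earlier (only subsequent visits, with the fictitious pages start_session and end_session) can be counted as in the rough Python sketch below. The session lists reuse the page codes of the small logfile sample, assuming that each C record opens a visitor and each V record is a page viewed. The function names are hypothetical. Counting every occurrence of a transition, rather than once per session, is also an assumption and may differ from what the software used in the talk does.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def count_direct_pairs(sessions: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count consecutive page pairs, padding each session with fictitious pages."""
    pair_counts: Counter = Counter()
    from_counts: Counter = Counter()
    for pages in sessions:
        path = ["start_session", *pages, "end_session"]
        # zip(path, path[1:]) yields only directly subsequent visits.
        for a, b in zip(path, path[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    return pair_counts, from_counts


def direct_confidence(pair_counts: Counter, from_counts: Counter) -> Dict[Pair, float]:
    """Confidence of the direct rule a -> b: count(a, b) / count(a as origin)."""
    return {pair: n / from_counts[pair[0]] for pair, n in pair_counts.items()}


# Page codes taken from the logfile sample; grouping into sessions is assumed.
sessions = [
    ["1108", "1017"],
    ["1113", "1009", "1034"],
    ["1026", "1017"],
]

pair_counts, from_counts = count_direct_pairs(sessions)
conf = direct_confidence(pair_counts, from_counts)

for (a, b), c in sorted(conf.items(), key=lambda kv: -kv[1]):
    print(f"{a} -> {b}: count={pair_counts[(a, b)]}, confidence={c:.2f}")
```

Under this counting convention, the confidence of a direct rule of order 2 coincides with the relative-frequency estimate of a first-order transition probability out of the origin page.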
We have considered two global inferential models:
- probabilistic expert systems
- Markov chains

Probabilistic expert systems

Graphical models that describe (recursive) dependencies between (binary) random variables.
They can be described by a directed conditional independence graph, which specifies the factorisation of the joint probability distribution.
They ARE NOT directly comparable with sequence rules, which are local indexes for studying dependencies between events (itemsets).
They are built from contingency table data, and thus DO NOT model the order of visits to pages.

Probabilistic expert systems: structural learning

Probabilistic expert systems: quantitative learning

Markov chains for web mining

Ideal to model dependencies between events.
The order of the chain parallels the order of a sequence rule.
Data have been structured in the following form: [table not preserved]

Results from Markov chains (entrance to the site, start session)

Exit from the site (end session)

Most likely paths

[Path diagram: nodes Start_session, Home, Program, Product, P_info; edge probabilities shown: 17.80%, 45.81%, 70.18%, 26.73%]

Markov chains ARE DIRECTLY comparable with direct sequence rules.
E.g. for the most likely path: from start_session, the highest confidence is with home (45.81%), then program (20.39%), product (78.09%) and addcart (28.79%).
There are small differences, because the apriori algorithm considers only rules with support higher than a fixed threshold (e.g. 5%).

Essential references

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I. (1995) Fast discovery of association rules, in: Advances in knowledge discovery and data mining, AAAI/MIT Press, Cambridge.
Giudici, P. (2003) Applied Data Mining. Wiley, London.
Giudici, P. and Castelo, R. (2001) Association models for web mining. Journal of Knowledge discovery and data mining, 5, pp. 183-196.
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, MIT Press, New York.

THANKS FOR THE ATTENTION!

Comments to: [email protected]
www.baystat.it/giudici/index.htm
