SeUGI 19, Florence, June 1st, 2001
Web clickstream analysis to understand customer behaviour
Erika Blanc and Paolo Giudici
UNIVERSITY OF PAVIA

SUMMARY
AIM: to combine sequence rules with statistical association methods to obtain valuable information on e-consumer behaviour from logfile data, and to show how this combined analysis can be carried out in SAS and SAS Enterprise Miner.

In the presentation we shall show:
• The available data (source: SAS Italy)
• Use of support and confidence rules
• Use of odds ratios and corresponding confidence intervals
• Symmetric models for sequences (graphical loglinear models)
• Asymmetric models for sequences (probabilistic expert systems)

THE DATA
Logfile of an e-commerce site. Below are some of the 250711 observations describing the visit behaviour of 22527 visitors to the 36 pages of the site.

DERIVED DATASET
For each visitor, 36 binary variables describing visit/non-visit to each page. This dataset, in which the order of visits is lost, will be used to calculate association rules and statistical measures.

CLASSICAL SEQUENCE AND ASSOCIATION RULES
• $\mathrm{Support}(A \Rightarrow B) = \dfrac{N_{A \Rightarrow B}}{N}$
• $\mathrm{Confidence}(A \Rightarrow B) = \dfrac{N_{A \Rightarrow B}}{N_A} = \dfrac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)}$
The previous rules, implemented in SAS Enterprise Miner, are applicable to both associations and sequences, yet with a different meaning.

HIGHEST CONFIDENCE 2-SEQUENCE RULES FOUND IN THE DATA

HIGHEST CONFIDENCE ASSOCIATION RULES FOUND IN THE DATA

HIGHEST CONFIDENCE N-SEQUENCE RULES FOUND IN THE DATA

GRAPHICAL REPRESENTATION FOR SEQUENCE RULES

STATISTICAL ASSOCIATION MEASURES: ODDS RATIOS
$$\theta_R = \frac{\theta_1}{\theta_0} = \frac{P(Y=1 \mid X=1) \,/\, P(Y=0 \mid X=1)}{P(Y=1 \mid X=0) \,/\, P(Y=0 \mid X=0)}$$
Interpretation:
$\theta_R > 1$: POSITIVE ASSOCIATION
$\theta_R = 1$: NO ASSOCIATION
$\theta_R < 1$: NEGATIVE ASSOCIATION

ODDS RATIOS: AN EXAMPLE

GRAPHICAL REPRESENTATION FOR ODDS RATIOS
For odds ratios, confidence intervals can be built, giving a more reliable evaluation of associations.
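As a rough illustration of the measures defined above, the following Python sketch computes support, confidence, and an odds ratio with an approximate 95% confidence interval on a small invented dataset. It is not the SAS Enterprise Miner procedure used in the presentation: the visit data, the function names, and the choice of a Wald interval on the log scale are all assumptions made for illustration, since the slides do not say how the intervals were built. Rules are treated here as unordered associations, as in the derived binary dataset.

```python
import math

# Invented toy data: one set of visited pages per visitor (not the real logfile).
visits = [
    {"addcart", "freeze"},
    {"addcart", "freeze"},
    {"addcart", "freeze"},
    {"addcart"},
    {"freeze"},
    {"product"},
    {"product"},
    {"product"},
    {"product"},
    {"product"},
]


def support(rows, pages):
    """Fraction of visitors whose visit set contains every page in `pages`."""
    hits = sum(1 for visited in rows if pages <= visited)
    return hits / len(rows)


def confidence(rows, a, b):
    """Confidence(A => B) = support(A and B) / support(A)."""
    return support(rows, {a, b}) / support(rows, {a})


def odds_ratio_ci(rows, x, y, z=1.96):
    """Odds ratio between pages x and y with a Wald interval on the log scale.

    Assumes all four cells of the 2x2 table are non-zero.
    """
    n11 = sum(1 for v in rows if x in v and y in v)
    n10 = sum(1 for v in rows if x in v and y not in v)
    n01 = sum(1 for v in rows if x not in v and y in v)
    n00 = sum(1 for v in rows if x not in v and y not in v)
    # theta_1 = n11 / n10 and theta_0 = n01 / n00, so theta_R is their ratio.
    theta = (n11 * n00) / (n10 * n01)
    std_err = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    lower = math.exp(math.log(theta) - z * std_err)
    upper = math.exp(math.log(theta) + z * std_err)
    return theta, lower, upper


print(support(visits, {"addcart", "freeze"}))
print(confidence(visits, "addcart", "freeze"))
print(odds_ratio_ci(visits, "addcart", "freeze"))
```

On this toy data the odds ratio is 15, yet the interval includes 1 because there are only ten visitors: a large point estimate alone does not establish an association.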
If the confidence interval for an odds ratio contains the value 1, the association is not significant. Therefore, in the graphical representation, NO link will be inserted between the two corresponding nodes. Otherwise, if the value 1 lies outside the confidence interval for the odds ratio, a link can be inserted. However, we shall insert only links describing positive associations.

GRAPHICAL REPRESENTATION FOR ODDS RATIOS

COMPARISON ODDS RATIOS - SEQUENCE RULES
Odds ratios are symmetric (no order) and measure associations. They can easily be accompanied by inferential models (e.g. confidence intervals), giving a variability assessment.
Sequence rules can be asymmetric (order taken into account) and measure dependencies. They cannot (yet) be related to inferential models. We are working on this.

COMPARISON ODDS RATIOS - CONFIDENCE MEASURES FOR ASSOCIATIONS

Page associations (A,B)   Odds ratio   Confidence (A-B) (%)   Confidence (B-A) (%)
freeze*pay_req               2041.72                  67.33                  99.56
pay_req*pay_res              1876.40                  66.77                  99.22
addcart*freeze               1616.54                  78.23                  99.40
download*shelf                911.53                  99.27                  41.56
addcart*pay_req               686.88                  52.88                  99.35
freeze*pay_res                629.30                  45.12                  99.15
register*regpost              543.99                  65.81                  98.57
addcart*pay_res               289.27                  35.41                  98.88
p_info*product                141.14                  99.71                  57.05
login*logpost                  22.39                  68.04                  85.14
download*pay_re                18.34                  58.35                  43.25
addcart*product                13.18                  97.97                  36.93
logpost*pay_req                11.73                  39.88                  79.14
freeze*product                 11.10                  97.85                  29.03
download*logpost               10.92                  81.89                  20.59
download*pay_re                10.22                  60.39                  30.12
cart*pay_req                    9.44                  49.63                  54.87
logpost*pay_res                 9.11                  26.45                  78.01
freeze*logpost                  8.63                  70.26                  52.35

PROPOSAL: A GRAPHICAL REPRESENTATION BASED ON ODDS RATIOS AND CONFIDENCE RULES

PROPOSAL: SYMMETRIC GRAPHICAL MODELS FOR CLICKSTREAM ANALYSIS

SELECTED SYMMETRIC GRAPHICAL MODEL

PROPOSAL: DIRECTED GRAPHICAL MODEL (PROBABILISTIC EXPERT SYSTEM)

ACKNOWLEDGEMENTS
This work was carried out as an internship project by Erika Blanc, Master's student at the University of Pavia, jointly supervised by Sabina Silani (SAS) and Paolo Giudici. We also thank SAS Italy for supporting us with the data as well as with the Enterprise Miner software.
For more details on the presentation, please see: Applied statistical methods for data mining, lecture notes by Paolo Giudici, [email protected]
The applied research activity of our group on data mining and risk management can be found at: www.baystat.it/giudici/index.htm

Web clickstream analysis to understand customer behaviour
Paolo Giudici, University of Pavia, [email protected]
Erika Blanc, University of Pavia, [email protected]

With the increased competition and decreased loyalty inherent in e-commerce, it is more imperative than ever for companies to gain, retain and grow their Web stakeholders (customers, prospects, partners, staff, etc.). While most companies are well aware of the technical processes involved in setting up and maintaining a Web site, they may not know what their visitors love or hate about the site, and can only guess how it could be improved. With so little information, it is difficult for companies to personalize their relationship with their stakeholders. For this reason we believe companies should implement an e-intelligence process that involves Web server planning, clickstream behaviour analysis, visitor profiling and purchase prediction. In this paper we focus on a case study, developed with SAS Enterprise Miner, that provides an example of clickstream analysis and shows the advantageous information that can be extracted from such a process. The case study also gives an opportunity to emphasize the importance of mining reliable association and sequence rules, as these can be highly relevant to understanding customer behaviour. From this viewpoint we illustrate our recent research comparing classical association and sequence rules with statistical methods for associations.
In particular, we introduce graphical models, as developed in the artificial intelligence and probabilistic expert systems literature. We show how these models can bring very useful information on Web customer behaviour, and illustrate ways to implement them in practice in SAS. We also compare them with what is currently implemented in Enterprise Miner for clickstream analysis. Finally, we note that the case study was developed during a Master's degree internship project by Erika Blanc at the University of Pavia, supervised jointly by Sabina Silani (SAS) and Paolo Giudici (University of Pavia). We acknowledge SAS Italy for both the data and the use of Enterprise Miner.
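The link-insertion rule stated in the slides (no link when the odds-ratio confidence interval contains 1, and only positive associations drawn) can be sketched in Python as follows. This is a minimal illustration, not the authors' SAS implementation: the function names are hypothetical and the interval values are invented, not taken from the case-study data.

```python
def select_links(intervals):
    """Return the page pairs whose odds-ratio interval lies entirely above 1."""
    links = []
    for pair, (lower, upper) in intervals.items():
        # lower > 1 means 1 is outside the interval on the positive side;
        # intervals containing 1 or lying below 1 produce no link.
        if lower > 1.0:
            links.append(pair)
    return sorted(links)


def adjacency(links):
    """Build an undirected graph (page -> set of linked pages) from the links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Invented intervals (lower, upper) for three hypothetical page pairs.
intervals = {
    ("addcart", "freeze"): (12.0, 40.0),   # significantly positive: link
    ("login", "shelf"): (0.8, 1.6),        # contains 1: no link
    ("cart", "download"): (0.2, 0.7),      # significantly negative: no link
}

links = select_links(intervals)
graph = adjacency(links)
print(links)
print(graph)
```

The resulting graph is symmetric, matching the unordered nature of odds ratios; a directed representation would need an ordered measure such as sequence-rule confidence.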