Ernestina Menasalvas
[email protected]
Facultad de Informática, Universidad Politécnica de Madrid
May 2004

Introduction and motivation
• Internet as a communication channel.
• Technology needed to develop new services, security, infrastructure, analysis.
• Web Mining to analyze the patterns so that the services respond to user needs.
• Most of the web mining projects developed so far have not taken into account the context in which they were developed:
  – Competitive society
  – Success criteria depend on both:
    • User satisfaction
    • Increase in sponsors' benefit
• The gap between technology development on the web and business factors is increasing and, as a side effect, generates a separation between what technologists develop and what companies need.
• Knowing that the problem exists is just the beginning…
• Technological projects have to be integrated into the global strategy of the company.

The problem
• Innovative ideas in e-commerce are vaguely defined, so they lose focus and precision.
• New technologies are being applied, consuming resources, but without appropriate financial or economic benefits.
• The growth of web activity and its participation in every daily activity (commercial, educational, news, ...) is not being matched by a corresponding number of services.
• Services are considered insufficient.
• Thus, site sponsors have to improve the services offered to satisfy the increasing demand.
• On the other hand, the growth in offers will bring a growth in demand, which will lead consumers to ask for better services.
• Web Mining projects have to be planned as one more project in the global strategy of the company.

Web Site personalization
Optimization and personalization of the user's web experience is crucial for attracting and retaining electronic, web-based commerce customers.
Try to maintain the one-to-one relationship.
Identifying future behaviour is crucial for the site to act proactively.
Information about user experience is captured in clickstream logs: pages viewed, timing, and sequence.
• Solutions given:
  – Clustering of users
  – Clustering of pages
  – Most visited path
  – Recommender systems
  – …
• The question:
  – How to deploy?
  – How has the method been evaluated?
  – How does it help the company?
  – How does it evolve over time?

Web Mining project evaluation
• The criteria used to evaluate the success of a site do not take external (commercial) aspects into account.
• Aspects such as increased sales volume, fraud reduction, customer retention and competitive prices are not explicitly tackled.
• Success in web sites is a measure related to efficiency and quality:
  – Efficiency: number of pages accessed during a session, length of the session and actions performed
  – Quality: response time of the site to user requests, page accessibility, visitors per page …
• Company success is evaluated in terms of:
  – Incomes, outcomes, expenses
  – ROI, market presence
• The differences between the criteria used to evaluate the success of any project in the enterprise and those used for a web project are at the root of the incomplete success of web mining.
• Site sponsors do not evaluate commercial and financial aspects and rely only on vague commercial notions.
• Success in terms of use, structure and content has to be linked to the achievement of the company's business goals.

Web Mining project management
• An enterprise is a system designed to fulfil certain goals by means of the integration of different resources.
• Subsystems are at the same time interrelated and interdependent.
• When the company uses the Web as a channel, all the services, infrastructure, …, have to be seen as one of the subsystems.
• The success of a solution in the web subsystem has to be related to the behaviour of the rest of the subsystems.
• Web Mining projects are concerned with the Web subsystem.
• So a web mining project is not only an IT problem.
• Apply a project management methodology to control the process: a project manager is needed -> a different role from the data miner.
• Identify Data Mining problems.
• For each of them apply CRISP-DM.

Web Mining Project management (cont)
• To properly deal with a data mining project we need explicit information about the company:
  – Structure of the company (departments, sections, channels, …)
  – Goals of the company and success criteria (both at the higher level and at the department level)
• Company environment, identify:
  – Resources, constraints, and any factor that can determine the goal analysis and the development of a web project
  – Web project goals and their relationship with the goals of the company
• To evaluate whether the web mining project results contribute to the fulfilment of the company goals:
  – The web site is usually not the end but the means.
  – It is one of the channels that the company uses to achieve its goals.
  – So, for a site to be considered successful, the activities carried out through the site must generate value for the company.
• Traditional approaches only analyze the site from the user perspective, but the actions of the users have to generate value for the company.
• It is a CRM project.
• Web Project plan generation

CRM project – the three legs
[Figure: CRM architecture diagram. Main blocks: Operational CRM, Analytical CRM and Collaborative CRM. Other labels include front office, back office and mobile office; ERP/ERM, supply chain management and legacy systems; sales, marketing and service automation; data warehouse; closed-loop processing; and customer interaction channels (voice, e-mail, web conferencing, fax, letter, direct interaction).]

Data Mining
[Figure: pyramid of increasing potential to support business decisions. Layers from top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts (OLAP, MDA); Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside: End User, Business Analyst, Data Analyst, DBA.]

Fact Gap
"Fact Gap": the difference between the available information and the ability to make decisions based on that information. (Gartner Group)

Data Mining gives the intelligence
• Databases give the data.
• But intelligence is needed to explore the data and find the patterns, rules and ideas that explain what is going on and predict what will happen.
• Techniques and tools are needed to add this intelligence to the data in order to extract the maximum benefit from it.
• But tools alone (nowadays) do not provide the intelligence; it has to be provided by EXPERTS and translated into the data for better understanding.
Data warehouses and databases are the support.

Data Mining Standard process model: CRISP-DM
[Figure: CRISP-DM cycle with the phases Problem Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.]

Building the bridge
• In order to provide users with the most appropriate solution, the data to be analyzed have to be enriched with business information.
• Business problems have to be translated into data mining problems.
• Results have to be understandable not only by data mining experts but also by end users.
• The semantics underlying the data mining solution have to be settled.

Deeper analysis of Personalization
• What is personalization?
• Observe user-web page interactions to identify patterns that: indicate high-level user activity, anticipate future user activity, make it possible to act proactively.
• What is going to be personalized?
  – The site: this means pages according to the user's behaviour or pattern
• Why is personalization needed?
  – To improve the site performance
  – The web is just another channel
  – Site performance has to do with achieving the goals of the company
• Who is the user?
  – Navigator
  – Customer

Web Data to be analyzed
• In any web mining problem we have data related to:
  – Pages
  – Navigators and navigation
  – Customers and their transactions
• Web logs are just the beginning.
• Not only the data have to be taken into account but also all the circumstances under which the data were collected:
• Environment
  – General
  – Organization-related
  – Customer-related

Environment
• Affects, both directly and indirectly, the way activities occur. Among the factors to take into account:
  – Legal conditions
  – Technological conditions
  – Demography
  – Ecological conditions (weather, transport, communications)
  – Cultural and social conditions
  – Geographical situation
• Take into account the location of the site, of the navigator, …

Information to be added
• Departments:
  – The same concept can have different meanings depending on the department
  – A product for marketing is not the same as a product for production
• Products, services:
  – Data per se of the object: size, color, …
  – Data relevant for the company: profit margin, top ten, …
  – How it is presented on the web
• People, consumers in general:
  – Static data: gender, demographic information (varies over time but at a particular moment it is static)
  – Roles: …
  – Behaviour with the company being analyzed: number and kind of transactions he/she performs
  – Behavioural data related to the environment (economy, legal constraints, climate, …)
• Navigators:
  – Web log: location (IP), time, browser, …
  – Behaviour: comparison with the "normal", if any, to discover mood, different location, …
• Dates:
  – A date by itself has no meaning
  – Legal and fiscal periods, holidays, weekends
  – Opening, closure, …

Data enrichment
• There is no method, no model to follow.
• It is more an art, learned only with experience.
• Projects for the same domain share the enrichment:
  – A model could be established
  – Evaluate whether the data are appropriate to mine
  – Evaluate the kinds of patterns that can be obtained
  – Evaluate whether a certain pattern cannot be obtained
• Metadata is needed about the data:
  – Meaning for the business of each value, attribute, page, action, …
• Metadata for each attribute has to include semantics:
  – Meaning: group according to it: demographic, behavioural, environmental, social, cultural
  – Business value
  – Circumstances
  – Constraints
  – Relationship with other concepts
• Ontology of concepts ???
• Integrate metadata so the mining activity deals with them.

Data Modelling and deployment
• Once the data are enriched, the extracted patterns can be interpreted according to:
  – User profiles
  – Session value (according to certain goals)
  – Period of the day
• The solution has to be deployed and integrated into the site structure.
• Patterns evolve over time as new data arrive.
• Models have to be refined.
• Establish the basis for the model to be refined without a decrease in performance.

Web Mining infrastructure
[Figure: architecture diagram. Labels: User, HTTP Client, Interface Agent, HTTP Request, HTTP Response, Original Website, Decision Layer, User Agent, Action Plan, Users, CRM Services Provider Layer, Planning Agents, VWi Agent, User Model, Services, Information, Operational, Plans, Semantic Layer, Agents, Models, WebLogs.]

Case-study: act according to the value of the current session
Patterns to help:
Predict user behavior based on current behavior, not identity.
Abstract user behavior with varying degrees of granularity => subsessions.
Estimate the value of the session in order to act accordingly.
Subsessions capture/approximate user state information.
Key concept: frequent behavior paths.
Markov model to predict the next set of pages and behaviour.
Webhouse to store information about users.
Modify Apache: pop-ups and precaching.

Case-study 1.
Find behavior rules
Partial tree: define break points as decision points in the path. Use them to create rules.
Knowing PIND allows us to predict a set of pages to follow....
[Figure: partial navigation tree with break points, PIND and PDEP pages marked.]

Behaviour rules
– Main page, Notice board -> Exams
– Main page, Notice board -> Assignments, Support material for Assignment 1
– Main page, Notice board -> Assignments, Support material for Assignment 2
[Figure: navigation tree over the pages Main page, Notice board, Exams, Assignments, Support material for Assignment 1 and Support material for Assignment 2, with numbered links and with decision pages and target pages marked.]

2. Find Subsessions
Sessions may be described in terms of subsessions. E.g., browse catalog, browse shipping information, browse privacy notices, perform purchase.
Subsessions may be defined in a number of ways, according to the desired semantics. E.g., use breakpoints.
[Figure: real-time user web page access path (click-path), with identified frequent pa…; web page access path expressed as a sequence of subsessions. Labels: PIND, PDEP, Click-path, Subsession.]

3. Markov models to predict behavior and paths
[Figure: sessions (session1 to session6) related to Behavior X and Behavior Y, break points BK N, BK M and BK P, and Dep1, Dep2 and Dep3.]

4. Per user analysis: average time spent in page
[Figure: time (secs, axis from 0 to 60) per URL (URLs 1 to 23).]

5. Online Value evolution
[Figure: session value (axis from -5 to 35) against the number of links traversed (1 to 15) for session 1, session 2 and session 3.]

Benefits of the algorithm
• Makes it possible to know at any point whether the ongoing navigation would be beneficial for the site, so that the site can be dynamically adjusted accordingly.
• Quantifies the value of a user session while he or she is navigating.
• Makes the user-site relationship closer to real-life relationships.
• The algorithm integrates the site/department goals:
  – Sends pop-ups to students according to the exercises they have already done
  – Professors can establish preferences and the rules are changed accordingly
  – …

Conclusion
• Without proper project management:
  – Difficult to obtain significant patterns
  – Difficult to interpret the results
  – The potential of the process is minimized
• Site goals have to be integrated.
• Algorithms alone are of no use: the best algorithm does not always mean the best result.
• The patterns have to be deployed in a proper architecture.

THANKS! QUESTIONS???
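
Supplementary sketch for step 3 of the case study (Markov models to predict behavior and paths). The slides do not give the model's details, so what follows is only a minimal first-order Markov chain under stated assumptions: each session is a list of page identifiers, the next page depends only on the current page, and the page names and the functions `fit_transitions` and `predict_next` are hypothetical, introduced here for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities from page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every observed (current page -> next page) step.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    model = {}
    for current, next_counts in counts.items():
        total = sum(next_counts.values())
        model[current] = {page: n / total for page, n in next_counts.items()}
    return model


def predict_next(model, current_page, k=2):
    """Return up to k most likely next pages as (page, probability) pairs."""
    candidates = model.get(current_page, {})
    # Sort by descending probability, breaking ties by page name.
    return sorted(candidates.items(), key=lambda item: (-item[1], item[0]))[:k]


# Illustrative sessions only (assumed data, not taken from the case study).
sessions = [
    ["main", "board", "exams"],
    ["main", "board", "assignments", "material_1"],
    ["main", "board", "assignments", "material_2"],
    ["main", "assignments", "material_1"],
]

model = fit_transitions(sessions)
for page, prob in predict_next(model, "board"):
    print(f"board -> {page}: {prob:.2f}")
```

A prediction of this kind is what would let the site precache or recommend the likely next pages; a model closer to the one in the slides would presumably condition on break points and subsessions rather than on the last page alone.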