Business Intelligence
We help people see and understand their data (Tableau)

Prof. Stefano Bordoni
[email protected]

Day/hours: Thursday 8,30-10; Friday 8,30-10
Start/end date: 29/09/2016 - 23/12/2016
Delivery type: Lecture/Lab
Total hours/lessons: 48/24
Course credits: 6/8h

Course framework and learning resources (text book, guides, help, tutorial, whitepapers, support):
• Module 1: Microsoft BI Stack - Excel. Excel File -> Help, https://support.office.com/en-us/excel
• Module 2: Tableau. http://www.tableau.com/learn
• Module 3: Data mining with Microsoft Excel. https://technet.microsoft.com/en-us/library/bb510516.aspx
• Module 4 (optional): Zoho ecosystem. https://www.zoho.com/

Related resources and additional documentation (slides, software, exercises, tutorial, source data):
• In common: slides (BI2016.pptx), data sources, exercises
• Exam information: Syllabus Exam.docx
• http://morgana.unimore.it/bordoni_stefano/BI2016/

Recommended:
• Self-service Business Intelligence e Data Mining con Microsoft Excel, Bordoni, 2013, ed. Pitagora
• Tableau Your Data! Fast and Easy Visual Analysis with Tableau Software [Paperback], Dan Murray

Learning assessment method: multiple choice final test. Midterm exam not mandatory. A self assessment copy of the exam software is available at: http://morgana.unimore.it/bordoni_stefano/BI2016/tester/testcourseBI2016.rar

Free software to download:
• Power Pivot: http://www.microsoft.com/it-it/download/details.aspx?id=29074
• DM add-ins: http://www.microsoft.com/it-it/download/details.aspx?id=35578
• SQL Server® 2012 Evaluation: http://www.microsoft.com/it-it/download/details.aspx?id=29066
• Tableau Public: http://www.tableausoftware.com/public/
• Tableau Desktop: http://www.tableau.com/products/desktop

Course goals:
Create meaningful visualizations by learning the science of data visualization and visual best practices.
Provide competence to become an analyst and developer of Self-service Business Intelligence and Data Mining applications.

Free Microsoft software from "Webstore dreamspark"*: http://e5.onthehub.com/d.ashx?s=u1vjaekwzj
* To install an ISO file you may use virtualclonedrive (http://www.slysoft.com/it/virtual-clonedrive.html) or

Course modules - Documentation and resources related to Exam

Introduction to BI and Self-service BI. BI process in 4 steps: 1 - Business objective and data preparation; 2 - Data analysing; 3 - Producing results; 4 - Sharing results
  Software: Excel 2010, Access 2010, Google site, Tableau, Zoho reports
  Related resources and source files: MYBI.mdb - Dataset_libro.mdb - Sample - Superstore Subset (Excel).xlsx

Module 1. Microsoft BI stack: BI process and Dashboard in Excel
  Software: Excel 2010, Access 2010, Power Pivot, OneDrive
  Related resources and source files: Excel (functions, connections) File -> Help; Tutorial: "Functions and techniques for reports and dashboard.xlsx"; Tutorial: "Dashboard in Excel with code.xlsm"

Module 2. Tableau
  Software: Tableau Desktop 9.3, Public 9.3, Online
  Related resources and source files: http://www.tableau.com/learn; more on Tableau: Video, Whitepapers, Quickstart, OnlineHelp, Knowledge base, Community, TC14 content, Reference guide

Module 3. Data Mining in Excel
  Software: Excel 2010, DM add-in
  Related resources and source files: Excel Data mining functions and commands; "DMAddins_SampleData.xlsx"; "DMAddins_SampleData modificato per libro.xlsx"; https://technet.microsoft.com/en-us/library/bb510516.aspx

Module 4. Optional - recommended: Zoho ecosystem (Reports, CRM)
  Software: Zoho
  Related resources and source files: https://www.zoho.com/; https://www.zoho.com/reports/; https://www.zoho.com/reports/help/

Lessons:
1 (Intro 1): Introduction to Visual Analytics, Business Intelligence and Data Mining. Highlights. Subscribe to 4 accounts: Zoho, Tableau Public, Google Site, OneDrive. Slides. Show dashboard.
2 (Intro 2): Build a Tableau dashboard. Self-service BI process. How to develop a dashboard in 3 different environments: Zoho, Excel and Tableau. Course structure. Documentation and text books. Data sources (Sample superstore): https://public.tableau.com/s/resources
3 (Excel 1): Data connections: mybi.mdb, sample superstore, custbase.csv
4 (Excel 2): Small, Large, Rank, CountIf, SumIf, Lookup, VLookup, Match, Index
5 (Excel 3): SumProduct 1-2, Offset 1-2
6 (Excel 4): Offset 3-4, Date, Left, Right, Mid
7 (Excel 5): Table, Pivot table
8 (Excel 6): Sparkline, other techniques, conditional formatting
9 (Excel 7): Speedometer, thermometer
10 (Excel 8): Dashboard in Excel
11 (Excel 9): Dashboard in Excel
12 (Excel 10): Google site and object embedding
13 (Tableau 1): Getting Started, Fundamentals - Highlights
14 (Tableau 2): Connecting to Data
15 (Tableau 3): Visual Analytics
16 (Tableau 4): Mapping
17 (Tableau 5): Dashboards and Stories
18 (Tableau 6): Calculations
19 (Tableau 7): How to - Special charts
20 (Tableau 8): Publishing: Server - Online - Public - Embedding
21 (DM add-in): Link Analysis
22 (DM add-in): Cluster Analysis
23 (DM add-in): Predictive Analysis
24 (Assessment): FAQ and assessment

Business Intelligence and Self service Business Intelligence
• Definitions
• BI Goal
• How To Select A Software Stack
• BI vs Self service BI
• Which Technology
• Which Vendor

BI - Performance measurement system (PMS): a system of measures and indicators that compares goals with current results. It verifies expectations and detects potential deviations, in order to correct and improve business decisions.
Analytical CRM, Data and Text Mining: supports the sales pipeline.
It provides Customer and Marketing Intelligence analysis.

[Diagram: BI architecture layers]
Data layer: ETL process - Staging Area - Data Warehouse - Data Mart / KDD (Knowledge Discovery in Databases)
Analysis layer: traditional query; MDA (OLAP cubes); univariate and multivariate statistical analysis; data and text mining. Query and MDA operations: drill down, drill up, slice and dice
Intelligence layer: Executive information system; CRM; Executive BSC; static report (operational level); dashboard (management level); widgets (executive level); customer and marketing intelligence; Web MKT; EIS - ESS; KPI and alert; customer value; customer satisfaction; DSS - customer analytics; reporting and sharing; risk management; fraud detection; fuzzy expert system

Business Intelligence (BI)
1. BI is a set of processes, technologies, techniques, tools, skills, applications and practices used to provide information and support decision making.
2. Help people see and understand their data (Tableau).
3. BI is getting the right information into the right people's hands in a format that allows them to understand the data quickly.
4. BI is a broad category of applications and technologies for gathering, storing, analyzing and providing data to help enterprise users make better decisions (TechTarget).
5. BI is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data. It is made up of several related activities, including data mining, online analytical processing, querying and reporting (CIO.com).
6. BI is a three-step process: (1) collect consistent data sources, (2) analyse data, and (3) provide knowledge workers with unique, interactive, correct, navigable, real-time results.

PMS, Performance measurement system (see also EIS, ESS): «if you don't measure results, you can't tell success from failure».
PMS is the whole system of measures and indicators that lets the company support and verify the correct implementation of its strategy, and assess whether goals are being met or corrective actions are needed.

KPI (key performance indicator): a business metric used to evaluate factors that are crucial to the success of an organization. A KPI is a measurable value that demonstrates how effectively a company is achieving key business objectives. KPIs differ per organization, industry and department (marketing and sales have different KPIs).

Report: a document providing information, usually organized in tabular form. It is the earliest format for results; in the past the term was often used as a synonym of printout.

Widgets (CFO and executive): also known as gauge chart, dial chart, thermometer, bullet chart or speedometer. A widget often uses needles to show key business information (sales, expired credit…) as a reading on a dial.

Dashboard (cruscotto, cockpit, tableau de bord): an easy, visual information system similar to a car or plane dashboard. "An easy to read, often single page, real-time user interface, showing a graphical presentation of the current status (snapshot) and historical trends of an organization's Key Performance Indicators (KPIs) to enable instantaneous and informed decisions to be made at a glance." Dashboards often support drill-down, slice and dice, pivoting and other interactive actions to navigate through different scenarios. Used by managers, dashboards may be tailored for a specific role and display metrics targeted for a single point of view or department.

Balanced Scorecard: a "4 perspective" approach to identify 15-20 measures to track the implementation of strategy. The original four "perspectives" proposed are: Financial (short term), Customer (middle term), Internal business processes (middle term), Learning and growth (long term).
The Balanced Scorecard "translates an organization's mission and strategy into a comprehensive set of performance measures that provide the framework for a strategic measurement and management system".

BI Goal
• What happened - Reporting
• Why did it happen - Analysis
• What's happening now - Monitoring
• What might happen - Predicting

Who uses what: Executives/Managers; Business Analysts; Middle management; IT / Operational employees

Data, information and knowledge
• Knowledge - Wisdom: information becomes knowledge when it is used to make decisions and take actions accordingly.
• Information: the outcome of data processing; it has a specific meaning in a specific domain.
• Data: entities and transactions stored as simple facts.

What is our commitment?
We have to select and gather transactional data from OLTP systems (DBMS - ERP):
• Flat files
• Relational databases
• Big data
and transform data into information and knowledge to support decisions, starting from business and knowledge needs (goals).

BI is a 3-4 step process:
• DATA LAYER: check data existence, select, collect, gather, clean, preprocess, transform. In brief: prepare data.
• DISCOVER - ANALYZE LAYER: which information do we need? Data analysis and evaluation of results (KPI, report, widget, dashboard).
• DISPLAY - VISUALIZE LAYER: who needs access to results, and where? Data presentation and sharing (local, web server, cloud, embedded).

How To Select A Software Stack
In a typical BI architecture all data from all data sources is pulled into a centralized database through a process called Extract-Transform-Load (ETL). This database is called a data warehouse (DW).
"A Data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."

A data warehouse contains a transformed copy of line-of-business (LOB) data. You load data to your DW occasionally, on a schedule, typically in an overnight job. The DW data is not online, real-time data. It's considered the company's 'single source of truth' by containing a compilation of only clean and accurate data.

A data mart (DM) is a subset of the data warehouse that is oriented to a specific business group or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts usually is limited and only pertains to a single department.

Business analytics software is offered in three different configurations:
• Back-End Software Stacks: provide only back-end functionality such as data storage, transformation and management (i.e., data warehouse and data mart functionality, as well as ETL capabilities).
• Front-End Software Stacks: provide only front-end (end-user facing) functionality, such as data visualization and visual analysis.
• Full Software Stacks: deliver both back-end and front-end functionality.

To determine the best database technology for your business analytics project, you'll need to consider the number of data sources, the need for ETL, the complexity of the data, and the scale and scope of the project. For projects with limited scope that utilize a single data source, a data mart solution is probably your best bet. When your requirements grow to include multiple data sources with terabytes of data, along with your data analytics needs, a data warehouse is the solution to support scale.

What are the general objectives for the solution? These may include some or all of the following: data prep (integration, cleaning, normalization etc.); analysis; visualizations; reporting; and reporting automation.

What are the specific objectives for the solution?
For example, is this to replace an existing tool or tool set? To replace or fix an existing process or set of processes? To create a new process? If it's to "replace," what was working and not working with the previous process(es)? If it's "new," what are the specific bottlenecks and pain points illustrating the need for a new solution?

What is the scale of implementation? Is it small-scale (department-level), or enterprise-level?

Who will be using the software? Who are the users, what are their roles and what are the levels of their technical proficiency?

Who are the customers? Who are the internal (or external) customers, and what are the expectations for deliverables?

How well is the solution supported? Is the solution provider flexible and available to work with the organization to train, implement, customize (if needed) or otherwise support the solution?

When Would You Need the Full Software Stack?
In most cases, implementing only a back-end stack or a front-end stack would not suffice to ensure a real, effective and scalable business analytics solution. In particular, you would need a full stack when:
• There are thousands of tables: finding the appropriate tables and columns you need for a report can be painful using a front-end stack.
• The LOB database has bad naming conventions (e.g. tables named Table1, Table2, and so on, and columns named Column1, Column2).
• Data quality could be low: parts of the data could be missing; other parts could be wrong.
• End users want to access centralized data and maintain a single version of the truth rather than build a solution around CSV/Excel extracts.
• ETL functionality is required, which often happens when multiple data sources are involved and when the data is especially dirty or especially large.
• The data sources cannot be directly queried, either because they are not supported or because they are part of a critical operational system.
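To make the back-end role concrete, here is a minimal ETL sketch in Python using only the standard library. It is an illustration under stated assumptions, not a reference implementation: the source rows, the `RENAME` mapping, the `dwh_sales` table name and the cleaning rules are all hypothetical. It shows the three steps in miniature: extract rows with cryptic column names, transform them (rename, normalize, reject missing or wrong values), and load the clean rows into a small in-memory database standing in for a data warehouse.

```python
import sqlite3

# Hypothetical line-of-business extract with the poor naming described above.
LOB_ROWS = [
    {"Column1": "ACME", "Column2": "2016-09-29", "Column3": "120.50"},
    {"Column1": "acme ", "Column2": "2016-10-03", "Column3": "80"},
    {"Column1": "Globex", "Column2": "2016-10-04", "Column3": "40"},
    {"Column1": "", "Column2": "2016-10-05", "Column3": "15.00"},      # missing customer
    {"Column1": "Globex", "Column2": "2016-10-07", "Column3": "n/a"},  # wrong amount
]

# Assumed mapping from cryptic source names to meaningful warehouse names.
RENAME = {"Column1": "customer", "Column2": "invoice_date", "Column3": "amount_eur"}


def transform(rows):
    """Rename columns, normalize customer names, reject incomplete or invalid rows."""
    clean, rejected = [], []
    for row in rows:
        rec = {RENAME[key]: value.strip() for key, value in row.items()}
        rec["customer"] = rec["customer"].upper()
        try:
            rec["amount_eur"] = float(rec["amount_eur"])
        except ValueError:
            rejected.append(row)  # amount is not a number
            continue
        if not rec["customer"]:
            rejected.append(row)  # customer is missing
            continue
        clean.append(rec)
    return clean, rejected


def load(conn, records):
    """Create the (hypothetical) warehouse table and insert the clean records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dwh_sales "
        "(customer TEXT, invoice_date TEXT, amount_eur REAL)"
    )
    conn.executemany(
        "INSERT INTO dwh_sales VALUES (:customer, :invoice_date, :amount_eur)",
        records,
    )
    conn.commit()


def run_etl():
    """Run extract-transform-load and return revenue per customer plus reject count."""
    conn = sqlite3.connect(":memory:")
    clean, rejected = transform(LOB_ROWS)
    load(conn, clean)
    totals = conn.execute(
        "SELECT customer, SUM(amount_eur) FROM dwh_sales "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return totals, len(rejected)
```

Calling `run_etl()` returns the revenue per customer computed on the cleaned table together with the number of rejected rows. A front-end tool pointed directly at the raw rows would have to repeat this cleaning in every report.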
BI vs Self service BI
Business Intelligence (BI) and Self-service Business Intelligence (SSBI)

Business Intelligence (BI): a traditional (corporate) BI solution is handled by the IT department and typically includes an ETL phase, a staging area and a data warehouse, data modeling (OLAP, cube), a set of queries and a web-based system of reports and dashboards to share results.

Self-service Business Intelligence (SSBI): "Self-service business intelligence (SSBI) is an approach to data analytics that enables business users to access and work with corporate information without the IT department's involvement (except, of course, to set up the data warehouse and data marts underpinning the business intelligence (BI) system and deploy the self-service query and reporting tools)"

Traditional BI - Pros:
• Reduces the complexity of tables and names
• Builds a DW as 'single source of truth', the sole source for decision making
• A DW contains transformed, merged, cleaned or cleansed (more radical), historical data
• Certified data quality and integrity
• Manages a large number of data sources

Traditional BI - Cons:
• It's slow
• It's rigid
• It is expensive, complex, difficult to use, an IT or ERP specific solution
• DW preparation and creating reports are very time consuming
• Can have high costs of implementation and maintenance

Self-service BI - Pros:
• It's user-friendly
• It's intuitive
• It's interactive
• Empowers users by giving them tools that let them design their own reports
• Puts more control in the hands of users
• Easy to use, less IT support, results are quickly delivered

Self-service BI - Cons:
• Self-service BI tools are NOT a replacement for a data mart or data warehouse (they don't have back-end functionality)

Traditional BI: "the product of the analysis of quantitative business data. It is meant to produce insights that allow business managers to make aware and informed decisions, as well as to establish, modify and transform business strategies and processes so as to gain competitive advantage, improve operational performance and profitability and, more generally, achieve the set goals."

Self Service BI: QlikView, Tableau, Zoho

Business Intelligence - Different contexts
• Who decides which and how many reports? Either the IT department decides a set of fixed results, or managers can build their own reports.
• Who shares BI results, and how? Locally, on premise; through a company web server; on cloud, pay as you go.
• How much can I pay? Commercial software starting from 10k/y; free software under GNU license.

On-premise, local, personal solution (Microsoft BI, QlikView, Tableau Desktop)
Data preparation (ETL, modeling and possibly a DWH), exploratory and advanced data analysis (evaluation, processing) to get insights, results delivery (KPI, reports, dashboard, widget) and publishing (sharing) are developed on a stand-alone computer. Results are not delivered through the web, but can be shared as embedded objects (Ex. 2).
Ex. 1: http://morgana.unimore.it/bordoni_stefano/BI2013/Cap4/Dashboard%20libro%20%20con%20codice.xlsm
Ex. 2: https://sites.google.com/site/bompaniste/

Web-based solution (Pentaho, SAP-BO, Tableau Server)
Data preparation (possibly a DWH), analysis and results are produced, delivered and shared through a web server. It requires a data server, a web server and BI server software.
Ex.: http://155.185.65.51:8080/pentaho

Cloud solution -> Software as a service (Tableau Online)
Data preparation (DWH) must be developed internally, while data evaluation and visualization can be obtained on premise or through SaaS resources on cloud.
Delivery and sharing of results are always on cloud.
Ex.: http://public.tableausoftware.com/views/dashDemo/Dashboardfinale?amp;:embed=y&:display_count=no
https://public.tableausoftware.com/views/gettingstart/SalesDashboardGettingstart?:embed=y&:display_count=no

Cloud solution -> Software as a service (Zoho)
Data preparation (DWH) must be developed internally, while data processing, visualization and sharing are developed on cloud through SaaS resources. When an internal DWH is available, any required hardware/software capability can be bought "on demand" and "pay as you go" from the SaaS provider.
Ex.: https://reports.zoho.com/ZDBDataSheetView.cc?DBID=4000000037680

Hype cycle for business intelligence

Which Vendor
Gartner is a private, undisputed authority on IT industry rating.
https://public.tableau.com/views/GartnerMagicQuadrant/GartnerMagicQuadrant?:showVizHome=no

COMPARISON CRITERIA
• Integration: BI infrastructure; metadata management; development tools; collaboration
• Information delivery: reporting; dashboards; ad hoc query; search-based; mobile BI
• Analysis: online analytical processing (OLAP); interactive visualization; predictive modeling and data mining; scorecards

3 environments:
• Tableau: local, server, cloud, embedded
• Microsoft: local, server, cloud, embedded
• Zoho: cloud, embedded

Which Technology
Data Mart - Self service BI technologies
• OLAP (Online Analytical Processing): an OLAP cube's main purpose is to enable fast multidimensional slicing and dicing. OLAP achieves fast performance by precalculating metrics (field aggregations) for all sets and subsets of unique values in all dimensions (fields) 'over-night'. This avoids performing these slow operations in real time during the workday.
Storing the results of these pre-calculations takes exponentially more storage resources than the actual raw data does, limiting the size of raw data that can make up a cube to gigabyte scale.
• IMDB (In-Memory Databases): similar to OLAP, IMDB's primary purpose is fast performance. It achieves fast performance by loading the entire data mart into RAM, thus avoiding slow disk reads ("I/O bottlenecks"). The size of the data mart is effectively limited by the size of RAM, which today is limited to gigabytes in size.

BI typical techniques based on OLAP - Cubes - Pivoting
• Slice and dice: filtering data into smaller parts, and repeating this process until arriving at the right level of detail for analysis.
• Drill down: move deeper into a chain of data, from high-level information to more detailed information; move downward through a data hierarchy.
• Drill through: move horizontally through a database, showing the underlying data which produced a certain outcome.

'One source, one truth' vs 'The western world is being measured using Excel'

Advantages:
• Good quality: Excel is a good, low-cost software to develop BI solutions.
• Tool familiarity: Excel is the western world's calculator. No time or cost for training and support. Users can easily help each other.
• Built-in flexibility: analysis can be done or changed as and when needed.
• Rapid development: no IT load or support required, and less time needed to develop.
• Hands-off application: when linked to a data source, the elements of the worksheet refresh automatically and offer a real-time solution.
• Little to no incremental costs: an application has no additional license fee or training costs.

Disadvantages:
• Excel isn't scalable: it's difficult to distribute, manage and control an Excel application for a high number of users.
• Excel produces multiple versions of the truth: it's not a problem of Excel itself, but personal copies of worksheets may be changed and manipulated.
• Hidden errors: poorly constructed spreadsheets often contain errors.
• Spreadsheets spawn spreadsheets: any additional user of a spreadsheet tends to change it. As copies spawn further copies, they spread throughout the organization without control or documentation.
• Lack of referential integrity: it's very hard to understand where the data come from, when changes were made and what rules lie behind formulas.

[Examples (figures): widget showing revenue; sales dashboard; mobile report; BI widgets; professional data analysis solution (Kyros); original Self Service BI solution with Excel 2010]

4 fundamental requirements:
• consistency and timeliness in the ETL phase
• construction of a staging area, DWH and data marts
• correct, up-to-date, intuitive, interactive and navigable results
• separation of data, processing, presentation and distribution

SSBI process in Excel

Data modeling
The ETL phase (Extract, Transform and Load) and data modeling are handled by the command GET External DATA or through Power Pivot, which creates a permanent connection to the data source and a data model.
Both Excel and PowerPivot use ROLAP (Relational On Line Analytical Processing) technology to model a cube from a dataset or a relational database (DBMS).

Data analysis
It can be carried out with simple formulas, pivot tables and charts, or with the new set of Power BI tools (Power Query, Map, View, Pivot, KPI), which pull source data from the staging area.

Data presentation
Visuals are the end result of any business intelligence undertaking. The outcome of data analysis is presented through reports, tables, pivot tables, charts, dashboards, widgets or a mix of visualizations to provide consistent information and insights.

Publishing and sharing
The product's client user interface for the BI consumer role can be local (on the desktop) or web-based. There are many possible solutions to publish and share results on a hosted web service. E.g. Power BI is a business analytics service which fits into the cloud-based computing model of Software-as-a-Service (SaaS).

Definition of layout, measures and metrics
SALES DASHBOARD 2013

Revenue analysis
• What are the revenue results of the last two years?
• What trends does the monthly sales pattern show?
• What trends does a speedometer chart comparing the revenue results of the last two years show?
• What trends does a thermometer chart comparing the revenue results of the last two years show?

Customer ranking
• What are the revenue figures of the top/bottom 5 customers?

Product analysis
• What are the revenue figures of the top/bottom 10 products?

Geographic analysis
• How are revenue and quantities sold distributed by geographic area?

Salesman analysis
• What are the sales figures of a chosen salesman for a given area?
• What does the sales chart of a chosen salesman show compared with the average trend?
• What does the comparison between the sales charts of the last two years by macro area show?
• Who are the top 5 salesmen by revenue?
Which data do we need?

Results required by management:
1. Customer ranking - Best/Worst (Top/Bottom) 5
2. Customer rating (Gold, Silver, Bronze)
3. Sales per customer
4. Sales analysis by salesman
5. Trend and benchmark
6. Sales analysis by sales area
7. Sales analysis by product - Top 10

Which variables are needed to build the results?
[Matrix: variables (rows) by results (columns); an X marks each variable needed for a result. Columns: customer ranking, customer rating, sales per customer, trend and benchmark, sales analysis by area, sales analysis by salesman, sales by product]

In which tables are the variables needed to build the results stored?
Variable name: Table
1) CUSTOMER_DESCRIPTION: DWH_CUSTOMER
2) INVOICE_DATE: DWH_SALE_ORDER_LINE
3) INVOICED_QUANTITY: DWH_SALE_ORDER_LINE
4) INVOICED_AMOUNT_EUR: DWH_SALE_ORDER_LINE
5) MACRO_AREA: DWH_SALE_ORDER_LINE
6) SALE_AREA: DWH_SALE_ORDER_LINE
7) SALESMAN_DESCRIPTION: DWH_SALESMAN
8) ITEM_CATEGORY_DESCRIPTION: DWH_SALE_ORDER_LINE

[Dashboard panels: sales status; trend and benchmark; customer ranking Top/Bottom 5; top 10 products; sales per customer; sales analysis by area; sales by salesman]

Geomarketing intelligence uses geolocation (geographic information) in the planning and implementation of marketing activities:
• Where do my customers live, and where do they buy?
• Who are my competitors?
• Visualize any data in a geographic context by linking it to a digital map
• Where should I locate my next point of sale?
• Creation of sales territories
• Where are my sales representatives? Driving time zones, salesman performance analysis
• Are products sold better in particular zones? Where and why?
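Going back to the sales dashboard: the Top/Bottom customer ranking among the management results can be derived from just two of the listed variables, CUSTOMER_DESCRIPTION and INVOICED_AMOUNT_EUR. The following is a minimal sketch in plain Python, assuming the invoice lines have already been joined into one record per line. The sample rows, the function names and the tie-breaking rule (alphabetical by customer) are illustrative assumptions, not part of the course material.

```python
from collections import defaultdict

# Hypothetical invoice lines; only the field names come from the variable list.
INVOICE_LINES = [
    {"CUSTOMER_DESCRIPTION": "Alfa", "INVOICED_AMOUNT_EUR": 1200.0},
    {"CUSTOMER_DESCRIPTION": "Beta", "INVOICED_AMOUNT_EUR": 900.0},
    {"CUSTOMER_DESCRIPTION": "Alfa", "INVOICED_AMOUNT_EUR": 300.0},
    {"CUSTOMER_DESCRIPTION": "Gamma", "INVOICED_AMOUNT_EUR": 450.0},
    {"CUSTOMER_DESCRIPTION": "Delta", "INVOICED_AMOUNT_EUR": 100.0},
    {"CUSTOMER_DESCRIPTION": "Gamma", "INVOICED_AMOUNT_EUR": 50.0},
]


def revenue_by_customer(lines):
    """Sum the invoiced amount per customer."""
    totals = defaultdict(float)
    for line in lines:
        totals[line["CUSTOMER_DESCRIPTION"]] += line["INVOICED_AMOUNT_EUR"]
    return dict(totals)


def top_bottom(totals, n=5):
    """Return (top n, bottom n) customers as (name, revenue) pairs.

    The top list is ordered best first, the bottom list worst first.
    Ties are broken alphabetically by customer name (an arbitrary choice).
    """
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    top = ranked[:n]
    bottom = ranked[-n:][::-1]
    return top, bottom


totals = revenue_by_customer(INVOICE_LINES)
top2, bottom2 = top_bottom(totals, n=2)
```

With fewer than 2n customers the two lists overlap, which a real dashboard would probably want to handle explicitly. The same grouping pattern applies to the Top 10 products (grouping on ITEM_CATEGORY_DESCRIPTION) and the top 5 salesmen (grouping on SALESMAN_DESCRIPTION).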
Data mining – Analytical CRM Process of extracting previously unknown, valid, actionable information from large database and then using that knowledge to make crucial business decisions Customer Relationship Management (CRM) is a business strategy that optimizes revenue and profitability while promoting customer satisfaction and loyalty. CRM technologies enable strategy, and identify and manage customer relationships, in person or virtually. CRM software provides functionality to companies in four segments: sales, marketing, customer service and digital commerce (Gartner) was coined in the 1990s and stands for: “managing the relationship with your customer”. With CRM, you can store customer and prospect contact information, accounts, leads and sales opportunities in one central location, ideally in the cloud so the information is accessible by many, in real time. Today it is used to describe IT systems and software designed to help you manage and increase the profitability of customers over their entire relationship with the business (Customer’s Lifetime Value ) is a “customer centric” business strategy that includes relationship, range of products and services, software and IT tools to store and analyse data KDD (knowledge discovery in databases) Problem solving space (business objectives determination) Data preparation Data selection Data preprocessing (Cleaning data, invalid, noise-outliers, missing value, skew distribution) Data transformation Data mining Analysis of results (post processing) Assimilation of knowledge (call to action) KDD process Inductive analysis (Data and Text mining) 1. Link Analysis 2. Cluster Analysis - Segmentation 3. Predictive Modeling - Classification 4. Data Dimensionality Reduction Discovering the associations between the products or services that customers tend to purchase together. In particular: Detect the most associated product with all the others (loss leader) Detect the best consequent for any antecedent (e.g. 
in a center aisle box) Detect the best antecedent (shirt?) for each consequent (tie) to locate the best placement in the store (e.g. shelves) or in a web e-commerce page Detect products to create a bundle Detect association in phone tapping Detect crime and fraud pattern DATA and TEXT MINING Inductive Approach Discovering the associations between the products or services that customers tend to purchase in a sequence over time How to manage stock with a proper inventory policy (sequences of sales) Detect promotions to offer in sequence (sequence of products or services) How to check that the users of a Web site reaches a page following the desired path (clickstream analysis) Partition a database into segment of similar, homogeneous records. Detect n ideal customer profiles Detect the impact of a marketing campiagn towards lead, prospect or potential customer Detect migration from a segment to another Analyze an existing database to determine regular occurence between data and results Churn analysis (detect customers who leave the company), who buy what, who do specific actions (join the club… ) Who reply a mail for a better targeted campaign and higher redemption Analysis to determine the meaning, purpose, or effect of any type of communication, by studying and evaluating the textual content of: A market survey An advertising campaign Level of satisfaction of employees practice of examining large collections of written resources in order to generate new information (text mining) Descriptive statistical methods that generate graphical displays from data matrices Competitive Product Positioning Map – perceptual map Brand map (words or phrases are most associated with different suppliers) Associatian discovery and Sequence pattern discovery Cluster Analysis (static and dynamic) Predictive Analysis (Classificazione) Content Analysis And Text mining Correspondence analysis -Factor analysis Link Analysis Cluster Analysis Tree Classification NO Link analysis: associations 
discovery (A priori, Agrewal 1994) Association Discovery An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item A found in the data. A consequent is an item B that is found in combination with the antecedent. MBA discovers the probability (confidence) of the co-occurrence of the two (or more) items (product A and B) within the same event (basket with A), the frequency of this combination over total transactions (rule support) and the eventual increased probability to sell B, once sold A (lift) Sequential pattern discovery discover all significant sequential patterns (e.g. in products selling) Link analysis: associations discovery MBA metrics (KPI) Item Support: is an indication of how frequently the items A appear on total transaction (The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules) supportA NN N supportA B Association Support N Association confidence: frequency of co-selling of A->B on N the total transactions (basket) with A confidence A B A A B A B NA Lift: leverage effect on B sales probability once sold A lift A B confidence A B supportB Association Discovery: MBA Discover the associations between the products that customers tend to purchase together The association rule: Stout Lemonade 1. Lemonade is present (sold) once every two selling (basket) of stout beer (confidence) 2. Probability to sell lemonade (lift) raises 2.7 times 3. 
Co-selling appears in 3% of total transactions (rule support)

[Figure: item counts out of 267 transactions. Mineral water: 267 × 11.6% = 31; Antifreeze: 267 × 3.75% = 10; Soap A: 267 × 5.24% = 14; a 3% rule support corresponds to 8 transactions]

Sequential pattern discovery
Clickstream analysis: explores the route that visitors choose when clicking or navigating through a website.

Cluster Analysis (Segmentation)
Cluster analysis is the task of grouping a set of objects so that objects in the same group (called a cluster) are very similar to one another and very different from the objects in other clusters.
Customer segmentation divides the customer base into distinct and internally homogeneous groups, such as low-value customers, high-value customers, and low-value customers with high spending potential.
– Objective / variables
– Clustering method (hierarchical / non-hierarchical)
– Distance measures (e.g. Euclidean distance for quantitative variables; Jaccard index or Condorcet for qualitative variables)
– Technique (k-means, demographic, Kohonen’s maps, EM (Expectation Maximization))
– Cluster profiling

Segmentation Variables
Demographic: one of the most commonly used forms of segmentation, as it is one of the best ways to differentiate individuals. Variables:
– Gender
– Age
– Income
– Family status, size, life cycle
– Education
– Occupation
– Ethnicity
– Life stage
– Nationality
– Socioeconomic ranking

Segmentation Variables
Psychographic: takes into account people's personality, lifestyle, activities, interests and opinions, as well as the psychological aspects of consumer buying behavior. Psychographic data used for segmentation purposes typically come from research instruments such as surveys, polls, and queries.
Main variables are:
– Values (ideals, achievements, religion…)
– Attitude (primary motivation, self-expression)
– Lifestyle (social life, hobbies…)
– Activities (job, sport, travel…)
– Interests (family, home, media…)
– Opinions (politics, culture, money…)

Segmentation Variables
Customer type (buyer, shopping behaviour): explores the economic features of customers
– Loyalty to supplier
– Usage rate / buying pattern
– Order size
– Buying on occasions
– Benefits sought
– Type of customer (lead, qualified lead, prospect, potential, active, loyal customer)
– Type of buyer (ABC class, average ticket, quantity, classes of purchased goods, regular, occasional)
– Attitude towards product, channel
– Profitability

Segmentation Variables
Website navigation (web mining, web analytics): customer profiling through log files and navigation style
– Recency: the time elapsed since the last purchase
– Frequency: the number of purchases made over a defined time period
– Monetary: the monetary value of the purchases
– Time spent on the site (stickiness)
– Sequence and number of pages visited (clickstream analysis)
– Access link (banner or direct access)
– Value of past purchases (or of a visiting session)
– Purchase yes/no

Desirability of a segmentation model
Not all segmentations obtained are meaningful. A desirable segmentation is:
Measurable: The characteristics of the segments can be measured and identified
Substantial: The segments should be large and profitable enough, and worth investing in.
They should also be feasible to reach with a tailored marketing program
Accessible: The segments should be meaningful to the company and able to be effectively reached and served
Differentiable: Each segment needs to be distinguishable from the others and to respond differently to a marketing program
Actionable: Effective marketing programs can be carried out for the selected segments, given the objectives and resources of the company

Segmentation goals
Static: one-time (occasional) modeling that you don’t run on a regular basis
• Ideal profiles (personas) are based on data points and need to be built only once. A static segmentation is useful to create profiles, run one-off campaigns, identify tribes, and target sales support
Dynamic: changes in segment structure and composition (it can identify temporal patterns in customers’ characteristics and purchasing behavior, e.g. emerging, disappearing and changing segments)
Dynamic: changes in the segment membership of individual customers (segment migration analysis), to identify patterns and movements among segments of leads, prospects or high-potential target customers

Predictive Modeling (Classification tree)
• If the target (output, dependent) variable is categorical, we use a classification tree, Kohonen maps or logistic regression; if it is quantitative, a neural network, linear regression or a radial basis function
• The aim of classification is to unveil the relationship between explanatory (independent) and target (dependent) variables and to predict future events by analyzing historical and current data.
The model ranks the importance of the independent variables in explaining and predicting the value of the target variable
• A classification model is trained on training data with labeled classes, searching for relationships between the explanatory variables and the target variable
• A decision tree works by dividing data into smaller sets, recursively applying two-way and/or multi-way splits to find the best variable for the partition at each split

Classification tree

Income Range   Life insurance   Credit card insurance   Gender   Age
40-50          No               No                      M        45
30-40          Yes              No                      F        40
40-50          No               No                      M        42
30-40          Yes              Yes                     M        43
50-60          Yes              No                      F        38
20-30          No               No                      F        55
30-40          Yes              Yes                     M        35
20-30          No               No                      M        27
30-40          No               No                      M        43
30-40          Yes              No                      F        41
40-50          Yes              No                      F        43
20-30          Yes              No                      M        29
50-60          Yes              No                      F        39
40-50          No               No                      M        55
20-30          Yes              Yes                     F        19

Best explanatory variable: Income Range

Income Range           20-30   30-40   40-50   50-60   Total
Life insurance = Yes   2       4       1       2       9
Life insurance = No    2       1       3       0       6
Prediction             Yes     Yes     No      Yes     15

Correct predictions (precision, purity): 11 out of 15 = 73%
Branches = 4
Information Gain for Attribute = 11/15/4 = 0.183

Best explanatory variable: Credit card insurance

Credit card insurance  Yes     No
Life insurance = Yes   3       6
Life insurance = No    0       6
Prediction             Yes     No

Correct predictions (precision, purity): 9 out of 15 = 60%
Branches = 2
Information Gain for Attribute = 9/15/2 = 0.300

Best explanatory variable: Age

Age                    <=43    >43
Life insurance = Yes   9       0
Life insurance = No    3       3
Prediction             Yes     No

Correct predictions (precision, purity): 12 out of 15 = 80%
Branches = 2
Information Gain for Attribute = 12/15/2 = 0.400

Full tree

Age
  >43: No (3/0)
  <=43: Gender
    Female: Yes (6/0)
    Male: Credit card ins.
      No: No (4/1)
      Yes: Yes (2/0)

Number of records correctly classified = 14 out of 15

Classification (confusion) matrix

                                                         Predicted Yes   Predicted No   Total actual
Yes (people who actually bought the life insurance)      8               1              9
No (people who actually didn’t buy the life insurance)   0               6              6
Total predicted                                          8               7              15

• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease (also known as a "Type I error").
• False negatives (FN): we predicted no, but they actually do have the disease.

Number of records correctly classified = 14 out of 15
Number of false positives = 0
Number of false negatives = 1
Overall accuracy of the model: 93%
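
The figures above can be checked with a short script. What follows is only a minimal sketch in Python, not part of the course material: it hard-codes the 15 records of the life insurance example, applies the full tree (Age, then Gender, then Credit card insurance) and rebuilds the confusion matrix and the accuracy. It also recomputes the simplified per-attribute score used in the slides (share of correct predictions divided by number of branches). The names `RECORDS`, `predict`, `confusion_matrix`, `accuracy` and `split_score` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# The 15 training records of the example:
# (income_range, life_insurance, credit_card_insurance, gender, age)
RECORDS = [
    ("40-50", "No", "No", "M", 45), ("30-40", "Yes", "No", "F", 40),
    ("40-50", "No", "No", "M", 42), ("30-40", "Yes", "Yes", "M", 43),
    ("50-60", "Yes", "No", "F", 38), ("20-30", "No", "No", "F", 55),
    ("30-40", "Yes", "Yes", "M", 35), ("20-30", "No", "No", "M", 27),
    ("30-40", "No", "No", "M", 43), ("30-40", "Yes", "No", "F", 41),
    ("40-50", "Yes", "No", "F", 43), ("20-30", "Yes", "No", "M", 29),
    ("50-60", "Yes", "No", "F", 39), ("40-50", "No", "No", "M", 55),
    ("20-30", "Yes", "Yes", "F", 19),
]


def predict(credit_card_ins, gender, age):
    """Full tree of the example: Age, then Gender, then Credit card insurance."""
    if age > 43:
        return "No"
    if gender == "F":
        return "Yes"
    return "Yes" if credit_card_ins == "Yes" else "No"


def confusion_matrix(records):
    """Count TP, FN, FP, TN with 'Yes' (bought life insurance) as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for _income, actual, cc, gender, age in records:
        predicted = predict(cc, gender, age)
        if actual == "Yes":
            counts["TP" if predicted == "Yes" else "FN"] += 1
        else:
            counts["FP" if predicted == "Yes" else "TN"] += 1
    return counts


def accuracy(counts):
    """Share of records classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


def split_score(records, branch_of):
    """Simplified score from the slides: correct share / number of branches.

    Each branch predicts its majority class, so the number of correct
    predictions in a branch is the size of its largest class.
    """
    branches = defaultdict(Counter)
    for rec in records:
        branches[branch_of(rec)][rec[1]] += 1
    correct = sum(max(c.values()) for c in branches.values())
    return correct / len(records) / len(branches)


if __name__ == "__main__":
    counts = confusion_matrix(RECORDS)
    print(counts, f"accuracy = {accuracy(counts):.0%}")
    print("Income Range:", round(split_score(RECORDS, lambda r: r[0]), 3))
    print("Credit card insurance:", round(split_score(RECORDS, lambda r: r[2]), 3))
    print("Age (>43):", round(split_score(RECORDS, lambda r: r[4] > 43), 3))
```

Under these assumptions the script gives 8 true positives, 1 false negative, 0 false positives and 6 true negatives, i.e. 14 records out of 15 correctly classified, and ranks Age as the best first split, ahead of Credit card insurance and Income Range.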