Business Intelligence Workshop, Helia, May 2008
Introduction to DW, OLAP, and DM using SQL Server
Fritz Laux, Reutlingen University, DBTechNet
(c) 2008, Fritz Laux, Reutlingen University

Data Warehousing (DW), Online Analytical Processing (OLAP), Data Mining (DM)

Topics
1. Introduction to BI and CPM
2. ETL Process
3. DW Modeling
4. OLAP
5. Data Mining

Critical Questions About an Enterprise
- Are we on the right way? Yes, we are!
- How about our competitors? Ahead of us!
- Economic trends? Turbulence!
[Cartoon: © Prof. Dipl.-Kfm. A. Roth]

Where Do We Get the Knowledge From?
- About the enterprise
  - From the company's operational information systems
- About the market and competitors
  - From the census bureau
  - From public statistical data
- About economic trends
  - From financial and economic publications
- "How you gather, manage, and use information will determine whether you win or lose." (Bill Gates, Business @ The Speed of Thought, 1999)
- So, where is the problem?

Definition and Problems to Solve in Business Intelligence
- Definition: Business Intelligence (BI) refers to processes and technologies that use fact-based systems to analyze a business
- BI needs to deal with:
  1. Information overload
  2. Missing knowledge
  3. We do not know which are the right questions
  4. We do not know the influencing factors and their impact
  5. Key measures or indicators to steer an enterprise are missing

Information Pyramid
- "We're drowning in information and starving for knowledge." (Rutherford D. Rogers, Yale, 1985)
- Aspects
  - IT view
  - Business view
  - Market view
[Figure: information pyramid with the layers Operational Systems (OLTP), Data Warehouse, DSS, EIS, OLAP, and Data Mining; side labels: IT View, Business View, Market View; axes: amount of information, growing knowledge]

Motivation
- What is the goal of my organization?
- How do we affect the market?
- How do we perform?
[Cartoon: © Prof. Dipl.-Kfm. A. Roth]

Motivation
- Business Intelligence as a critical success factor
- Purpose: support business decision making

Corporate Performance Management (CPM)
- How can we steer an enterprise?
[Figure: CPM control cycle with the steps start, set goals, plan, execute, monitor, analyze, re-plan. Idea from MIK AG: http://www.mik.info]
- BI tools provide the means to steer an enterprise by
  - measuring the effect of decisions,
  - analyzing the performance, and
  - comparing it with the goals
- Definition: CPM is the framework for steering an enterprise by means of Business Intelligence

How Can We Measure Corporate Performance?
- Through Key Performance Indicators (KPIs)
  - Definition: a KPI is a metric to define and measure the state of, and the progress towards, an organization's set of goals
  - Usually high-level, relative values
- Examples
  - Customer KPIs
    - Customer satisfaction
    - Customer attrition (loss)
  - Manufacturing KPIs
    - Overall Equipment Effectiveness: OEE = Availability * Performance * Quality
  - Financial KPIs
    - Profit Margin: PM = Net Income / Sales
    - Return on Investment: ROI = Turnover * Earnings / Sales = …

Return on Investment (ROI)
- Financial KPIs have "natural" metrics
[Figure. Source: Fred Nickols, 2000, originally by Johnson and Kaplan]
- But how about soft factor metrics?

Soft Factor Metric
- Example: customer satisfaction
  - General satisfaction
  - Specific satisfaction: quality/price of product, speed of delivery, …
    - How do we compare these?
- Search for a mapping of categorical values to ordinal values
  - Totally satisfied (ts) → 9
  - Partially satisfied (ps) → 3
- Meaning of the metric
  - ts = 3 * ps? … No! But ts is-better-than ps.
  - Are two metrics comparable? … No! But we do weighted comparisons.

Motivation
- Why can't we use our OLTP system?
  - Missing information
    - Need for integration of economic and census data
    - Need for soft factors to assess an enterprise
  - Missing KPIs and steering parameters
    - Need for highly significant KPIs and parameters
  - Influencing factors and different perspectives not available
    - Need for multidimensional analysis and presentation
Source: One Hundred & Eighty Degrees Systems Limited, 2004

Motivation
- Why can't we use our OLTP system?
  - Queries return only explicit information
    - Select customer, sum(sales) from Orders where Region … Group by …
    - We don't know what to ask!
    - Need for interactive, explorative analysis
  - Inappropriate presentation of information
    - Tabular presentation → one-dimensional analysis (Sales ok? Trend ok? Reason?)
    - We can't see the problem!
    - Need for multidimensional analysis and presentation

Management Cockpit
- The CPM paradise
[Figure. Source: Juergen Daum, New Economy Analyst Report, 2004]
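
The soft factor mapping above (ts → 9, ps → 3) can be sketched in a few lines. This is a minimal illustration, not part of the workshop material: the aspect names, the survey answers, and the weights are hypothetical. The scores are used only for ordering and for a weighted comparison, in line with the slide's remark that "ts = 3 * ps" is not a valid reading.

```python
# Ordinal scores from the slide; only their order is meaningful.
SATISFACTION_SCORE = {"ps": 3, "ts": 9}  # partially / totally satisfied


def is_better(a: str, b: str) -> bool:
    """True if category a ranks above category b on the ordinal scale."""
    return SATISFACTION_SCORE[a] > SATISFACTION_SCORE[b]


def weighted_score(ratings: dict, weights: dict) -> float:
    """Weighted mean of the ordinal scores of several satisfaction aspects."""
    total_weight = sum(weights[aspect] for aspect in ratings)
    weighted_sum = sum(
        weights[aspect] * SATISFACTION_SCORE[category]
        for aspect, category in ratings.items()
    )
    return weighted_sum / total_weight


ratings = {"quality": "ts", "delivery": "ps"}  # hypothetical survey answers
weights = {"quality": 2.0, "delivery": 1.0}    # hypothetical weights

print(is_better("ts", "ps"))
print(weighted_score(ratings, weights))
```

The weighted score is only useful for comparing customers or periods rated with the same weights; it does not turn the ordinal scale into a ratio scale.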
Source: SAP Whitepaper, SAP SEM / CPM, http://help.sap.com/

The Business Intelligence Process
[Figure: BI process. Data sources (xls, DBS, stats, WWW) → ETL → Data Warehouse → Cubes, Data Marts (dimensions Product, Time, Region) → Analysis (OLAP, Data Mining). Phases: design, build up]

Extraction, Transformation, Loading
[Figure: BI process diagram, repeated for the ETL section]

Data Sources
- Technical data sources supported by SQL Server Integration Services (SSIS)
- General sources
  - Time
  - Geography
- OLTP
  - Master data
  - Transaction data
- Planning
  - Planned turnover
  - Profit, etc.
- Economic data
  - Business sector data
  - Economic forecast

Extract and Transform
- Select: which data are needed?
- Cleanse: where are the user data?
- Convert: do all facts have the same unit, coding, and granularity?
- Harmonize: do we have synonyms and homonyms?
- Adjust: grouping, classification?
- Correct: are the data correct?
- Amend: are the data complete?

Extract and Transform Example
- Select: e.g. http://.../consumptionPerCapita/coffee.html
- Cleanse: e.g. strip off HTML tags
- Convert: e.g. convert consumption into kg
- Harmonize: e.g. import with consumption?
- Adjust: e.g. region grouping
- Correct: e.g. incorrect value for D 1989
- Amend: e.g. for NL 1988

<table border="1" … width="21%">
  <tr> <td width="58%">Country</td> <td width="45%">1987</td> </tr>
  <tr> <td width="58%">Finland</td> <td width="45%">12,04</td> </tr>
</table>

Country   1987      1988      1989
Finland   12,04     ?         11,68
Sweden    11,64     11,71     11,08
Norway    20,13 lb  20,81 lb  18,19 lb
Denmark   11        10,65     10,2
Benelux   19,65     20,48     19,89
Austria   7,75      8,17      8,01
Germany   7,38      8,17      0,827

Hands-on Lab: Integration Services (SSIS)
1. Open SQL Server Business Intelligence Studio
2. Create new project
3. Select

Hands-on Lab: Integration Services (SSIS)
1. Define connection managers for data sources and destinations
2. Design a data flow from source to destination
3. Build a control flow
[Screenshot labels: source, destination]

Hands-on Lab: Integration Services (SSIS)
- Graphically design control and data flow
- Example 1: loop control, data and error flow
[Screenshot labels: control loop, text file data source, data flow, error flow]
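
The Cleanse and Convert steps of the coffee example can be sketched outside SSIS. This is a minimal Python sketch under stated assumptions, not the workshop's SSIS package: the helper names `cells` and `to_kg` are hypothetical, the HTML sample is a shortened version of the snippet on the slide, and "?" is treated as a missing value that the Amend step still has to fill.

```python
import re
from typing import Optional

LB_TO_KG = 0.45359237  # definition of the avoirdupois pound in kilograms


def cells(html: str) -> list:
    """Cleanse: return the text of every <td> cell, dropping the markup."""
    found = re.findall(r"<td[^>]*>(.*?)</td>", html, flags=re.S)
    return [text.strip() for text in found]


def to_kg(raw: str) -> Optional[float]:
    """Convert: decimal comma to float, pounds to kg; '?' means missing."""
    raw = raw.strip()
    if raw == "?":
        return None
    in_lb = raw.endswith("lb")
    number = raw[:-2].strip() if in_lb else raw
    value = float(number.replace(",", "."))
    return round(value * LB_TO_KG, 2) if in_lb else value


html = (
    '<table border="1"><tr><td width="58%">Country</td>'
    '<td width="45%">1987</td></tr>'
    '<tr><td width="58%">Finland</td><td width="45%">12,04</td></tr></table>'
)
print(cells(html))
print(to_kg("12,04"), to_kg("20,13 lb"), to_kg("?"))
```

A regular expression is enough for this flat sample table; nested or malformed HTML would call for a real parser.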

Hands-on Lab: Integration Services (SSIS)
- Example 2: ETL control flow design and a data flow that takes date entries from sales and purchase orders to build the date dimension
[Screenshot labels: start of control flow, Excel data source, data transformation, destination DW, end of control flow]

Data Warehouse Modeling
[Figure: BI process diagram, repeated for the DW modeling section]

Data Warehouse
- Definition: "A data warehouse is a
  - subject-oriented,
  - integrated,
  - time-variant,
  - nonvolatile
  collection of data in support of management's decision-making process." William H. "Bill" Inmon (1996)

Data Warehouse
- Properties
  - Subject-oriented
    - Data is selected and organized to support business analysis
    - Optimized for query and analysis
    - Objects (facts) and their determining factors (dimensions) are linked together
    - Not intended to support OLTP
  - Time-variant
    - Accumulates historical data over time
  - Non-volatile (archival)
    - Data is read-only; it is never updated, only added
    - May have redundancies
    - Contains pre-calculated aggregations
  - Integrated
    - Contains data from different sources (OLTP systems, economic databases, etc.)

5.3 Dimensional Fact Model
- Properties
  - Multidimensional model
  - Distinction between fact (measures) and dimension
  - Structural dimensions
  - Attributes of dimensions
  - Computed values
[Figure: dimensional fact model for the fact Sales. Legend: fact, measure, computed value, dimension, dimension attribute. Labels: amount, onStock, value, weight, Type; Year, Month, Week; Product, Prod.group; annotation "average, semi-additive"]

2.1a Taxonomy of Facts
- Fact
  - numerical
    - additive
    - semi-additive
  - categorical
    - ordinal
    - nominal
  - temporal

DW Schemes
- Star: one fact table, multiple dimension tables
- Galaxy: multiple fact tables, multiple dimension tables
- Snowflake: dimension tables normalized, fact tables aggregated
- All three schemes are relational models in disguise

Example Star Scheme
- SSAS Source View
[Screenshot labels: dimension table, fact table]

Example Galaxy Scheme
- SSAS Source View
[Screenshot labels: joint dimension table, fact tables]

Example Snowflake Scheme
- SSAS Source View
[Screenshot labels: normalized product dimension, aggregated fact table]
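
A star scheme (one fact table, multiple dimension tables) can be tried out without SQL Server. The sketch below uses Python's built-in sqlite3 module as a stand-in; all table names, column names, and rows are hypothetical and only loosely follow the Sales example of the dimensional fact model.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables describe the determining factors of a fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, prod_group TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     INTEGER,
    value      REAL
);

INSERT INTO dim_product VALUES (1, 'C220', 'car'), (2, 'E240', 'car');
INSERT INTO dim_time    VALUES (1, 2008, 4), (2, 2008, 5);
INSERT INTO fact_sales  VALUES (1, 1, 3, 90.0), (2, 1, 1, 45.0), (1, 2, 2, 60.0);
""")

# Aggregate the additive measures along product group and year.
rows = con.execute("""
    SELECT p.prod_group, t.year, SUM(f.amount), SUM(f.value)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time t    ON t.time_id = f.time_id
    GROUP BY p.prod_group, t.year
""").fetchall()
print(rows)
```

Normalizing `dim_product` into separate product and product-group tables would turn this star into a snowflake; adding a second fact table that reuses `dim_time` would make it a galaxy.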
[Screenshot label: fact table]

Design Rules for DW Scheme
- Use Star if
  - Dimensions have few or dynamic attributes
  - Measures are orthogonal
- Use Snowflake if
  - Dimensions are structured (aggregation)
  - Measures are orthogonal
- Use Galaxy if
  - Dimensions are reused
  - Measures are not orthogonal

Hands-on Lab: SQL Server Management Studio
1. Start the SQL Server Management Studio
2. Create a new database
3. Add a new database diagram
4. Create tables
5. Define foreign keys
[Screenshot labels: enter table definition; manage keys, relationships; drag and drop columns to define foreign keys]

Modeling Cubes, OLAP
[Figure: BI process diagram, repeated for the OLAP section]

5.2 Cube Model
- Multidimensional view of the data warehouse
  - Dimensions correspond to coordinates
  - Structured dimensions
  - Facts are a function of multiple dimensions
- Fact: sales = f(product, country, time)
[Figure: cube with the axes product, country, and time; product hierarchy labels: vehicle, car, truck, C220, E240]
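
The cube view, sales = f(product, country, time), can be illustrated with a dictionary keyed by coordinates. This is a minimal sketch with hypothetical numbers; the function names `roll_up` and `slice_cube` are illustrative only, and the one-level product hierarchy (C220 and E240 are cars) is an assumption read off the figure labels.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, country, year) -> sales
cube = {
    ("C220", "FI", 2007): 10,
    ("C220", "DE", 2007): 25,
    ("E240", "DE", 2007): 15,
    ("E240", "DE", 2008): 20,
}
PARENT = {"C220": "car", "E240": "car"}  # assumed product hierarchy


def roll_up(cells, keep):
    """Sum sales over every dimension whose index is not listed in keep."""
    result = defaultdict(int)
    for coords, sales in cells.items():
        result[tuple(coords[i] for i in keep)] += sales
    return dict(result)


def slice_cube(cells, axis, value):
    """Slice: fix one dimension to a single value."""
    return {c: s for c, s in cells.items() if c[axis] == value}


by_product = roll_up(cube, keep=(0,))

# Roll up one more level along the structured product dimension.
by_group = defaultdict(int)
for (product,), sales in by_product.items():
    by_group[PARENT[product]] += sales
by_group = dict(by_group)

de_by_year = roll_up(slice_cube(cube, axis=1, value="DE"), keep=(2,))
print(by_product, by_group, de_by_year)
```

Drill-down is the inverse of `roll_up`: going back from the aggregated view to the cells with more dimensions kept.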

5.4 Object-Oriented Model
- Object-oriented view of the data warehouse
  - "Intelligent" dimensions and facts:
    - Meta-information for dimensions and facts
  - Example:
    - The Product dimension has hierarchical aggregation
    - costs can be compared with earnings, but not with noOfOrders
  - The object-oriented structure allows semantically correct navigation and aggregation
[Figure: class diagram with the labels Hierarchy, level, child, Product, #Orders, price, Timespan, start, end, Month, days]

MS Visualization of a Hypercube
- Relational view of the OLAP cube structure
[Screenshot]

MS Visualization of a Hypercube
- Pivot table view of the OLAP data
  - Drag and drop measures and dimensions onto the pivot table
[Screenshot]

OLAP Storage Models
- MOLAP: multidimensional (md) storage
  - Single cube → one large md array with sparse data
  - Multi-cube → galaxy-structured md arrays
  - Stores the md array in a linear address space
  - Optimized OLAP for small cubes
- ROLAP: relational storage
  - Stores facts and dimensions in tables
  - Stores aggregations in tables
  - Best choice for very large cubes
- HOLAP: hybrid storage
  - Stores facts and dimensions in tables
  - Stores aggregations as md arrays
  - Best performance for large cubes

Hands-on Lab: SSAS Cube Design
- Start SQL Server Business Intelligence Studio
- Create a new SSAS project
- Add Data Source and View, and create a new cube
- Identify fact and dimension tables

Hands-on Lab: SSAS Cube Design
- Select measures
- Define dimensions and aggregation hierarchies
- Save the cube definition

Hands-on Lab: SSAS Cube Design
- Select the storage model and its parameters
- Process and deploy the cube

Hands-on Lab: Performing OLAP
- Drill down, roll up
- Slice and dice
- Drill through

Data Mining
[Figure: BI process diagram, repeated for the data mining section]

Decision Tree Classification
- Goal: mapping/prediction of objects to predefined classes based on their attribute values
- Process:
  1. Build a decision tree DT (classification model) with the help of sample objects (training data)
  2. Validate the DT (e.g. precision) with test data
  3. Classify unknown objects
[Figure: decision tree with tests on car type (= truck / ≠ truck) and age (> 60 / ≤ 60) and the leaves Risk = low, Risk = low, Risk = high]

Regression Tree
- Goal: prediction of a numeric value for objects based on a DT with linear regression functions on the leaf level
- Process:
  1. Build a DT with the help of training data
  2. Replace some branches by a linear regression formula
  3. Generate prediction values, tune regression parameters
  4. Testing (like DT)
  5. Prediction (like DT)
[Figure: regression tree with tests on car type (= truck / ≠ truck) and insurance class (< III, [IV..VI], > VI) and the leaf formulas Price = 20k€ + 2k€*weight, Price = 10k€ + 3k€*class, Price = 20k€ + 4k€*class + 10€*HP, Price = 30k€ + 6k€*class]

SSAS Decision Tree Viewer
[Screenshot]

SSAS Dependency Network
[Screenshot]

SSAS Decision Tree Prediction
[Screenshot]
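
Steps 2 and 3 of the decision tree process (validate with test data, then classify unknown objects) can be sketched with a hand-coded tree. The branch layout below is an assumption, since the flattened figure does not show which branch leads to which leaf: trucks are taken as low risk, other cars as low risk above age 60 and high risk otherwise. The test objects are hypothetical, and the quality measure is plain accuracy (share of correct predictions) rather than a per-class precision.

```python
def classify(obj: dict) -> str:
    """Walk the risk tree; the branch-to-leaf assignment is an assumption."""
    if obj["car_type"] == "truck":
        return "low"
    return "low" if obj["age"] > 60 else "high"


# Hypothetical labeled test objects for the validation step.
test_data = [
    ({"car_type": "truck", "age": 30}, "low"),
    ({"car_type": "sedan", "age": 70}, "low"),
    ({"car_type": "sedan", "age": 40}, "high"),
    ({"car_type": "sedan", "age": 65}, "high"),  # the tree misclassifies this one
]

correct = sum(classify(obj) == label for obj, label in test_data)
accuracy = correct / len(test_data)
print(accuracy)

# Classification of an unknown object.
print(classify({"car_type": "sedan", "age": 25}))
```

A regression tree would use the same walk but return a number computed from a leaf formula such as Price = 20k€ + 2k€*weight instead of a class label.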

Clustering Basics
- Clustering (grouping) := arrangement of objects into groups such that
  - objects in the same cluster are most "similar"
  - objects from different clusters are most "dissimilar"
- Types of clustering
  - Partitioning clusters (an object o1 belongs to only one cluster)
  - Hierarchical clusters (nested clusters)
- Distance function d:
  - d(o1, o2) ≥ 0; d(o1, o2) = 0 ⇔ o1 = o2; d(o1, o2) = d(o2, o1)
- Similarity of o1 and o2 is defined via the distance function
  - The smaller the distance, the more alike the objects are
- Goal function
  - Maximize the compactness of the clusters
  - Compactness of a cluster C := |C| / Σ_{oi ∈ C} d(oi, c), where c = center of C

K-Means Based Clustering (1/2)
- Algorithm:
  1. Choose k cluster centers (centroids)
  2. Assign each object to its nearest centroid
  3. Recalculate the cluster centers (centroids)
[Figure: example with k = 2; objects 1 to 7 and the centroids a and b; caption: initial centroids c and d]

K-Means Based Clustering (2/2)
- Algorithm:
  1. Choose k cluster centers (centroids)
  2. Assign each object to its nearest centroid
  3. Recalculate the cluster centers (centroids)
- Repeat steps 2-3 until the centroids stabilize
[Figure: example with k = 2; objects 1 to 7; the centroids a and b move to a* and b*; caption: initial centroids c and d]
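
The three k-means steps and the compactness measure can be written out directly. This is a plain-Python sketch with hypothetical 2-D points and hand-picked initial centroids; it is not the SSAS implementation, and it uses the Euclidean distance as one possible choice for the distance function d.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute, repeat until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its centroid.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


def compactness(cluster, center):
    """|C| divided by the sum of the distances of all members to the center."""
    return len(cluster) / sum(math.dist(o, center) for o in cluster)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (9, 9)]  # hypothetical
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(compactness(clusters[1], centroids[1]))
```

The result depends on the initial centroids; different starting points can converge to different partitions.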

SSAS Clustering
- Implements K-Means and EM clustering
- Both are partitioning algorithms
  - K-Means is distance based
  - EM is probability based
  - Scalable means: a single data scan only

SSAS Cluster Viewer
[Screenshot]

MS Cluster Profile Viewer
[Screenshot]

SSAS Cluster Characteristics
[Screenshot]

SSAS Lift Chart
- Lift = %ofCorrectPredictions / %ofPopulation
[Screenshot]

Association Rules
- Example (basket analysis)
  - Available items I = {Bread, Coffee, Milk, Cake, Butter, Tea}
  - Support of X = {Coffee, Milk}
    - Support(X) = 3/6 = 50%
  - Support of R = X ∪ {Cake}, i.e. support of the rule "Milk, Coffee ⇒ Cake"
    - Support(R) = 2/6 = 33%
  - Confidence of the rule:
    - Confidence("Milk, Coffee ⇒ Cake") = Support(R) / Support(X) = 2/3 = 67%

Transaction set T
t   bought items
1   Bread, Coffee, Milk, Cake
2   Coffee, Milk, Cake
3   Bread, Butter, Coffee, Milk
4   Milk, Cake
5   Bread, Cake
6   Bread
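
The support and confidence figures of the basket example can be recomputed from the transaction set. This is a minimal sketch; the function names `support` and `confidence` are illustrative and not an SSAS API.

```python
# Transaction set T from the basket analysis example.
transactions = [
    {"Bread", "Coffee", "Milk", "Cake"},
    {"Coffee", "Milk", "Cake"},
    {"Bread", "Butter", "Coffee", "Milk"},
    {"Milk", "Cake"},
    {"Bread", "Cake"},
    {"Bread"},
]


def support(itemset: set) -> float:
    """Share of transactions that contain every item of the itemset."""
    hits = sum(itemset <= t for t in transactions)
    return hits / len(transactions)


def confidence(lhs: set, rhs: set) -> float:
    """Support of the whole rule divided by the support of its left-hand side."""
    return support(lhs | rhs) / support(lhs)


x = {"Coffee", "Milk"}
print(f"Support(X) = {support(x):.0%}")
print(f"Support(R) = {support(x | {'Cake'}):.0%}")
print(f"Confidence = {confidence(x, {'Cake'}):.0%}")
```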

SSAS Item Sets Viewer
[Screenshot]

[Screenshot]
- Probability = Confidence
- Importance

Key Performance Indicators (KPI)
- Idea: measure the performance of an enterprise with simple numbers such as return on investment (ROI), profit, and capital turnover
- ROI := Earnings / Investments
- Profit := Revenue - Costs
- Capital turnover := Sales / Investments

SSAS Key Performance Indicators (KPI)
- KPI = f(measures, goal)
  - Measures are compared with a goal function
  - A KPI is normally analyzed over time
[Screenshot labels: define new KPI; drag measure to value or goal expression]

Time Series
- Definition: a time series (TS) is a sequence of numbers ordered in time at equidistant intervals
  - The ordering is relevant (i.e. successive numbers are not independent)
- Additive TS model
  - y(t) := Trend(t) + Season(t) + R(t), t ∈ {1, 2, 3, …}
  - Trend is monotonic (linear or non-linear)
  - Season is periodic (sine or other)
  - R(t) is a random value
[Figure: time series plotted over time]

SSAS Autoregressive Tree Models for Time-Series Analysis
- Definition: let y = (y_1, y_2, …, y_t) be a time series TS. The model for TS is called autoregressive if, for all p < τ ≤ t, the probability distribution of y_τ depends, as a linear regression, on the previous p values y_{τ-1}, …, y_{τ-p}.
- Definition: an autoregressive tree model is a piecewise linear autoregressive model whose boundaries are defined by a decision tree.
[Figure: autoregressive tree with the tests y_{τ-1} < a and y_{τ-1} > b (branches true / false) and the leaves P(y_t) = N(m_1, σ_1²), P(y_t) = N(m_2, σ_2²), P(y_t) = N(m_3, σ_3²); a plot of the series over t marks the thresholds a and b]

MS Time Series
- Uses regression tree
[Screenshot]
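
The additive model y(t) := Trend(t) + Season(t) + R(t) can be made concrete by generating a synthetic series. This is a sketch under stated assumptions: a linear trend, a sine season, and Gaussian noise, with all parameter values hypothetical. It illustrates the model only and is not the SSAS time series algorithm.

```python
import math
import random


def additive_series(n, slope=0.5, amplitude=3.0, period=12, noise=1.0, seed=0):
    """Return y(1..n) with y(t) = Trend(t) + Season(t) + R(t)."""
    rng = random.Random(seed)
    series = []
    for t in range(1, n + 1):
        trend = slope * t                                        # monotonic, linear
        season = amplitude * math.sin(2 * math.pi * t / period)  # periodic
        residual = rng.gauss(0.0, noise)                         # random value R(t)
        series.append(trend + season + residual)
    return series


# Without noise, values one period apart differ only by the trend,
# i.e. by slope * period.
y = additive_series(24, noise=0.0)
print(round(y[12] - y[0], 6))
```

Because successive values share the trend and the season, they are not independent, which is the property an autoregressive model exploits when it predicts y_τ from the previous p values.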