Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data-Driven Business Intelligence Systems: Part I Week 5 Dr. Jocelyn San Pedro School of Information Management & Systems Monash University IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 Lecture Outline Data-driven BIS Data warehouse Data warehouse architectures Entity-Relationship Modelling Multi-dimensional Modelling Star Schema An Example: Retail Trading IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 2 Learning Objectives At the end of this lecture, the students will Have better understanding of concepts, tools and technology underlying data-driven business intelligence systems Have knowledge of multidimensional modelling and star schema for data modelling for data warehouses IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 3 Data-Driven Business Intelligence Systems IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 4 Data-Driven BIS Data-driven BIS information systems that provide BI through access and manipulation of large databases of structured data includes tools for “drill down” for more detailed information “drill up” for broader, more summarised view “slice and dice” for a change in data dimensions IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 5 Data-Driven BIS IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 6 Data-Driven BIS IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 7 Data-Driven BIS “Slicing” the $225,764 cube $201,196 $162,504 $182,500 $123,033 $72,528 $116,963 $75,048 $68,792 P r o d u c t Peacock Leverling Fuller Davolio Callahan Suyama King Dodsworth Buchanan Time IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 8 Data-Driven BIS “Dicing” the cube $225,764 $201,196 $162,504 $182,500 $123,033 $72,528 $116,963 $75,048 $68,792 Q3 $23,181 Q2 $16,035 Q1 $6,858 Suyaman King Dodsworth Buchanan $22,719 Peacock Leverling Fuller Davolio Callahan Q4 IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 9 Data Warehouse IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 10 Data Warehouse A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision making process – Bill Inmon (1995) Subject-oriented: focus is on subjects related to business or organisational activity like customers, employees, suppliers (instead of applications-oriented (finance, marketing, production) Integrated: data from various databases is stored in a consistent format through use of naming conventions, domain constraints, physical attributes and measurements Time-variant: associating data with specific points in time Nonvolatile: data does not change once it is in the data warehouse and stored in data warehouse IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 11 Data Warehouse Data warehouse is a copy of transaction data specifically structured for query and analysis – Ralph Kimball (1996) Data warehouse is a specific database designed and populated to provide decision support in an organisation - Gray and Watson (1998) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 12 Data Warehouse Data warehousing emerged as result of improvements in database technology – relational data model and relational database management systems (DBMS) advances in computer hardware - emergence of affordable mass storage and parallel computer architectures emergence of end-user computing, facilitated by powerful, intuitive computer interfaces and tools advances in middleware products that enable enterprise database connectivity across heterogeneous platforms IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 13 Data Warehouse triggered by recognition of fundamental differences between operational (or production) systems and informational (or decision support) systems Operational system – system that is used to run a business in real time, based on current data – e.g. sales order processing, reservation systems, patient registration Informational systems – designed to support decision making based on stable point-in-time or historical data; for complex read-only queries or data mining applications – e.g. sales trend analysis, customer segmentation, human resources planning IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 14 Data Warehouse Comparison of Operational and informational Systems – McFadden, Hoffer and Prescott 1999 Characteristic Operational Informational Primary purpose Run the business on a current basis Support managerial decision making Type of data Current representation state of the business Historical or point-in(snapshots) Primary users Clerks, salespersons, administrators Managers, business analysts, customers Scope of usage Narrow vs. simple and queries Broad vs. complex and analysis Design goal performance Ease of access and use IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 15 Data Warehouse Architectures Generic two-level architecture Source (file) Source database) Transformation and Integration Data warehouse Source (database) Source (database) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 16 Data Warehouse Architectures Three-level architecture Source (file) Source (database) Source Transformation and Integration Enterprise Data warehouse Data mart database) Source (database) Selection and aggregation Data mart IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 17 Data Warehouse Architectures Data mart a data warehouse that is limited in scope contains selected and summarised data to support specific decision support applications of specific enduser group e.g., marketing data mart, finance data mart IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 18 Data Warehouse Architectures Three-layer data architecture Derived data Data mart Reconciled data Enterprise data warehouse Enterprise data model Operational data Data mart metada EDW metadata Operational metadata Operational systems IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 19 Data Warehouse Architectures Enterprise data model Presents a total picture explaining the data required by an organisation Must be developed prior to designing a data warehouse Entity-Relationship Models – traditional approach in relational database design Multidimensional Models – are commonly used in data warehouses and data marts for faster retrieval for querying and analysis IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 20 Data Warehouse Architectures Operational Data current or transient, not historical restricted in scope to a particular application poor quality not normalised (there are multi-valued attributes or repeating groups, partial dependencies, transitive dependencies in data relations) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 21 Data Warehouse Architectures Sample Operational Data from Northwind database IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 22 Data Warehouse Architectures Reconciled Data Detailed - rather than summarised Historical – snapshots, periodic Comprehensive – should reflect enterprise-wide perspective; conform to enterprise data model Quality controlled Normalised – 3NF or higher 3NF – no multi-valued attributes, no partial dependencies, no transitive dependencies IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 23 Data Warehouse Architectures Steps in Normalisation Table with multi-valued attributes Remove multivalued attributes 1st Normal Form Remove partial dependencies 2nd Normal Form Remove transitive dependencies 3rd Normal Form IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 24 Data Warehouse Architectures Sales relation with sample data Relation in 3NF Salesperson Cust_ID Name Salesperson Region Cust_ID IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 Name Region Salesperson 25 Data Warehouse Architectures Derived Data selected, formatted, aggregated provides ease of use for decision support applications provides fast response for user queries supports ad-hoc queries and data mining applications data model commonly used is star schema IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 26 Data Warehouse Architectures Metadata data that describe the properties or characteristics of other data Operational metadata – describe the data in various operational systems (as well as external data) that feed the EDW EDW metadata – describe the reconciled data layer as well as the rules for transforming operational data to reconciled data Data mart metadata – describe the data in derived data layer and rules of transforming reconciled data to derived data IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 27 Data Warehouse Architectures Sample data description IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 28 Data Warehouse Architectures Sample data description IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 29 Data Warehouse Architectures Data Reconciliation Process Stage 1: Initial load, when EDW is first created Stage 2: Subsequent updates Steps in Data Reconciliation Process Capture – extract relevant data from source/s IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 30 Data Warehouse Architectures Scrub – clean or upgrade the quality of raw data before transformation and loading (using pattern recognition, artificial intelligence techniques) Track and correct errors: misspelled names, erroneous birthdates, missing data; inconsistent data formats IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 31 Data Warehouse Architectures Transform - includes converting data format or representation from source to target system partitioning data according to predefined criteria aggregating data from detailed to summary level IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 32 Data Warehouse Architectures Load and Index Refresh mode – filling the EDW by bulk rewriting of target data Update mode – only changes in source data are written to the data warehouse; at periodic intervals, data warehouse is rewritten, replacing previous contents without overwriting or deleting previous contents Create necessary indexes IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 33 Entity-Relationship Modelling IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 34 Entity-Relationship Modelling Customer Type groups Customer within contains makes Product Type groups Product in Sale Region located at Store within Period (based on Kimball (1996), p29, and Simsion-Bowles (1996), p2) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 35 Entity-Relationship Modelling Entities, attributes and relationships Rules of normalisation 3NF is typical Protection of integrity of database by avoiding anomalies Every logical thing is represented only once Separate consideration of logical and physical aspects IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 36 Entity-Relationship Modelling ER Model for the Northwind sample database IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 37 Entity-Relationship Modelling Large numbers of tables Oracle Financials - 1,800; SAP 7 up to 8,000 Commonly used Feels natural once you get used to it Research shows that they are not easily understood by IT people Especially concepts like abstraction, generalisation, sub-types, etc. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 38 Multi-dimensional Modelling IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 39 Multi-dimensional Modelling It is possible to conceptualise data as multidimensional Difficult to design Easy to use resulting reports Advocated by Ralph Kimball (see his manifesto, and a rebuttal, available on the web site). A logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 40 Multi-dimensional Modelling An approach to database design that provides an easy to understand and navigate database The aim is to encourage understanding, exploration and learning Each number in a database has a set of associated attributes What it measures, what point of time it was created, what location its from, what product it’s associated with, what promotion, etc. This makes the number meaningful. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 41 Multi-dimensional Modelling Each attribute associated with each number represents a dimension Measure, time, location, product, location, etc. Resulting views are easy to navigate and move around Slice and dice Report template IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 42 Multi-dimensional Modelling Example Widget Sales ($Million) One Dimension (State): 43.6 53.4 31.4 27.5 28.3 14.7 Vic NSW QLD WA SA TAS State Two Dimensions (location x time): 48.2 53.4 31.4 28.4 25.1 15.4 2002 50.1 57.2 33.6 28.1 22.5 16.3 2001 56.3 62.3 35.1 29.4 21.5 13.3 2000 46.2 52.1 29.6 25.1 27.1 18.2 1999 43.6 53.4 31.4 27.5 28.3 14.7 1998 Vic NSW QLD WA SA TAS Year State IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 43 Multi-dimensional Modelling Three Dimensions (location x time x product): Year 2002 48.2 53.4 31.4 28.4 25.1 15.4 2001 50.1 57.2 33.6 28.1 22.5 16.3 2000 56.3 62.3 35.1 29.4 21.5 13.3 1999 46.2 52.1 29.6 25.1 27.1 18.2 1998 43.6 53.4 31.4 27.5 28.3 14.7 Vic NSW QLD WA SA TAS State IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 44 Multi-dimensional Modelling Usually talk about information spaces as cubes, or hyper-cubes, or n-cubes Resulting views of databases are easy to navigate and move around Slicing and dicing Report Template IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 45 Multi-dimensional Modelling Slicing and Dicing Select certain dimension values to examine a set of data: IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 46 Multi-dimensional Modelling Report Templates One template is produced for a set of slices Data changes, layout doesn’t Location Drop Down Box Product Sales: Victoria, 2001 50 Year Drop Down Box 40 30 20 10 0 Widgets Sprockets Flanges IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 Gaskets 47 Multi-dimensional Modelling From Traditional Relational to Multi-dimensional Typical relational data-base From Pilot Software OLAP White Paper Same data displayed in twodimensions Easy! (The key is to identify the continuous and discrete variables in the flat file.) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 48 Star Schema IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 49 Star Schema Used to implement dimensional analysis using standard relational database technology Very common in data warehousing Many variations Two components: Fact Table – contains measurements of business, eg. sales, purchase order, shipment Dimension Tables – stores the textual descriptions of the dimensions of the business, eg. product, customer, vendor, store. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 50 Star Schema Fact tables store the hard data Dimension tables store all the information about our dimensions. The fact table has a many-to-one relationship with each dimension table Each dimension table has a primary key that appears as a foreign key in the fact table, whose primary key is a concatenation of all of the foreign keys. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 51 Star Schema Dimension tables in star schemas are denormalised resulting in: Fewer tables Simpler for users to navigate Reduced number of complex multi-join tables. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 52 Star schema Customer Customer key Name Customer type Product Product key Product type weight Sale Time key Store key Customer key Product key Dollar sales Unit sales Store Store key Address Region Time Time key Day Month IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 Legend: Primary Key Foreign key 53 Snowflake schema “Do not snowflake your dimensions, even if very large. If you do snowflake your dimensions, prepare to live with poor performance” Kimball (1996) Product Type Product Customer Type Customer Sale Store Region Time IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 54 Star Schema Dimensions can be shared amongst fact tables. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 55 Star Schema ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use Allow relational databases to support multidimensional data cubes IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 56 Star Schema Steps in the design process 1. 2. 3. 4. 5. Choose a business process Choose the grain of the fact table Too fine > Oversized database Too large > Loss of meaningful information Choose the dimensions Choose the measured facts (usually numeric, additive quantities) Complete the dimension tables Kimball (1996) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 57 Extra steps in the design process 6. 7. 8. 9. Determine strategy for slowly changing dimensions Create aggregations and other physical storage components Determine the historical duration of the database Determine the urgency with which the data is to be extracted and loaded into the data warehouse. Kimball (1996) IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 58 An Example: Retail Trading A large grocery store with approx. 500 stores Each store has approx. 60,000 products on shelves Need to maximise profit and keep shelves stocked Important decisions concern pricing and promotion Promotion types are: Temporary price reductions Newspaper advertisements Shelf and end-aisle displays Coupons IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 59 An Example: Retail Trading 1. Choose a Business Process Daily Item Movement 2. Choose the grain of the fact table Stock Keeping Unit (SKU) by store by promotion by day 3. Choose the Dimensions Time, product, store and promotion IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 60 An Example: Retail Trading Retail Trading Dimensions Promotion Promotion key Other Promotion attributes Sale Product Product key Other product attributes Time key Product key Store Key Promotion Key Facts – to be detailed next Store Store key Other Store Attributes Time Time key Other Time Attributes IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 61 An Example: Retail Trading 4. Choose the measured facts Promotion Promotion key Other Promotion attributes Sale Product Product key Other product attributes Time key Product key Store Key Promotion Key Dollar Sales Unit Sales Dollar Costs Customer Count Store Store key Other Store Attributes Time Time key Other Time Attributes IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 62 An Example: Retail Trading 5. Complete the dimension tables Promotion Promotion key Other Promotion attributes Product Product key SKU Description SKU Number Package Size Brand Sub Category Department Package Type Diet Type Weight Weight unit of measure Units per retail case Units per ship case Cases per pallet Sale Time key Product key Store Key Promotion Key Dollar Sales Unit Sales Dollar Costs Customer Count Store Store key Other Store Attributes Time Time key Other Time Attributes IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 63 References Inmon, W. H. (1996) Building the Data Warehouse (2nd ed), Wiley, NY. Kimball, R. (1996) The Data Warehouse Toolkit, Wiley, NY. McFadden, F., Hoffer, J. and Prescott, M. (1999) Modern Database Management, Addison-Wesley. IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 64 Questions? [email protected] School of Information Management and Systems, Monash University T1.28, T Block, Caulfield Campus 9903 2735 IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004 65