Introduction to Data Warehousing
Peter O’Donnell
DSS Lab, Monash University

Overview
■ What is a data warehouse?
■ What makes it so different?
   – Managers as clients
   – Architecture
■ Dimensional Modelling
   – Compared to traditional data modelling
   – Facts and dimensions
   – OLAP

What is a data warehouse?
“Subject oriented, integrated, time variant, non-volatile collection of data in support of management decision making”
Inmon
“The basic data warehouse architecture interposes between end-user desktops and production data sources a warehouse that we usually think of as a single, large system maintaining an approximation of an enterprise data model.”
Demarest

Data Warehouses
■ A set of databases created to provide information to decision makers
■ Supports the access, understanding and analysis of data by decision makers
■ Provides the “data infrastructure” for management support systems (eg. DSS and EIS)
■ Most of the effort is in data extraction, transformation and load activities

Another view...
“Data warehousing is a process not a product. ... The data warehousing process can be broken down into 4 phases:
   Assemble data systematically
   Transform the data, correct errors and form a consistent view
   Distribute the data where needed
   Furnish high speed tools of choice
Data warehousing provides a means for the useful storage of historical information allowing the user wider scope on which to base decision support information.”
The Butler Group

What’s so different about data warehouses?
■ Compared to operational systems (OLTP):
   – Managers as clients
      • What managers supposedly do
      • The reality
   – Architecture

Managers as clients
■ Discretionary and demanding clients
■ Chauffeured
■ Fragmentation, brevity and variety
■ Uncertain tasks
■ Urgency
■ Organisationally powerful

What’s in it for managers?
■ Fast access to data
■ Views of the organisation they have never had before
■ Exception reports (data mining agents)
■ Infrastructure for EIS
■ Infrastructure for DSS

What’s really in it for managers?
■ Beware “technocratic utopianism”
■ Maybe nothing at all!
■ Ackoff (1967) revisited:
   – MIS are based on the following false assumptions:
      • More information is better
      • Managers don’t have the information they need
      • Managers need the information they want
      • Managers don’t have to understand a system to use it
■ http://images.lib.monash.edu.au/ims3001/04103275.pdf

Operational Systems Environment
■ OLTP Systems tend to be:
   – Unintegrated
   – Unsynchronised
   – Complex
   – Update-oriented
   – Dirty data

Data Warehouse Environment
■ OLAP Systems (explained later) tend to be:
   – Subject oriented
   – Integrated
   – Time Variant
   – Non-Volatile
(Inmon & Hackathorn, 1994)

Goals of data warehouse architecture
■ Architectural goals (Demarest, 1994):
   – To protect production systems from query drain
   – To provide a traditional, highly manageable data oriented environment for DSS
      • To separate data management and query processing issues from end-user access issues
   – To enable data from different systems to be brought together in a logical unified fashion

Data Warehouse Architecture
[Diagram components: Internal Legacy Systems, External Data Sources, Special Purpose Data, Data Warehouse, Query System, Executive Information System, Decision Support System, EIS Client, EIS Client, DSS Client]

Research at Monash - What We Know (The Benefits)
■ The major benefits of data warehousing we have noted are:
   – Better data management
   – Better access to data
   – Better decision making
   – A reduction in the cost associated with the production of ad hoc reports
■ IT professionals involved consider the investment to be very worthwhile.

What we know: architecture
■ Organisations are using existing technologies for their data warehouse
   – As a result the traditional vendors have a strong presence in the market (eg. IBM, Sun, Oracle etc.)
■ Client / Server architectures are dominant
   – However many organisations are running their data warehouse on the same platform as their OLTP systems.

What we know: project scale
■ The majority of projects are not enterprise-wide in scale (data marts rather than data warehouses)
■ A small number of systems cost many millions of dollars but around $500,000 is typical (but proportional to the authority of the sponsor!)
■ A small number of users (~10) is common
■ The development team usually consists of 24 people

Where is the technology heading?
■ Architecture
   – Web enablement
■ Project scale
   – More large projects
   – More users

Issues facing developers
■ Shortage of skilled people
■ Vendor support in Australia
■ Increasing expectations of users
■ Internet
■ Evolutionary development
■ Data quality (!!)

Some fundamentals (Ackoff again)
■ Don’t ask what people want
■ Managers don’t need more information
■ Find out what people need
■ Use the [warehouse] to provide better information
Ackoff - 1967

Evolutionary development
■ Users’ understanding of business is shaped by the information they have
■ System is developed to suit their understanding of the business
■ System provides better information
■ Users’ understanding of business is changed
■ System must change, ...

Data Warehouse Modelling
■ Aims
   – Easily understood
   – Extendable
   – Stable
   – Good performance for queries and reports
■ ER or Star Schema or both?

ER Schema (Simple)
[Diagram: entities Customer Type, Customer, Region, Product Type, Product, Sale, Store, Period; relationship labels: groups, within, contains, makes, in, located at]
(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)

Traditional ER Approach to design
■ Entities and relationships
■ Rules of normalisation
   – 3NF is typical
   – Protection of integrity of database by avoiding anomalies
   – Every logical thing is represented only once
■ Separate consideration of logical and physical design

Traditional Database Design
■ Large numbers of tables
   – Oracle Financials - 1,800; SAP 7 up to 8,000
■ Commonly used
   – Feels natural once you get used to it
■ Research shows that they are not easily understood by IT people
   – Especially concepts like abstraction, generalisation, sub-types, etc.

Multi-Dimensional Models
■ It is possible to conceptualise data as multidimensional
■ Difficult to design
■ Easy to use resulting reports

So what is this dimensional stuff anyway?
■ An approach to database design that provides an easy to understand and navigate database
   – The aim is to encourage understanding, exploration and learning
■ Each number has a set of associated attributes
   – What it measures, what point of time it was created, what location it’s from, what product it’s associated with, what promotion, etc.

Multi-dimensionality
■ Usually talk about information spaces as cubes or hyper cubes or n-cubes
■ Each attribute associated with each number represents a dimension
   – Measure, time, location, product, etc.
■ Resulting views are easy to navigate and move around
   – Slice and dice
   – Report template

From Traditional Relational to Multi-dimensional
[Figure: a typical relational database table, and the same data displayed in two dimensions. From Pilot Software OLAP White Paper]
Easy! (The key is to identify the continuous and discrete variables in the flat file.)

From a Spreadsheet to a Multidimensional report
■ Typical spreadsheet model
■ Two Dimensional?

Lurking Dimensions
■ What about 1997?
■ What about other states?
Other dimensions are implicit. Year and State?
Spot the design choices! (Time and Region)

What is OLAP?
■ On-Line analytical processing
■ Term was popularised by Codd in 1993
   – 12 OLAP rules defining a standard by which to assess products
   – Nothing new - most products already complied
■ OLAP Council
■ Client/Server
■ Multi-dimensional view of data

OLAP and ROLAP
■ Many OLAP tools have their own way of storing data (MDDB)
■ Some make it look like the data is in a cube but actually query a relational database (ROLAP)
   – ‘How?’ you might ask!

Star Schema
■ Used to implement dimensional analysis using relational database technology
■ Very common in data warehouses
■ Fact table
   – Many variations
   – Additive and non-additive facts
■ Dimension tables
   – Become constraints (WHERE part of SQL)

Star schema (with attributes)
   Sale:      Time key, Store key, Customer key, Product key, Dollar sales, Unit sales
   Customer:  Customer key, Name, Customer type
   Product:   Product key, Product type, Weight
   Store:     Store key, Address, Region
   Time:      Time key, Day, Month

Snowflake schema
[Diagram: tables Customer Type, Customer, Product Type, Product, Sale, Store, Region, Time]

Conversion from ER to Star
■ “Event remembered” or “transaction” entity types become fact tables
   – SALE
   – SHIPMENT
   – CLAIM
■ “Master” entity types become dimension tables
   – CUSTOMER
   – PRODUCT
   – LOCATION

Uses of ER and Star Schemas
■ ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse
■ Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use
   – Allow relational databases to support multidimensional data cubes

Dimensions
■ Star schema might (typically) have 10-15 dimensions
■ Individual user views of the warehouse might include 6-7 of these
■ Typical systems (eg an EIS) might have 20 different views and 4-5 different base fact tables
■ Dimension tables can be related to a large number of facts

Steps in the design process
1. Choose a business process
2. Choose the grain of the fact table
      Too fine > Oversized database
      Too large > Loss of meaningful information
3. Choose the dimensions
4. Choose the measured facts (usually numeric, additive quantities)
5. Complete the dimension tables
Kimball (1996)

Extra steps in the design process
6. Determine strategy for slowly changing dimensions
7. Create aggregations and other physical storage components
8. Determine the historical duration of the database
9. Determine the urgency with which the data is to be extracted and loaded into the data warehouse.
Kimball (1996)

That’s it from me!
■ Check the web
■ Useful links:
   – www.sims.monash.edu.au/dsslab
   – www.rkimball.com
   – www.dwassit.com
   – www.olap.org
■ Stuff to read
   – Anything by Ralph Kimball, Bill Inmon, lots of others
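
Worked example: a star schema queried ROLAP-style
The star schema slides above say that dimension tables "become constraints (WHERE part of SQL)" and that facts are usually numeric and additive. The sketch below is one possible illustration of that idea, not the implementation used in the lecture. It assumes Python's standard sqlite3 module with an in-memory database. The column names follow the "Star schema (with attributes)" slide, but the table names (customer_dim, product_dim, store_dim, time_dim, sale_fact), the function names and every sample row are invented for the example.

```python
import sqlite3


def build_star_schema(conn):
    """Create four dimension tables and one fact table (names are hypothetical)."""
    conn.executescript("""
        CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY,
                                   name TEXT, customer_type TEXT);
        CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY,
                                   product_type TEXT, weight REAL);
        CREATE TABLE store_dim    (store_key INTEGER PRIMARY KEY,
                                   address TEXT, region TEXT);
        CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY,
                                   day INTEGER, month TEXT);
        -- Grain assumed here: one row per customer, product, store and day.
        CREATE TABLE sale_fact (
            time_key     INTEGER REFERENCES time_dim(time_key),
            store_key    INTEGER REFERENCES store_dim(store_key),
            customer_key INTEGER REFERENCES customer_dim(customer_key),
            product_key  INTEGER REFERENCES product_dim(product_key),
            dollar_sales REAL,     -- additive fact
            unit_sales   INTEGER   -- additive fact
        );
    """)


def load_sample_rows(conn):
    """Insert a handful of made-up rows so the queries have something to return."""
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)",
                     [(1, "Acme", "Wholesale"), (2, "J. Smith", "Retail")])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                     [(1, "Beverage", 0.5), (2, "Snack", 0.2)])
    conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                     [(1, "1 Main St", "Victoria"), (2, "9 High St", "NSW")])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                     [(1, 1, "Jan"), (2, 1, "Feb")])
    conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 1, 100.0, 10),
                      (1, 2, 2, 1, 50.0, 5),
                      (2, 1, 2, 2, 30.0, 6),
                      (2, 2, 1, 1, 70.0, 7),
                      (1, 1, 2, 2, 20.0, 4)])


def sales_by_region_and_month(conn, product_type):
    """Slice on one product type, then total the facts by region and month."""
    return conn.execute("""
        SELECT s.region, t.month, SUM(f.dollar_sales), SUM(f.unit_sales)
        FROM sale_fact AS f
        JOIN store_dim   AS s ON s.store_key   = f.store_key
        JOIN time_dim    AS t ON t.time_key    = f.time_key
        JOIN product_dim AS p ON p.product_key = f.product_key
        WHERE p.product_type = ?          -- the dimension acts as a constraint
        GROUP BY s.region, t.month
        ORDER BY s.region, t.month
    """, (product_type,)).fetchall()


def sales_by_region(conn):
    """Roll up across every other dimension, leaving only region."""
    return conn.execute("""
        SELECT s.region, SUM(f.dollar_sales)
        FROM sale_fact AS f
        JOIN store_dim AS s ON s.store_key = f.store_key
        GROUP BY s.region
        ORDER BY s.region
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_sample_rows(connection)
    for row in sales_by_region_and_month(connection, "Beverage"):
        print(row)
    for row in sales_by_region(connection):
        print(row)
```

Both queries have the same shape: join the fact table to whichever dimensions are of interest, constrain on dimension attributes, and sum the additive facts. Choosing different GROUP BY columns is the relational equivalent of "slice and dice", and it is one way a ROLAP tool can present relational data as if it were a cube.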