Data Warehousing Fundamentals
Volume 2 • Student Guide

50102GC20
Production 2.0
May 1999
M08762

Authors
Chon S. Chua
Richard Green

Technical Contributors and Reviewers
Jackie Collins
Jennifer Jacoby
Mike Schmitz
John Haydu
Russ Pitts
Lauran Serhal
Brian Pottle
Donna Corrigan
Patricia Moll
Harry Penbert
SuiWah Chan
Joel Barkin
Steve Dressler

Publisher
Tony McGettigan

Copyright Oracle Corporation, 1999. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend
Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c) (1) (ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with “Restricted Rights,” as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice.
If you find any problems in the documentation, please report them in writing to Education Products, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free. Data Warehouse Method—A Methodology for Designing Data Warehouse, SQL*Loader, PL/SQL, Pro*C, Oracle7, Oracle8, and Oracle8i, Distributed Option, Parallel Query Option, Parallel Server Option, Media Server, Spatial Data Option, ConText Option, Video Server, Text Server, WebServer, Oracle Universal Server ROLAP Option, Express Server, Web-enabled Express Server, SQL*Net, Developer/2000, Relational Access Manager, Discoverer, Designer/2000, SQL*Bridge, Transparent Gateway Developer’s Kit, Procedural Gateway Developer’s Kit, Express, Express Analyzer, Express Objects, Sales Analyzer, and Financial Analyzer are product names, trademarks, or registered trademarks of Oracle Corporation. All other products or company names are used for identification purposes only and may be trademarks of their respective owners. Contents ..................................................................................................................................................... 
Preface
    Profile
    Related Publications
    Typographic Conventions

Lesson 1: Introduction
    Course Objectives
    Agenda
    Questions About You

Lesson 2: Meeting a Business Need
    Overview
    Unsuitability of OLTP Systems for Complex Analysis
    Management Information Systems and Decision Support
    Data Extract Processing
    Business Drivers for Data Warehouses
    Current Situation and Growth of Data Warehousing
    Typical Uses of a Data Warehouse
    Summary
    Practice 2-1

Lesson 3: Defining Data Warehouse Concepts and Terminology
    Overview
    Data Warehouse Definition
    Data Warehouse Properties
    Data Warehouse Terminology
    Components of a Data Warehouse
    Oracle Warehouse Vision, Products, and Services
    Summary
    Practice 3-1

Lesson 4: Driving Implementation Through a Methodology
    Overview
    Warehouse Development Approaches
    The Need for an Iterative and Incremental Methodology
    Oracle Data Warehouse Method
    DWM Fundamental Elements
    Oracle Warehouse Technology Initiative (WTI)
    Summary
    Practice 4-1

Lesson 5: Planning for a Successful Warehouse
    Overview
    Managing Financial Issues
    Obtaining Business Commitment
    Managing a Warehouse Project
    Identifying Planning Phases
    Identifying Warehouse Strategy Phase Deliverables
    Identifying Project Scope Phase Deliverables
    Summary
    Practice 5-1

Lesson 6: Analyzing User Query Needs
    Overview
    Types of Users
    Gathering User Requirements
    Managing User Data Access
    Security
    OLAP
    Query Access Architectures
    Summary
    Practice 6-1

Lesson 7: Modeling the Data Warehouse
    Overview
    Data Warehouse Database Design Phases
    Phase One: Defining the Business Model
    Phase Two: Creating the Dimensional Model
    Data Modeling Tools
    Summary
    Practice 7-1

Lesson 8: Choosing a Computing Architecture
    Overview
    Architecture Requirements
    The Hardware Architecture
    Database Server Requirements
    Parallel Processing
    Summary
    Practice 8-1

Lesson 9: Planning Warehouse Storage
    Overview
    The Server Data Architecture
    Protecting the Database
    Summary
    Practice 9-1

Lesson 10: Building the Warehouse
    Overview
    Extracting, Transforming, and Transporting Data
    Extracting Data
    Examining Data Sources
    Extraction Techniques
    Extraction Tools
    Summary
    Practice 10-1

Lesson 11: Transforming Data
    Overview
    Importance of Data Quality
    Transformation
    Transforming Data: Problems and Solutions
    Transformation Techniques
    Transformation Tools
    Summary
    Practice 11-1

Lesson 12: Transportation: Loading Warehouse Data
    Overview
    Transporting Data into the Warehouse
    Building the Transportation Process
    Transporting the Data
    Postprocessing of Loaded Data
    Summary
    Practice 12-1

Lesson 13: Transportation: Refreshing Warehouse Data
    Overview
    Capturing Changed Data
    Limitations of Methods for Applying Changes
    Purging and Archiving Data
    Final Tasks
    Selecting ETT Tools
    Summary
    Practice 13-1

Lesson 14: Leaving a Metadata Trail
    Overview
    Defining Warehouse Metadata
    Developing a Metadata Strategy
    Examining Types of Metadata
    Metadata Management Tools
    Common Warehouse Metadata
    Summary
    Practice 14-1

Lesson 15: Supporting End-User Access
    Overview
    Business Intelligence
    Multidimensional Query Techniques
    Categories of Business Intelligence Tools
    Data Mining in a Warehouse Environment
    Oracle Data Mining Partners
    Summary
    Practice 15-1

Lesson 16: Web-Enabling the Warehouse
    Overview
    Accessing the Warehouse Over the Web
    Common Web Data Warehouse Architecture
    Issues in Deploying a Data Warehouse on the Web
    Evaluating Web-Based Tools
    Summary
    Practice 16-1

Lesson 17: Managing the Data Warehouse
    Overview
    Managing the Transition to Production
    Managing Growth
    Managing Backup and Recovery
    Identifying Data Warehouse Performance Issues
    Summary

Appendix A: Practice Solutions
    Practice 2-1
    Practice 3-1
    Practice 4-1
    Practice 5-1
    Practice 6-1
    Practice 7-1
    Practice 8-1
    Practice 9-1
    Practice 10-1
    Practice 11-1
    Practice 12-1
    Practice 13-1
    Practice 14-1
    Practice 15-1
    Practice 16-1

Glossary

Lesson 10: Building the Warehouse
Overview

[Slide: course roadmap. Blocks shown: Meeting a Business Need; Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Modeling the Data Warehouse; Choosing a Computing Architecture; Planning Warehouse Storage; ETT (Building the Warehouse); Supporting End User Access; Managing the Data Warehouse; Project Management (Methodology, Maintaining Metadata).]

In this lesson, you explore the sources of data for the data warehouse. You consider how the extraction and transformation processes take data from source systems and change it into data that is acceptable to the users of the data warehouse. The lesson also describes typical data anomalies and looks at ways to eliminate them.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide.

Objectives
After completing this lesson, you should be able to do the following:
• Outline the extraction, transformation, and transportation processes for building a data warehouse.
• Identify extraction issues.
• Explain how to examine data sources.
• Identify extraction techniques.
• List tools that can be used to extract data from sources.
Extracting, Transforming, and Transporting Data

[Slide: Extraction/Transformation/Transportation Processes (ETT). Diagram: operational systems feed the warehouse through ETT, which is carried out by programs, gateways, and tools.]

Extraction, Transformation, and Transportation Tasks
Before considering this lesson’s focus on extraction, you should be aware that extraction, transformation, and transportation (sometimes called ETT) describes the series of processes that:
• Extract data from source systems
• Transform and clean up the data
• Index the data
• Summarize the data
• Load data into the warehouse
• Detect the changes made to source data required for the warehouse
• Restructure keys
• Maintain the metadata
• Refresh the warehouse with updated data

You can use custom programming, gateways between database systems, and internally developed tools or vendor tools to carry out the ETT processes.
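The first few tasks in the list above (extract, transform and clean up, summarize, load) can be sketched as a small pipeline. This is only an illustrative sketch, not a method from the course: the source rows, the field names, and the in-memory dictionary standing in for the warehouse are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical source rows standing in for an operational system.
SOURCE_ROWS = [
    {"customer": " hollywood ", "amount": "12345.00", "clerk": "jd"},
    {"customer": "Hollywood", "amount": "5678.00", "clerk": "mk"},
    {"customer": "ABC Co", "amount": "not-a-number", "clerk": "jd"},  # dirty record
]


def extract(rows):
    """Select only the fields the warehouse needs (the clerk field is dropped)."""
    return [{"customer": r["customer"], "amount": r["amount"]} for r in rows]


def transform(rows):
    """Clean up names and convert amounts; set aside rows that cannot be converted."""
    clean, rejected = [], []
    for r in rows:
        try:
            clean.append({
                "customer": r["customer"].strip().title(),
                "amount": float(r["amount"]),
            })
        except ValueError:
            rejected.append(r)
    return clean, rejected


def summarize(rows):
    """Build a per-customer total, a stand-in for a warehouse summary table."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def load(warehouse, rows, summary):
    """Append detail rows and replace the summary in the (in-memory) warehouse."""
    warehouse["sales"].extend(rows)
    warehouse["sales_by_customer"] = summary
    return warehouse


def run_ett(source_rows):
    """Run extract, transform, summarize, and load in sequence."""
    warehouse = {"sales": [], "sales_by_customer": {}}
    clean, rejected = transform(extract(source_rows))
    load(warehouse, clean, summarize(clean))
    return warehouse, rejected
```

Calling `run_ett(SOURCE_ROWS)` returns the loaded warehouse together with the rejected rows, so that dirty records are kept for correction rather than silently discarded.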
ETT Processes

[Slide: ETT Processes. Diagram: data from operational systems is cleaned up, consolidated, and restructured by ETT on its way to the warehouse, where it should be relevant, useful, high quality, accurate, and accessible.]

ETT Importance
The extraction, transformation, and transportation processes are fundamental in ensuring that the data resident in the warehouse is:
• Relevant and useful to the business users
• High quality
• Accurate
• Easy to access, so that the warehouse is used efficiently and effectively by the business users

ETT Cost
Building the ETT process is potentially one of the biggest tasks of building a warehouse; it is complex and time-consuming. In some implementations, it can take more than half of the total warehouse implementation effort.

Note: Extraction is covered by this lesson; transformation and transportation are considered in the next two lessons.
The Data Staging Area

[Slide: Data Staging Area. Diagram: data is extracted from the operational system into the data staging area, transformed there, and then transported (loaded) into the warehouse.]

Ralph Kimball is one of the most widely recognized experts in the field of data warehousing. Kimball calls the data staging area the construction site for the warehouse. This is where much of the data transformation and cleansing takes place.

A staging area is a typical requirement of warehouse implementations. It may be an operational data store (ODS) environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools.

You may employ multitier staging that reconciles data before and after the transformation process and before data is loaded into the warehouse. As many as three tiers are possible, from the operational server to the staging area and then to the warehouse server.

Note: Some ETT tools stage data internally and do not require a separate staging area.

If you are using the Oracle server and in-house developed tools, data is typically transformed after it is bulk-loaded (using SQL*Loader) into the staging area, that is, into database tables.
PL/SQL is often used to transform the data. You may also use gateways and replication techniques.

Possible Staging Models

[Slide: Remote Staging Model. Two variants are shown. In the first, the data staging area is within the warehouse environment. In the second, the data staging area is in its own environment, avoiding negative impact on the warehouse environment. In both, data is extracted, transformed, and transported from the operational system to the staging area, transformed there, and then transported (loaded) into the warehouse.]

[Slide: Onsite Staging Model. The data staging area is within the operational environment, possibly affecting the operational system. Data is extracted from the operational system into the staging area, transformed, and then transported (loaded) into the warehouse.]

Choosing a Model
The model you choose depends on operational and warehouse requirements, system availability, connectivity bandwidth, gateway access, and the volume of data to be moved or transformed.
Remote Staging Model
You may choose to extract the data from the operational environment and transport it into the warehouse environment for transformation processing. You may optionally execute some transformation processing during the extraction and transportation from the operational to the warehouse environment. You would then execute the bulk of the transformation processing in the warehouse environment’s staging area.

On-site Staging Model
Alternatively, you may choose to perform the cleansing, transformation, and summarization processes locally in the operational environment and then extract to the staging area. This model may conflict with the day-to-day working of the operational system. If you choose this model, run its processes when the operational system is idle or less heavily used.

[Slide: Extracting Data. Diagram: data mapping leads from the operational databases into the data staging area, where data is transformed before reaching the warehouse database.]
• Routines developed to select fields from source
• Various data formats
• Rules, audit trails, error correction facilities

[Slide: Source Systems]
• Production
• Archive
• Internal
• External
Extracting Data
Data extraction selects the data fields that pertain to the subject area maintained by the data warehouse. The data may come from a variety of source systems, and the data may exist in a variety of formats.

Source Systems
The source data may exist in:
• Production operational systems
• Archives
• Internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks
• External data from outside the company

Extraction Routines
The routines created for extraction are specifically developed to account for the variety of systems from which data is taken. The routines contain data or business rules, audit trails, and error correction facilities. The routines take into account the frequency with which data is to be extracted.
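As an illustration of what such an extraction routine might contain, the sketch below selects fields from a flat-file source, applies business rules, sets failing records aside for correction, and writes an audit record. It is a minimal sketch under assumptions: the file layout, the field names, and the two rules are hypothetical and are not taken from the course material.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical flat-file extract from a production system.
SOURCE_CSV = """order_id,customer,amount,internal_flag
1001,Hollywood,250.00,N
1002,,90.00,N
1003,ABC Co,-15.00,N
"""

# Only these fields pertain to the (assumed) warehouse subject area.
WANTED_FIELDS = ("order_id", "customer", "amount")


def check_rules(record):
    """Return a list of business-rule violations (the rules are illustrative)."""
    problems = []
    if not record["customer"]:
        problems.append("missing customer")
    if float(record["amount"]) < 0:
        problems.append("negative amount")
    return problems


def extract(source_text, source_name):
    """Select fields, apply rules, and return (extracted, errors, audit)."""
    extracted, errors = [], []
    reader = csv.DictReader(io.StringIO(source_text))
    # Data starts on line 2 because line 1 is the header row.
    for line_no, row in enumerate(reader, start=2):
        record = {field: row[field] for field in WANTED_FIELDS}
        problems = check_rules(record)
        if problems:
            # Error correction facility: keep the record and the reasons.
            errors.append({"line": line_no, "record": record, "problems": problems})
        else:
            extracted.append(record)
    # Audit trail: what was read, when, and how much was rejected.
    audit = {
        "source": source_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "rows_read": len(extracted) + len(errors),
        "rows_extracted": len(extracted),
        "rows_rejected": len(errors),
    }
    return extracted, errors, audit
```

Keeping rejected records together with the reason for rejection is one simple way to support later correction; a real routine would also persist the audit record rather than only return it.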
[Slide: Production Data. Example sources shown: IMS, SAP, DB2, Shared Medical Systems, VSAM, Dun and Bradstreet Financials, NonStop SQL, Oracle, Hogan Financials, Sybase, Oracle Financials, Rdb.]

[Slide: Archive Data. Diagram: archived data from the operational databases feeds the warehouse database.]

Examining Data Sources

Production Data
Production data may come from a multitude of different sources:
• Operating system platforms
• Hardware platforms
• File systems (flat files)
• Database systems, for example, Oracle, DB2, dBase, Informix, ISAM, NonStop SQL, Rdb, and TurboImage
• Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun and Bradstreet

Archive Data
Archive data may be useful to the enterprise in supplying historical data. Historical data is needed if analysis over long periods of time is to be achieved.

Archive data is not used consistently as a source for the warehouse; for example, it would not be used for regular data refreshes. However, for the initial implementation of a data warehouse (and the first-time load), archived data is an important source of historical data.

You need to consider this carefully when planning the data warehouse.
How much historical data do you have available for the data warehouse? How much effort is necessary to transform it into an acceptable format? The archive data may need some careful and unique transformations, and clear details of the changes must be maintained in metadata.

[Slide: Internal Data. Diagram: spreadsheets from Planning, Marketing, and Accounting feed the warehouse database.]
• Planning, sales, and marketing organization data
• Maintained by:
  – Spreadsheets (structured)
  – Documents (unstructured)
• Treated like any other source data

Internal Data
Internal data may be information prepared by planning, sales, or marketing organizations that contains data such as budgets, forecasts, or sales quotas. The data contains figures (numbers) that are used across the enterprise for comparison purposes.
The data is maintained using software packages such as spreadsheets and word processors and uploaded into the warehouse. Internal data is treated like any other source system data. It must be transformed, documented in metadata, and mapped between the source and target databases.

[Slide: External Data. Example sources shown feeding the warehousing databases: A.C. Nielsen, IRI, IMS, Walsh America; purchased databases; competitive information; economic forecasts; Dun and Bradstreet; Barron’s; Wall Street Journal.]

External Data
External data is important if you want to compare the performance of your business against others. There are many sources for external data:
• Periodicals and reports
• External syndicated data feeds (some warehouses rely regularly on these as a source)
• Competitive analysis information
• Newspapers
• Purchased marketing, competitive, and customer-related data
• Free data from the Web

Issues
You must consider the following issues with external data:
• Frequency: Unlike internal data, external data follows no regular pattern of arrival. Constant monitoring is required to determine when it is available.
• Format: The data may differ in format from internal data, and the granularity of the data may be an issue. To make it useful to the warehouse, a certain amount of reformatting may be required. In addition, you may find that external data, particularly data available on the Web, comes with digital audio data, picture image data, and digital video data. These present a challenge in storage and speed of access.
• Predictability: External data is not predictable; it can come from any source at any time, in any format, on any medium.

Tracked Using Metadata
Metadata (described earlier as descriptive data about data) plays an invaluable role in the registration, access, and control of external data. The metadata should provide the warehouse manager with as much information about the external data as possible, averting the need to examine the data closely.

Note: ETT decisions and strategies can evolve over time throughout the life of the warehouse. It may be prudent to track those strategies and decisions, so that you can always explain the algorithmic logic or business rules used at different times with current, recent, or archived data.

[Slide: Mapping]
• Defines which operational attributes to use
• Defines how to transform the attributes for the warehouse
• Defines where the attributes exist in the warehouse
• Mapping tools are available

Example on the slide: metadata maps the fields of File A to the fields of Staging File One.

File A field    Staging File One field    Source value    Staged value
F1              Number                    123             USA123
F2              Name                      Bloggs          Mr. Bloggs
F3              DOB                       10/12/56        10-Dec-56
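The mapping example on the slide (File A fields F1, F2, and F3 mapped to Number, Name, and DOB in Staging File One) can be sketched as a small table of source field, target field, and transformation rule. This is a hedged sketch: the slide shows only the before and after values, so the rules themselves (a fixed "USA" prefix, a fixed "Mr." title, and a day/month/year reading of the date) are assumptions inferred from that single record.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def add_country_prefix(value):
    """Assumed rule: prefix the source number with a fixed country code."""
    return "USA" + value


def add_title(value):
    """Assumed rule: prefix the name with a fixed title."""
    return "Mr. " + value


def reformat_date(value):
    """Assumed rule: read the source date as DD/MM/YY and emit DD-Mon-YY."""
    day, month, year = value.split("/")
    return f"{int(day):02d}-{MONTHS[int(month) - 1]}-{year}"


# Mapping metadata: (source field, target field, transformation rule).
MAPPING = [
    ("F1", "Number", add_country_prefix),
    ("F2", "Name", add_title),
    ("F3", "DOB", reformat_date),
]


def apply_mapping(source_record, mapping=MAPPING):
    """Build a staging record by applying each rule to its source field."""
    return {target: rule(source_record[source])
            for source, target, rule in mapping}


staged = apply_mapping({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
print(staged)
```

Holding the mapping as data rather than burying it in code makes it easier to store the same information as metadata and to review it attribute by attribute.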
Mapping Data
Once you have determined your business subjects for the warehouse, you need to determine the required attributes from the source systems. On an attribute-by-attribute basis you must determine how the source data maps into the data warehouse, and what, if any, transformation rules to apply. This is known as mapping. There are mapping tools available.

Mapping information should be maintained in metadata that is server (RDBMS) resident, for ease of access, maintenance, and clarity.

[Slide: Extraction Techniques]
• Programs: C, COBOL, PL/SQL
• Gateways: transparent database access
• In-house development is popular
• Tools
  – High initial cost
  – Ongoing automation
  – Data cleanup
Extraction Techniques
You can extract data from different source systems to the warehouse in different ways:
• Programmatically, using procedural languages such as COBOL, C, C++, or procedural SQL (PL/SQL)
• Using a gateway to access data sources. This method is acceptable only for small amounts of data; otherwise, the network traffic becomes unacceptably high.
• Using in-house developed tools that:
– Store a physical definition of the source and warehouse data
– Create data dictionaries
– Generate data conversion programs
– Clean and transform the data
– Allow selective retrieval
– Maintain metadata
Note: In-house development is an ongoing activity that may become a resource black hole. You need local knowledge to support all of the file formats.
• Using a vendor’s data extraction tool. Although it is expensive, an extraction tool:
– Provides ongoing automation of the data extraction process
– Supports data cleanup
More than 50% of companies use their own in-house development teams to develop data extraction programs.
The extraction process may access different host system media, such as fiche, optical, tape, CD, and disk formats.

Sources and Targets
[Slide graphic; labels: Sources, ODS, Warehouse, Access, Data marts, Data analysis, Data mining, OLAP]
Sources and Targets
To summarize, the data for the warehouse is a complex mixture of structured and unstructured data from different source systems. It all needs to be moved in a clean and integrated state into the warehouse.
Note: The same process is performed for current data that is to reside in an operational data store.

Designing Extraction Processes
• Analysis:
– Sources, technologies
– Data types, quality, owners
• Design options:
– Manual, custom, gateway, third-party
– Replication, full, or delta refresh
• Design issues:
– Batch window, volumes, data currency
– Automation, skills needed, resources

Designing Extraction Processes
When designing your extraction processes, consider the analysis issues, the design options available to you, and the design issues.
Analysis
• The sources and technologies used
• Existing data feeds and redo logs
• Data types (EBCDIC or ASCII)
• Data quality and ownership
• Data volumes
• Operational schedule in the source environment
• Spare processing capacity in the source environment

Design Options
• Manual data entry
• Custom programs
• Gateway technologies
• Replication techniques
• Third-party tools
• Full refresh or delta changes

Design Issues
• Batch window
• Data volumes
• Data currency (how up-to-date the data is to be)
• Degree of automation required
• Technology skills needed
• Time and money available

Maintaining Extraction Metadata
• Source location, type, structure
• Access method
• Privilege information
• Temporary storage
• Failure procedures
• Validity checks
• Handlers for missing data

Maintaining Extraction Metadata
It is essential to maintain a “metadata trail” of information about all ETT processes, including the extraction process. This information is important for warehouse enhancement and performance improvements. The quality of metadata is critical for every aspect of the warehouse; attention must be paid to its control, management, and change.
Extraction metadata includes:
• The source location, type, contact, and structure information
• The access method
• The privilege information
• The extraction temporary storage information
• The extraction failure and validity check procedures information
• Information about how to handle missing data
Extraction metadata also contains information about the frequency of program execution and maps the source data to the target database.

Possible ETT Failures
• A missing source file
• A system failure
• Inadequate metadata
• Poor mapping information
• Inadequate storage planning
• A source structural change
• No contingency plan
• Inadequate data validation

Possible ETT Failures
ETT processes are vital to the warehouse, and they must succeed. ETT may fail for any of the following reasons:
• Extraction routines must specify the name and location of the source data. A missing file may cause the extraction to fail. You must therefore ensure that exception and error handling routines are included.
• If there is a system or media failure during the process, the process may fail entirely. You must start again or you may, depending upon system settings, be able to continue from the point of failure.
• Metadata that inadequately describes the source-to-destination mapping and rules will cause ETT to fail; for example, when an unexpected value is found.
• Without the space for temporary data, staging data, and sorting operations, ETT fails.
• Any changes to the source systems that are not documented in metadata will cause extraction to fail.
• Contingency plans are needed, including mechanisms for correcting or reapplying processing.
• If data is not validated correctly, the quality of extraction and the success of transformation cannot be guaranteed. This translates to a data warehouse that may contain dirty data at the end of the load.

Maintaining ETT Quality
• ETT must be:
– Tested
– Documented
– Monitored and reviewed
• Disparate metadata must be coordinated

Maintaining ETT Quality
Any failure of the ETT processes affects data quality, the importance of which must not be underestimated. Inaccurate data leads to inaccurate analysis results, which lead to bad business decisions. The result of poor data quality is a lack of confidence in the system to deliver the solution.
Testing the Process
You should test the proposed ETT techniques to ensure that volumes can be physically moved within the load window constraints and network capabilities.

Documenting the Process
You must document the proposed load processes and communicate them to the operations organization to ensure their agreement and commitment to this important process.

Monitoring and Reviewing the Process
You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision. The grain of the warehouse affects query capabilities and the warehouse size.

Extraction Tools
[Slide graphic: Map Source Data to Intermediate File Store; labels: Sales and Marketing, Customer Name, Varchar, Char 20, Mapping information, Unique name, JCL files, Update metadata]

Selection Criteria
• Base functionality
• Interface features
• Metadata repository
• Open API
• Metadata access
• Repository utilities
• Input and output processing
• Cleansing, reformatting, and auditing
• References
• Training requirements
Extraction Tools
Extraction tools normally have a GUI front end that allows you to enter the individual field mappings from source to target systems. The tools normally:
• Generate the required code for the mapping, whether COBOL, C, or any other language
• Create the necessary job control and scheduling files for the specific platform
• Create and manage changes to the metadata

Selection Criteria
The warehouse uses a host of different tools for extraction, modeling, management, and access. A tools selection committee must ensure that every tool selected meets identified requirements. This is usually a rigorous process. If you decide to buy an extraction tool, consider the following fundamental issues:
• Base functionality
• Interface features and functionality
• The metadata repository and the attributes stored in the repository
• Open API
• Access to metadata by end users
• The effectiveness of the way that the tool presents the information
• Repository utilities, such as scheduling and name and address management
• Data extraction inputs and outputs
• Data cleansing, reformatting, and auditing features
Ask the tool vendor for customer references, so that you can ask those customers to describe their goals, successes, and failures with the product.
Consider the training required for the extraction tool. The complexity of the available extraction products varies, as does the ability of your staff. Training may be required for a few days or weeks.
WTI Partner ETT Tools
• Carleton
• Constellar
• Evolutionary Technologies
• Informatica
• Information Builders
• Oracle EDMS, Toolkits, OADW
• Prism Solutions
• Sagent
• Vality Technology

WTI Partner ETT Tools
WTI Partner                  Product
Carleton Corp                Carleton Passport, Carleton Passport Development Workbench
Constellar                   Constellar Hub
Evolutionary Technologies    ETI Development Workbench, ETI Extract Tool Suite
Informatica Corporation      PowerMart (Designer, Server, and Manager)
Information Builders, Inc.   EDA Copy Manager
Oracle                       EDMS (Extraction and Transformation Template), Toolkits, OADW
Prism Solutions, Inc.        Prism Change Manager, Prism Development Workbench, Prism Warehouse Manager
Sagent                       Data Mart Suites
Vality Technology, Inc.      Integrity Data Re-engineering Tool
The choice of ETT techniques and tools is often driven by the quality of the source data.
Summary
This lesson discussed the following topics:
• ETT processes are essential and consume a large proportion of warehouse resources and time
• The extraction process acquires source data
• You may encounter many data sources
• There are many data extraction issues
• ETT tools should be considered

Practice 10-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Specifying true or false to a series of statements

Practice 10-1
Please answer the following questions.
1 The acronym ETT stands for _________________________________________.
2 Name at least four potential sources of production data for the warehouse.
_____________________
_____________________
_____________________
_____________________
3 Name at least five potential sources of external data for the warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
4 Identify whether the following statements are true or false.
• Archive data is never used in a data warehouse; it is too old. (True / False)
• External data is one of the easiest types of data to incorporate into the warehouse. (True / False)
• Mapping data is a process whereby you eliminate data inconsistencies. (True / False)
• Gateways are great mechanisms for transferring large volumes of data into the warehouse. (True / False)
• Extraction tools are expensive. (True / False)
• Transforming data occurs only in the staging area. (True / False)

Lesson 11: Transforming Data

Overview
[Slide graphic: course map with the following blocks]
• Defining DW Concepts & Terminology
• Planning for a Successful Warehouse
• Planning Warehouse Storage
• Choosing a Computing Architecture
• Meeting a Business Need
• Modeling the Data Warehouse
• ETT (Building the Warehouse)
• Analyzing User Query Needs
• Managing the Data Warehouse
• Supporting End User Access
• Project Management (Methodology, Maintaining Metadata)
Overview
The last lesson introduced extraction, transformation, and transportation. The lesson then focused on extraction issues. In this lesson, you explore how the transformation process transforms data from source systems into data suitable for end user query and analysis applications.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives
At the end of this lesson, you should be able to:
• Explain the importance of quality data
• Define the term “transformation”
• Identify transformation issues
• Describe techniques for transforming data
• List tools that can be used to transform data

Importance of Data Quality
[Slide graphic: browser windows showing customer lists; labels: Hollywood, Speedy Pizza, Summit Sports]
Benefits of Quality Data
• Clean data is essential for:
– Targeting customers
– Determining buying patterns
– Identifying householders: private and commercial
– Matching customers
– Identifying historical data
• Dirty data must be removed.

Importance of Quality Data
The importance of quality data in the data warehouse cannot be overemphasized. Data anomalies are bound to exist in source systems. If they are allowed to get into the data warehouse, they lead to inaccurate information, which in turn leads to inaccurate reports and bad business decisions. The overall result is a lack of confidence in the system to deliver the solution and a data warehouse that either is not used or requires substantial improvement and management buy-in. Quality data is the key to a successful warehouse; it is better to have no data at all than bad data.

Benefits of Quality Data
All dirty data must be eliminated from the staging area, to ensure you can query the warehouse to:
• Target the right audience for marketing communication
• Determine that a particular customer buys related products
• Determine that a group of people form a family, each of whom is a potential customer (householding)
• Identify that an organization is part of a larger enterprise (commercial householding)
• Identify that a customer is now part of another organization, because of acquisition or takeover
• Match customers where there are many different records for the same customer. (For example, the different components of health care, such as the hospital, the pharmacy, and the doctor have their own records, or a patient may be treated by different physicians in the same hospital.)
• Identify the age of data and its history
Note: The terms scrubbing, cleaning, cleansing, and data reengineering are used interchangeably.

Standards
• Define a quality strategy
• Decide on optimal data-quality level

Quality Improvements
• Consider modifying rules for operational data
• Document the sources
• Create a data stewardship program
• Design the cleanup process carefully
• Initial cleanup and refresh routines may differ

Standards
A data-quality strategy must be defined early in the development cycle. It is imperative that you have one in place. The strategy defines the optimal level of data quality that provides the value required for the business. For example, there is little point in seeking a low data inconsistency rate at great expense if the benefit to the business is not tangible.
Improving Operational Data Quality
You may need to consider making changes over time to the operational system in order to improve the quality of data for the warehouse:
• Some of the validation and integrity rules that are applied to current operational data may need to be modified or enhanced.
• You may need to document previously undocumented sources, enlist the help of users who know the business data, and consider creating a “data stewardship” program.
• You should carefully examine the cleanup processes that you employ in transforming the extracted data.
• The initial data cleanup routines may be different from the routines applied to subsequent data refreshes.
Correcting data can be tedious, time-consuming, and expensive. Consider any modifications in a phased approach rather than fixing all problems in one attempt.

Guidelines
• Operational data should not be used directly in the warehouse
• Operational data must be cleaned for each increment
• Operational data is not simply fixed by modifying applications
Guidelines
Do not assume that because the data in the operational system suits you at the operational level, it is going to be appropriate, suitable, and of a sufficiently high quality for the data warehouse.
• The operational system contains no aging information.
• There are many examples of disparity in the data.
• There are many different meanings applied to data.
• Good operational data when merged may become poor data warehouse data.
Do not assume it is acceptable to clean up data after the pilot run of the first increment or implementation.
• The credibility of the data warehouse or data mart suffers.
• Postimplementation cleanups are more costly, and carry a higher risk, than cleanups during the pilot run.
• The programs needed to handle the multitude of problems are very complex and would need to be rewritten after cleanup.
Do not assume that fixing applications at the point of entry (operational system) is going to satisfy quality and clean up the data for the future.
• It is often too time-consuming and costly to continually implement changes at that level.
• Changes cannot be implemented quickly enough to keep up with constantly changing operational requirements.
The cost in time and resources of reengineering the existing legacy data may be too high.

Solutions
• Conventional COBOL, 4GL
• Specialized tools
• Customized conversion process
• Business experts
[Slide graphic; labels: Investigation, Conditioning, Standardization, Integration]

Management
[Slide graphic; label: Poor data quality]
• Own
• Take responsibility
• Resolve problems
• Data quality manager

Solutions
Use conventional COBOL or 4GL programs or purchase a specialized tool to capture and eradicate anomalies prior to data load. It is often very difficult to predict all possible variants. You may consider designing a process in-house to assure the quality of the data entering the data warehouse. The process must involve:
• Data investigation: Parsing, lexical analysis, and pattern investigation
• Data conditioning and standardization: Moving the data into fixed fields, standardizing names and addresses
• Data integration: Building unique keys and integrating the data
You should involve the business experts in the entire warehouse ETT process.

Management
You must manage the quality of the data, processes, and rules, and put people in place to manage them. Someone must own, be directly responsible for, and resolve the issue of poor data quality. This person is often known as the data quality manager.
Note: At some sites there is a person or a group responsible for name and address management alone.
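The conditioning, standardization, and integration steps listed under Solutions can be sketched as follows. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table and the function names are hypothetical, the sample addresses are taken from the Source Data Anomalies slide in this lesson, and real name and address standardization needs far richer rules than a lookup table.

```python
import re

# Hypothetical abbreviation table; a production table would be much larger.
ABBREVIATIONS = {
    "street": "st",
    "first": "1st",
    "road": "rd",
}


def standardize(text):
    """Condition and standardize one free-text field."""
    # Conditioning: lowercase and strip everything except letters, digits, spaces.
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    # Standardization: replace known variants with one agreed spelling.
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)


def assign_keys(records):
    """Integration: give records with the same standardized address one key."""
    keys = {}
    keyed = []
    for name, address in records:
        standard_address = standardize(address)
        # Reuse the key if this address was seen before, else issue the next one.
        key = keys.setdefault(standard_address, len(keys) + 1)
        keyed.append((key, name, address))
    return keyed


records = [
    ("Oracle Corp", "100 NE 1st Street, Tampa"),
    ("Oracle", "100 NE. First St., Tampa"),
    ("Oracle Computing", "15 Main Road, Ft. Lauderdale"),
]
keyed = assign_keys(records)
for row in keyed:
    print(row)
```

The first two records receive the same key because their addresses standardize to the same string. Variants that the table does not cover stay unmatched, which is one reason the text notes that it is difficult to predict all possible variants.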
Transformation
[Slide graphic: data flows from the operational system (Extract) to the data staging area (Transform: clean up, consolidate, restructure) and then to the warehouse (Transport/Load)]
Transformation eliminates operational data anomalies
• Cleans
• Standardizes
• Presents subject-oriented data

Source Data Anomalies
• No unique key
• Data naming and coding anomalies
• Data meaning anomalies between groups
• Spelling and text inconsistencies
CUSNUM     NAME                 ADDRESS
90328575   Oracle Corp          100 NE 1st Street, Tampa
90328575   Oracle               100 NE. First St., Tampa
90238475   Oracle Services      100 North East 1st St., FLA
90233479   Oracle Limited       100 N.E. 1st St.
90233489   Oracle Computing     15 Main Road, Ft. Lauderdale
90234889   Oracle Corp. UK      15 Main Road, Ft. Lauderdale, FLA
90345672   Oracle Corp UK Ltd   181 North Street, Key West, FLA

Transformation
Transformation involves a number of tasks, the most important being to eliminate all anomalies. Cleaning also includes eliminating formatting differences, assigning data types, defining consistent units of measure, and determining encoded structures. Along with these tasks, another objective is to ensure that the data is presented in a subject-oriented fashion.

Reasons for Data Anomalies
One of the causes of inconsistencies within internal data is that in-house system development takes place over many years, often with different software and development standards for each implementation. There may be no consistent policy for the software used in the corporate environment. Systems may be upgraded or changed over the years.
Each system may represent data in different ways.

Source Data Anomalies
Many potential problems can exist with source data:
• No unique key for individual records
• Anomalies within data fields, such as differences between naming and coding (data type) conventions
• Differences in the interpreted meaning of the data by different user groups
• Spelling errors and other textual inconsistencies (this is particularly relevant in the area of customer names and addresses)

Transformation Routines
• Cleaning data
• Eliminating inconsistencies
• Adding elements
• Merging data
• Integrating data
• Transforming data before load

Transformation Routines
One reason for the inconsistencies with internal data is that in-house system development takes place over many years and often uses different software and standards for each implementation.
Transformation routines include:
• Cleaning the data, also referred to as data cleansing or scrubbing
• Adding an element of time to the data, if it does not already exist
• Translating the formats of external and purchased data into something meaningful for the warehouse
• Merging rows or records in files
• Integrating all the data into files and formats to be loaded into the warehouse
Transformation should be performed:
• Before the data is loaded into the warehouse
• In parallel (on larger databases, there is not enough time to perform this process as a single-threaded process)
The transformation process should be self-documenting, should generate summary statistics, and should process exceptions.

Transforming Data: Problems and Solutions
Multipart keys
Product code = 12M65431345
Diagram labels: Country code, Sales territory, Product number, Salesperson code

Multipart Keys Problem
Many older operational systems used record key structures that had a built-in meaning. To allow for decision support reporting, these keys must be broken down into atomic values. In the example, the key contains four atomic values.
Key Code: 12M65431345
Where:
12 is the country code
M is the sales territory
65431 is the product code
345 is the salesperson

Solution
The program or tools you use must be capable of identifying, on a character-by-character (or position-by-position) basis, the individual values, the length of each value, and the meaning of the resulting information. In the example quoted it is important that the code can extract the M and know that this is a territory code that identifies “Midwest,” “Manchester,” or “Moscow.” You may need to build a series of transforms to evaluate the results fully. For example, these steps may be appropriate:
1 Extract third character position.
2 Evaluate the character against a master lookup table.
3 Evaluate the meaning of M.
4 Store the meaning (Moscow) in a field for insertion into the data warehouse.

Transforming Data
• Multiple encoding: m, f; 1, 0; male, female
• Must pick up erroneous data: mle, female; 1, NULL
If field not in (‘m’, 1, ‘male’) then … else if field is NULL then …

Transforming Data
• Multiple local standards
• Tools or filters to preprocess
Examples: cm, inches; DD/MM/YY, MM/DD/YY, DD-Mon-YY; 1,000 GBP, FF 9,990, USD 600
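The four steps given earlier for the multipart key 12M65431345 can be sketched in a few lines of Python. This is only an illustration: the lookup table, the field names, and the choice of "Moscow" as the meaning of M are assumptions (the text lists three possible meanings and uses Moscow in step 4), and a real key layout would come from the source system documentation.

```python
from dataclasses import dataclass

# Hypothetical master lookup table. The mapping of "M" to "Moscow" is an
# assumption taken from step 4 of the example, not a documented code list.
TERRITORY_LOOKUP = {"M": "Moscow"}


@dataclass(frozen=True)
class KeyParts:
    country_code: str
    territory_code: str
    territory_name: str
    product_code: str
    salesperson_code: str


def split_multipart_key(key: str) -> KeyParts:
    """Split an 11-character key such as 12M65431345 by character position."""
    if len(key) != 11:
        raise ValueError(f"expected 11 characters, got {len(key)}: {key!r}")
    territory_code = key[2]  # step 1: extract the third character position
    # steps 2 and 3: evaluate the character against the lookup table
    if territory_code not in TERRITORY_LOOKUP:
        raise ValueError(f"unknown territory code {territory_code!r}")
    return KeyParts(
        country_code=key[0:2],
        territory_code=territory_code,
        territory_name=TERRITORY_LOOKUP[territory_code],  # step 4: store meaning
        product_code=key[3:8],
        salesperson_code=key[8:11],
    )


parts = split_multipart_key("12M65431345")
print(parts)
```

Raising an error for an unknown territory code is one possible design choice; writing such keys to an exception file for later review would fit the same steps.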
Multiple Encoding Problem
Some systems may represent values in different ways. For example, some systems may use M to denote “male” and F to denote “female”, while others use 1 and 0, or even NULL values.

Solution
The program must be capable of identifying all the distinct possibilities and of handling exceptions. For example, your program considers that a male might be either M, or NULL, or Male, but it does not take into account spurious and bad entries such as Man, Mle, N/A. Your program must be capable of picking up the spurious and bad entries and changing the values to something appropriate, such as:
1 Select all M, or NULL, or Male.
2 Place all other records into a file for reprocessing.
3 Interpret records to be reprocessed and determine from other related values in the record whether the person is male or female.
4 Change the value accordingly, and reprocess rows, selecting the newly marked records.

Multiple Local Standards Problem
This is particularly relevant for values entered in different countries. For example, some countries use imperial measurements and others metric; currencies and date formats differ; currency values and character sets may vary; and numeric precision values may differ. Currency values are often stored in two formats, a local currency such as sterling, French francs, or Australian dollars, and a global currency such as U.S. dollars.

Solution
Typically, you use tools or filters to preprocess this data into a suitable format for the database, with the logic needed to interpret and reconstitute a value. You might employ steps similar to those identified for multiple encoding. You may consider revising source applications to eliminate these inconsistencies early on.
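As a rough sketch of such a preprocessing filter, the Python below standardizes an encoded field and a locally formatted date, and sets aside anything it cannot interpret. The code table, the source system names, and the record layout are assumptions made for illustration. Unlike step 1 in the text, this sketch sends NULL values to the reprocessing list rather than treating them as a known code.

```python
from datetime import date, datetime

# Assumed mapping of known source encodings to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}

# Assumed date format for each (hypothetical) source system.
DATE_FORMATS = {"uk_orders": "%d/%m/%y", "us_orders": "%m/%d/%y"}


def standardize_gender(raw):
    """Return 'M' or 'F', or None when the value needs reprocessing."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


def standardize_date(raw: str, source: str) -> date:
    """Parse a date string using the format registered for its source."""
    return datetime.strptime(raw, DATE_FORMATS[source]).date()


def split_records(records):
    """Separate records with a recognized code from those to reprocess."""
    accepted, rejected = [], []
    for rec in records:
        gender = standardize_gender(rec["gender"])
        if gender is None:
            rejected.append(rec)  # goes to the file for reprocessing
        else:
            accepted.append({**rec, "gender": gender})
    return accepted, rejected


accepted, rejected = split_records([
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": 1},
    {"id": 3, "gender": "Mle"},
    {"id": 4, "gender": None},
    {"id": 5, "gender": "f"},
])
print(accepted, rejected)
print(standardize_date("01/02/98", "uk_orders"))
print(standardize_date("01/02/98", "us_orders"))
```

The same string, 01/02/98, yields two different dates depending on the source, which is why the source system has to travel with the value until it is standardized.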
Multiple Files Problem
• Added complexity of multiple source files
• Start simple
Diagram labels: Multiple source files, Logic to detect correct source, Extracted data

Transforming Data from Multiple Files
Chart: conflict and integration points (axis 0 to 16) plotted against sources to be incorporated (2 to 6)

Multiple Files Problem
The source of information may be one file for one condition, and a set of files for another. Logic (normally procedural) must be in place to detect the right source.
The complexity of integrating data increases greatly with the number of data sources being integrated. For example, if you are integrating data from two sources, there is a single point of integration where conflicts must be sorted. Integrate from three sources, and there are three points of conflict. Four sources provide six conflict points. In general, n sources give n(n - 1)/2 pairwise conflict points, so the problem grows quadratically with the number of sources.

Solution
This is a complex problem that requires the use of tools or well-documented transformation mechanisms. Try not to integrate all the sources in the first instance.
Start with two or three and then enhance the program to incorporate more sources. Build on your learning experiences.

Missing Values Problem
Solution
• Ignore
• Wait
• Mark rows
• Extract when time-stamped
If NULL then field = ‘A’

Duplicate Value Problem
Solution
• SQL self-join techniques
• RDBMS constraint utilities
SELECT … FROM table_a, table_b WHERE table_a.key (+) = table_b.key
UNION
SELECT … FROM table_a, table_b WHERE table_a.key = table_b.key (+)
Illustration: repeated ACME Inc records

Missing Values Problem
Null, missing, and default values are always an issue. NULL values may be valid entries where NULLs are allowed; otherwise, NULLs indicate missing values.

Solution
You must examine each occurrence of the condition to determine validity and decide whether these occurrences must be transformed; that is, identify whether a NULL is valid or invalid (missing data). You may choose to:
• Ignore the missing data. If the volume of records is relatively small, it may have little impact overall.
• Wait to extract the data until you are sure that missing values are entered from the operational system.
• Mark rows when extracted, so that on the next extract you can select only those rows not previously extracted. This involves the overhead of SELECT and UPDATE, and if the extracted data forms the basis of a summary table, the summary table needs re-creating.
• Extract data only when it is time-stamped as completed, rather than by business cycle.

Duplicate Value Problem
You need to eliminate duplicate values, which invariably exist. This can be time-consuming, although it is a simple task to perform.

Solution
You can use standard SQL self-join techniques or RDBMS constraint utilities to eliminate duplicates.

Element Names Problem
Solution
Diagram labels: Client, Customer, Contact, Name; CTAS; SQL*Loader

Element Meaning Problem
Diagram labels: Customer’s name; All details except name; All customer details; Customer_detail
• Avoid misinterpretation
• Complex solution
• Document meaning in metadata
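Returning to the duplicate value problem described earlier, a self-join can be sketched as follows. The example uses Python's built-in sqlite3 module and a hypothetical customer table rather than an Oracle database, and it treats two rows as duplicates only when their names match exactly; both choices are assumptions made to keep the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "Widget Co"), (4, "ACME Inc")],
)

# Self-join: a row is a duplicate if an earlier row carries the same name.
DUPLICATES = """
    SELECT DISTINCT later.id
    FROM customer AS later
    JOIN customer AS earlier
      ON later.name = earlier.name
     AND later.id > earlier.id
"""

duplicate_ids = sorted(row[0] for row in conn.execute(DUPLICATES))

# Remove the duplicates, keeping the earliest row for each name.
conn.execute(f"DELETE FROM customer WHERE id IN ({DUPLICATES})")
remaining = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()

print(duplicate_ids)
print(remaining)
```

Exact matching does not catch variant spellings of the same customer; those need the kind of standardization discussed under the name and address problem before a join like this is useful.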
Element Names Problem
Individual attributes, columns, or fields may vary in their naming conventions from one source to another. These variations need to be eliminated to ensure that one naming convention is applied to the value in the warehouse. If you are employing independent data marts, then you should ensure that the ETT solution is mirrored, so that if you later plan to employ the data marts dependently, they all refer to the same object.

Solution
You need to obtain agreement from all relevant user groups on renaming conventions, and rename the elements accordingly. Document the changes in metadata. The programs you use determine the solution. For example, if you are using SQL CREATE TABLE AS SELECT (CTAS), the new column name is used in that statement. If you use SQL*Loader as an intermediary mechanism prior to load, you create your destination object with the agreed naming convention applied.
Agreement on the name change and the meaning of the data can become a political issue between groups and departments in the organization.

Element Meaning Problem
Like the name of an element, the meaning is often interpreted differently by different user groups. The variations in naming conventions typically drive this misinterpretation. You need to keep your model independent of naming conventions that may be popular today, but subject to change.

Solution
It is a difficult problem, often political, but you must ensure that the meaning is clear. By documenting the meaning in metadata you can solve this problem, especially if the meaning is composed of several elements and algorithms have been used.
In order to take information from the operational system into the warehouse, you must know the meaning of the data.
This may involve rebuilding the transaction from its component parts (which are likely in a normalized state). You must know the:
• Business rules
• Processes executed for a type of transaction, such as the tables that are updated
This is a complex task, which may involve merging or separating data components, extracting values from multipart keys, and much more.

Input Format Problem
Examples: EBCDIC, ASCII; “123-73”, 12373; ACME Co.; ברוכים הבאים; Beer (Pack of 8)

Referential Integrity Problem
Solution
• SQL anti-join
• Server constraints
• Dedicated tools

Department table
Department
10
20
30
40

Employee table
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60

Input Format Problem
Input formats vary considerably. For example, one entry may accept alphanumeric data, so the format may be “123-73”. Another entry may accept numeric data only, so the format may be “12373”. You may also need to convert between EBCDIC and ASCII, or even convert complex character sets such as Hebrew, Arabic, or Japanese.

Solution
First, ensure that you document the original and the resulting formats.
Your program (or tool) must then convert those data types either dynamically or through a series of transforms into one acceptable format. You can use Oracle SQL*Loader to perform certain transformations, such as EBCDIC to ASCII conversions and assigning values to default or NULL values.

Referential Integrity Problem
If the constraints at the application or database level have in the past been less than accurate, child and parent record relationships can suffer; orphaned records can exist. You must understand data relationships built into legacy systems. The biggest problem encountered here is that they are often undocumented. You must gain the support of users and technicians to help you with analysis and documentation of the source data.

Solution
This is a simple cleaning task, but it is time-consuming and requires business experience to resolve the inconsistencies. You can use SQL anti-join query techniques, server constraint utilities, or dedicated tools to eliminate these inconsistencies.

Name and Address Problem
• No unique key
• Missing values
• Personal and commercial names mixed
• Different addresses for same member
• Different names and spelling for same member
• Many names on one line
• One name on two lines

            NAME                    LOCATION
Database 1  DIANNE ZIEFELD          N100
            HARRY H. ENFIELD        D589
            FRED AND SARA MULLEN    M300
Database 2  ZIEFLED, DIANNE         100
            ENFIELD, HARRY H        589
            MULLEN, SARA AND FRED   300

Name and Address Problem
• Single-field format
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
• Multiple-field format
Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565

Name and Address Problem
One of the largest areas of concern, with regard to data quality, is how name and address information is held, and how to transform it. Name and address information has historically suffered from a lack of legacy standards. This information has been stored in many different formats, sometimes dependent upon the software or even the data processing center used.

Usual Inconsistencies
Some of the following data inconsistencies may appear:
• No unique key
• Missing data values (NULLs)
• Personal and commercial names mixed
• Different addresses for same member
• Different names and spelling for same member
• Many names on one line
• One name on two lines
• The data may be in a single field of no fixed format:
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
Each component of an address may be in a specific field:
Mr. J. Smith
100 Main St.
Bigtown
County Luth
23565

Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques

Name and Address Problem (continued)
Solution
Name and address cleanup involves a series of complex processes that decompose and reassemble data. It can be broken down into a number of steps; those identified here represent just one example.
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565

Steps to Clean and Organize
1 Break the record down into atomic values, each of which has a description.
Description    Value
Title          Mr.
First Initial  J
Last Name      Smith
House Number   100
....           ....
2 Ensure that all elements appear in a standard format, so that St. in this example becomes Street. This element needs to be recoded, as do other similar elements, such as Rd and Cres.
3 Verify the accuracy of standard elements using data from external sources.
– Is Bigtown actually associated with this postal code?
– Is Bigtown in County Luth?
– Is County Luth associated with this postal code?
4 Check whether there are any other customers with the name Smith. If there are, verify whether the addresses are identical; if they are not, then one is probably the current address and others are old addresses. You probably have to refer to external data to check this. Mark records with notes such as previous and current.
5 Identify whether there is more than one customer record for any given address. You may find a Smith, and a Doe, and a Jones all at 100 Main Street. Are they all resident in the same house or apartment?
6 Document the results of these steps in metadata.
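Steps 1 and 2 above can be sketched for the single example record as follows. The regular expression, the field names, and the abbreviation table are all assumptions written for this one layout; records in other layouts simply fail to match and would need other rules or a dedicated tool.

```python
import re

# Assumed abbreviation table; a production table would be much larger.
STREET_TYPES = {"St": "Street", "Rd": "Road", "Cres": "Crescent"}

# Hypothetical pattern for the layout
# "Title. Initial. Last, Number Street Type., Town, County, Code".
RECORD_PATTERN = re.compile(
    r"^(?P<title>\w+\.)\s+(?P<first_initial>\w)\.\s+(?P<last_name>\w+),\s*"
    r"(?P<house_number>\d+)\s+(?P<street_name>.+?)\s+(?P<street_type>\w+)\.?,\s*"
    r"(?P<town>[^,]+),\s*(?P<county>[^,]+),\s*(?P<postal_code>\d+)$"
)


def parse_record(raw: str):
    """Step 1: break the record into atomic values; step 2: standardize."""
    match = RECORD_PATTERN.match(raw.strip())
    if match is None:
        return None  # unrecognized layout: leave for other rules or review
    parts = match.groupdict()
    street_type = parts["street_type"]
    parts["street_type"] = STREET_TYPES.get(street_type, street_type)
    return parts


parsed = parse_record("Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565")
unparsed = parse_record("FRED AND SARA MULLEN")
print(parsed)
print(unparsed)
```

Steps 3 to 5 depend on external reference data and on matching across records, so they are not attempted here.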
You can see from the complexity of even this simple example that this cleanup requires sophisticated software techniques, tools, or expert knowledge in coding the algorithms required to perform each step.

Merging Data
• Operational transactions do not usually map one-to-one with warehouse data
• Data for the warehouse is merged to provide information for analysis
Pizza sales/returns by day, hour, seconds
Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Merging Data
The five operational records are merged into three warehouse records:
Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
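The pizza example can be reproduced with a small Python sketch in which a return cancels the most recent sale of the same product for the same amount. That matching rule, and the record layout, are assumptions chosen so that the output matches the example; real merge rules come from the business.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    kind: str          # "Sale" or "Return"
    timestamp: str
    product: str
    amount: Decimal


def merge_transactions(transactions):
    """Drop each return together with the most recent matching sale."""
    merged = []
    for txn in transactions:
        if txn.kind != "Return":
            merged.append(txn)
            continue
        # Search backwards for a sale of the same product and amount.
        for i in range(len(merged) - 1, -1, -1):
            sale = merged[i]
            if (sale.kind == "Sale" and sale.product == txn.product
                    and sale.amount == -txn.amount):
                del merged[i]
                break
        else:
            merged.append(txn)  # unmatched return: keep it for review
    return merged


operational = [
    Transaction("Sale", "1/2/98 12:00:01", "Ham Pizza", Decimal("10.00")),
    Transaction("Sale", "1/2/98 12:00:02", "Cheese Pizza", Decimal("15.00")),
    Transaction("Sale", "1/2/98 12:00:02", "Anchovy Pizza", Decimal("12.00")),
    Transaction("Return", "1/2/98 12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    Transaction("Sale", "1/2/98 12:00:04", "Sausage Pizza", Decimal("11.00")),
]
warehouse = merge_transactions(operational)
for txn in warehouse:
    print(txn)
```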
Transformation Techniques
Merging Data
An operational transaction does not usually have a one-to-one mapping with data in the warehouse, even if the data in the warehouse is maintained at the transaction level.
For example, consider a sales transaction in a store. The logical transaction comprises a number of components such as date of sale, charge amount, number of items, discount amount, and payment method. The transaction may even be a return.
A customer purchase and a customer return are very different types of sales transactions, and different business rules must apply. For each different transaction a different process occurs. A purchase depletes inventory and a return adds stock back into inventory.
As a result, the data you keep in the warehouse is held purely for reporting purposes, and these transactions are merged into data that is useful for that purpose. The data will not, in the end, map strictly to sales or returns.

Adding a Date Stamp
• Enables time analysis
• Label loaded data with a date stamp
• Add time to fact and dimension data

Adding a Date Stamp
Product Table: Product_id, Time_key, Product_desc
Store Table: Store_id, District_id, Time_key
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Time Table: Week_id, Period_id, Year_id, Time_key
Item Table: Item_id, Dept_id, Time_key
Adding a Date Stamp
Time is important within the data warehouse. You have already looked at the time dimension, which is always created in the warehouse in order to provide reporting by time periods.
Extracted source data probably does not contain time information, because operational systems do not typically time-stamp information (unless of course they too are maintaining history, or time is a critical component). More likely the record in the operational system has a value associated with it, such as Order_date, Ship_date, or Call_date.
Therefore it is important to consider how you are going to add a time element to your warehouse data. This is particularly important for two areas of the warehouse:
• Fact tables that hold vast amounts of data used to analyze the business according to time periods
• Dimension data containing criteria by which you perform the analysis
You need to consider how to manage time for both of these areas, in slightly different ways.

Adding a Date Stamp
• Fact table
– Add triggers
– Recode applications
– Compare tables
• Dimension table
• Time representation
– Point in time
– Time span
Adding a Date Stamp (continued)
Fact Table Data
Imagine that you need to add the next set of records from the source systems to your fact table. You need to determine which records are to be moved into the fact table. You have added data for March 1998. Now you need to add data for April 1998. You need to find a mechanism to stamp records so that you pick up only April 1998 records for the next refresh. You might choose from a number of techniques:
• Code application or database triggers at the operational level to time-stamp data, which can then be extracted using date selection criteria.
• Perform a comparison of tables, original and new, to identify differences.
• Maintain a table containing copies of changed records to be loaded.
You must decide which are the best techniques for you to use according to your current system implementations. These are discussed in greater detail later in the course.

Dimension Table Data
Dimensions also change, and there are many different techniques you can employ to trap changes. Some of these were identified earlier with fact tables.

Time Representation
The time may be represented as:
• A single point-in-time date
• A date range (start and end date)
The time element must either be available in the data before loading into the warehouse, or added when loading the data.
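Two of the techniques listed above, selecting by time stamp and comparing tables, can be sketched in Python as follows. The column names stamped_on and load_date, and the dictionary-based rows, are hypothetical; the sketch assumes that the operational rows already carry a time stamp.

```python
from datetime import date


def select_for_refresh(rows, period_start, period_end):
    """Pick rows time-stamped on or after period_start and before period_end."""
    return [r for r in rows if period_start <= r["stamped_on"] < period_end]


def add_date_stamp(rows, load_date):
    """Label each row with a point-in-time date stamp when loading."""
    return [{**r, "load_date": load_date} for r in rows]


def changed_rows(original, new):
    """Compare two snapshots keyed by id; return rows that are new or differ."""
    return {key: row for key, row in new.items() if original.get(key) != row}


source_rows = [
    {"order_id": 1, "stamped_on": date(1998, 3, 31)},
    {"order_id": 2, "stamped_on": date(1998, 4, 1)},
    {"order_id": 3, "stamped_on": date(1998, 4, 30)},
    {"order_id": 4, "stamped_on": date(1998, 5, 1)},
]
april = select_for_refresh(source_rows, date(1998, 4, 1), date(1998, 5, 1))
stamped = add_date_stamp(april, date(1998, 5, 2))

before = {10: {"district": "North"}, 20: {"district": "South"}}
after = {10: {"district": "North"}, 20: {"district": "East"}, 30: {"district": "West"}}
delta = changed_rows(before, after)

print(stamped)
print(delta)
```

A date range representation would carry two columns, a start date and an end date, instead of the single load_date shown here.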
Adding Keys to Data
#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Data values or artificial keys
#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Adding Keys to Data
You are moving the data from one structure, with its keys defining relationships, into another that is totally different and must also have keys defining relationships. The transformation of this data also includes adding keys (generalized or artificial) or creating keys from existing data values.
Note: Creating keys is discussed in more detail later in the course.

Summarizing Data
• During extraction on staging area
• After loading onto the warehouse server
Diagram labels: Operational databases, Staging area, Warehouse database
Creating Summary Data
Creating summary data is essential for a data warehouse to perform well. Here it is classified under transformation only because you are changing the way the data exists in the source system into something else for the data warehouse. In reality, the summary data is usually created on the warehouse server after transformation.

Summarizing Data
You can summarize the data:
• At the time of extraction in batch routines. This reduces the amount of work performed by the data warehouse server, as all the effort is concentrated on the source systems. However, summarizing at this time increases:
  – The complexity and time taken to perform the extract
  – The number of files created
  – The number of load routines
  – The complexity of the scheduling process
• After the data is loaded into the warehouse database. The process queries the fact data, summarizes it, and places it into the requisite summary fact table. This method reduces the complexity and time taken for the extract tasks. However, it places all the CPU and I/O intensive work on the warehouse server, thus increasing the time that the warehouse is unavailable to the users.

You should weigh the benefits of each method and determine your strategy according to your requirements and resources.
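The second option above (summarizing after the load) can be sketched in a few lines of Python. This is only an illustration: the row layout, the sample values, and the names FACT_ROWS and summarize_by_day are assumptions made for the example, not part of the course material, and a real warehouse would normally do this work with SQL on the server.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical detail fact rows: (sale_date, product, amount).
FACT_ROWS = [
    ("1998-01-02", "Ham Pizza", Decimal("10.00")),
    ("1998-01-02", "Cheese Pizza", Decimal("15.00")),
    ("1998-01-02", "Sausage Pizza", Decimal("11.00")),
    ("1998-01-03", "Ham Pizza", Decimal("10.00")),
]


def summarize_by_day(rows):
    """Roll detail fact rows up to one summary row per day."""
    totals = defaultdict(lambda: {"sales_count": 0, "sales_amount": Decimal("0")})
    for sale_date, _product, amount in rows:
        totals[sale_date]["sales_count"] += 1
        totals[sale_date]["sales_amount"] += amount
    return dict(totals)


# The result plays the role of the summary fact table.
DAILY_SUMMARY = summarize_by_day(FACT_ROWS)
```

The detail rows stay untouched; the summary is a separate, smaller structure that queries can read instead of scanning every fact row.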
Maintaining Transformation Metadata
Contains transformation rules, algorithms, and routines
[Figure: Sources, Stage, Rules, Extract, Transform, Publish, Load, Query]

Maintaining Transformation Metadata
• Key restructuring
• Coding differences
• Multiple sources
• Exception rules
• Format differences
• Referential integrity fixes
• Aggregated data

Maintaining Transformation Metadata
As with the extraction process, metadata must be maintained for the transformation process.
• Information on how to perform key restructuring
• Logic to eliminate different coding methods and data values, parsing rules
• Logic to detect multiple source files
• Logic and exception rules to handle NULL, negative values, and default values and to eliminate and consolidate duplicate values
• Element renaming conventions
• Granularity conversions, input or language formats, conversion algorithms, and data standardization rules
• Referential integrity fixes
• Logic and program names used to create summary data
• Transformation frequency, program name, location, failure procedures, and validation
• Temporary extraction storage location, name, and source contact

The metadata also contains information about the frequency of program execution. Data repair usually involves using simple algorithms or more complex artificial intelligence programs to correct data.
Note: There is a lesson dedicated to metadata later in the course.

Data Ownership and Responsibilities
• Operational and application development teams
• Data warehouse development team
• Business benefit gained with a one-team approach

Data Ownership and Responsibilities
Ownership
The data extracted from the source systems is often under the control and ownership of application development teams who have been working with the operational data since its inception. The loading of the data into the warehouse is usually under the control of the data warehousing development team. This raises the question of who is responsible for the transformation of the data: the process between extracting the data and loading it into the warehouse.

Working as One Team
The two teams, those responsible for operational data and those responsible for warehouse data, must work together. Doing so brings all the required knowledge together and produces the best solution.
Working together improves understanding, knowledge, and teamwork, and levels the roles within the groups.
• The operational team may be critical to ensuring the success of the data extraction and providing the data warehouse team with extract files in requisite formats (for example C, COBOL, PL/SQL).
• The data warehouse team can then take on the task of making sure the extracted data is accurate and of sufficiently high quality for the warehouse.

If there is a need to reconsider how the operational data is entered (stored at the database level), to improve the ease of creating extracts and the quality of extract data, then teamwork and understanding of each other’s areas are critical.

Transformation Timing and Location
• Transformation is performed:
  – Before load
  – In parallel
• May be initiated at different points
[Figure: transformation points labeled Unlikely, Probable, and Possible]

Transformation Points
You need to consider carefully when and where you perform transformation.
You must perform transformation before the data is loaded into the warehouse, and in parallel; on larger databases, there is not enough time to perform this process as a single-threaded process. Consider the different places and points in time where transformation may take place.

On the Operational Platform
This approach transforms the data on the operational platform, where the source data resides. The negative impact of this approach is that the transformation operation conflicts with the day-to-day working of the operational system. If it is chosen, the process should be executed when the operational system is idle or less utilized. The impact of this approach is so great that it is very unlikely to be employed.

In a Separate Staging Area
This approach transforms data on a separate computing environment, the staging area, where summary data may also be created. This is a common approach because it does not affect either the operational or warehouse environment. Cleaning, merging, and removal of anomalies are handled in the staging area, and summary creation may take place:
• On the staging server
• On the warehouse server

On the Warehouse Server
You may consider performing transformations on the warehouse server itself. However, this may affect the effectiveness of the server for query access. It is more likely that you transform away from the warehouse server.

Choosing a Transformation Point
• Workload
• Environment impact
• CPU use
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs
Monitoring and Tracking
Transforms should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions

Choosing a Transformation Point
The approach you choose depends upon operational requirements. You must balance many different factors in order to determine the best solution. Consider:
• The actual workload (time to complete) of the transformations needed to provide the data for the warehouse
• The physical impact on each of the environments you might choose. (This is particularly relevant if you choose to use the operational platform.)
• The available CPU and disk space (for temporary and intermediate data and file store) on each environment
• The available network bandwidth between environments, affecting transfer volumes
• Whether the environment is capable of working in a parallel manner
• The load window time constraints
• The information needs of the business user. (When do they need this data? How often do refreshes occur?)

Monitoring and Tracking
The transformations should be self-documenting, should generate summary statistics, and should be able to process exceptions.
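The three monitoring requirements can be illustrated with a small Python sketch. The standardization rule below (strip quotes, spaces, and hyphens, then uppercase) is a hypothetical rule chosen to resemble the part-code variants pictured on the slide; the class and function names are likewise assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TransformStats:
    """Summary statistics collected during one transformation run."""
    rows_read: int = 0
    rows_written: int = 0
    rows_rejected: int = 0
    rejects: list = field(default_factory=list)


def standardize_part_code(raw):
    """Hypothetical rule: strip quotes, spaces, and hyphens, then uppercase."""
    code = raw.strip().strip('"“”').replace(" ", "").replace("-", "").upper()
    if not code:
        raise ValueError("empty part code")
    return code


def run_transform(rows):
    """Transform every row, counting successes and recording rejects."""
    stats = TransformStats()
    output = []
    for raw in rows:
        stats.rows_read += 1
        try:
            output.append(standardize_part_code(raw))
            stats.rows_written += 1
        except ValueError as exc:
            # Exceptions are logged with the offending value, not dropped silently.
            stats.rows_rejected += 1
            stats.rejects.append((raw, str(exc)))
    return output, stats
```

Calling run_transform(["12 M 65431", "12-m-65421", '" "']) would return two standardized codes plus a statistics record showing one rejected row, so that the run can be reviewed afterward.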
Designing Transformation Processes
• Analysis:
  – Sources and target mappings, business rules
  – Key users, metadata, grain
• Design options: PL/SQL, replication, custom, third-party tools
• Design issues:
  – Performance
  – Size of the staging area
  – Exception handling, integrity maintenance

Designing Transformation Processes
When designing your transformation processes, consider the analysis issues, the design options available to you, and the design issues.

Analysis
• Source and target mappings
• Business rules
• Key users
• Metadata
• Granularity of the fact data and summaries

Design Options
• PL/SQL
• Replication
• Custom 3GL programs
• Third-party tools

Design Issues
• Performance and throughput
• Sizing the staging areas to hold the data to be loaded into the warehouse
• Exception handling
• Integrity maintenance

Transformation Tools
• Purchased
• SQL*Loader
• In-house developed
Transformation Tools
Many of the purchased transformation tools perform extraction as well. The choice of transformation tool may already have been decided when you chose the extraction tool. However, transformation can be performed by:
• Tools purchased from specialized vendors, both third-party and Oracle
• SQL*Loader. This is an Oracle product that is commonly used to transport large volumes of data into the warehouse tables. It can also provide you with simple data transformations, such as multiple records becoming a single record, or conversely a single record at source becoming multiple records for the data warehouse.
• In-house developed programs and procedures using 3GL products such as C, C++, COBOL, or 4GL products such as SQL and PL/SQL.

The DECODE SQL function can be used to test a value and change it to another value. For example, change “M” and “F” to “Male” and “Female”. DECODE is fast, because it is a SQL set processing function and takes advantage of parallel processing. You should be aware that PL/SQL does not take advantage of parallel processing capabilities and is slower than DECODE because it processes row by row.
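DECODE itself is an Oracle SQL function and runs inside the database. The Python sketch below only mimics its argument shape, DECODE(expr, search1, result1, ..., default), to show the value-for-value substitution described above; the function name decode and the sample column are assumptions made for the example, and the sketch says nothing about DECODE's performance.

```python
def decode(value, *args):
    """Mimic the shape of DECODE(expr, search1, result1, ..., [default]).

    An odd number of trailing arguments means the last one is the default;
    with no default and no match, None is returned.
    """
    if len(args) % 2:
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if value == search:
            return result
    return default


# Apply the same mapping to a whole (hypothetical) source column.
SOURCE_GENDER_CODES = ["M", "F", "F", "X"]
DECODED = [decode(code, "M", "Male", "F", "Female", "Unknown")
           for code in SOURCE_GENDER_CODES]
```

The explicit default ("Unknown" here) is a design choice worth keeping in a real transformation, because it makes unexpected source codes visible instead of silently turning them into missing values.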
Data Management, Quality, and Auditing Tools
• Data management:
  – Innovative Systems
  – Postalsoft
  – Vality Technology
• Data quality and auditing:
  – Innovative Systems
  – Vality Technology

Data Management, Quality, and Auditing Tools

Management Tools
WTI Partner                Product
Innovative Systems, Inc.   Innovative Warehouse
Postalsoft, Inc.           Address Correction and Encoding (ACE)
Vality Technology, Inc.    Integrity Data Re-engineering Tool

Quality and Auditing Tools
WTI Partner                Product
Innovative Systems, Inc.   ISI Analyzer System
Vality Technology, Inc.    Integrity Data Re-engineering Tool

Summary
This lesson discussed the following topics:
• Importance of data quality
• Transformation process
• Data transformation issues
• Data anomalies
• Name and address management
• Tools
Summary
This lesson addressed the following topics:
• The importance of data quality in the warehouse
• The transformation process
• Transformation issues
• Anomalies that may exist in legacy systems
• Name and address management
• Tools available for extraction, transformation, and data quality

Practice 11-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Specifying true or false to a series of statements

Practice 11-1
1 Dirty data must be eliminated for the data warehouse. Name three alternative and common terms used to describe the process of eliminating anomalies in data.
_____________________
_____________________
_____________________
2 Name at least five problems associated with source data that must be eliminated for the data warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
3 Identify whether the following statements are true or false.
Question (True / False)
It is considered impractical to eliminate data anomalies after the pilot run.
You need to consider adding time keys to warehouse data.
Transformation can be performed before or after data is loaded into the warehouse.

Lesson 12: Transportation: Loading Warehouse Data

Overview
[Figure: course map with the blocks Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Planning Warehouse Storage; Choosing a Computing Architecture; Meeting a Business Need; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Managing the Data Warehouse; Supporting End User Access; Project Management (Methodology, Maintaining Metadata)]

Objectives
After completing this lesson, you should be able to do the following:
• Explain key concepts in transporting data into the warehouse
• Outline how to build the transportation process for first-time load
• Identify transportation techniques
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes
Overview
In the last two lessons, you examined extraction and transformation issues. In this lesson, you examine how the extracted and transformed data is transported into the warehouse as the first-time loading of data.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives
At the end of this lesson, you should be able to:
• Explain key concepts in transporting data into the warehouse
• Outline how to build the transportation process for the first-time load
• Identify transportation techniques
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes

Transporting Data into the Warehouse
• Loading moves the data into the warehouse
• Loading can be time-consuming:
  – Consider the load window.
  – Schedule the task; automate all processes.
• Initial load moves large volumes
• Subsequent refresh moves smaller volumes
• Business determines the cycle
[Figure: Operational System, Extract, Data Staging Area, Transform, Transport (load), Warehouse]
Transporting Data into the Warehouse
Transportation Tasks
The transportation process moves data from source data stores or an intermediate staging area and loads it into the target warehouse database on the target system server. This process comprises a series of actions, such as moving the data and loading data into tables. There may also be some processing of objects after the load, often referred to as postload processing.

Moving and Loading Data
Moving and loading the data can be a time-consuming task, depending upon the volumes of data, the hardware, the connectivity setup, and whether parallel operations are in place. The time period within which the warehouse system can perform the load is called the load window. Loading should be scheduled and prioritized. You should also ensure that the loading is automated as much as possible.

Types of Data Load
There is a single first-time load that moves large volumes of data when the warehouse is implemented. The first-time load is followed by regular refreshes of the warehouse with smaller volumes of data, the grain and frequency of which are determined by the business user requirements.
Extract Processing Environment
• After each time interval, build a new database
• Run queries
[Figure: Operational databases at intervals T1, T2, T3]

Warehouse Processing Environment
• Build a new database
• After each time interval, add changes to database
• Archive or purge oldest data
• Run queries
[Figure: Operational databases at intervals T1, T2, T3]

Data Refresh Models
First, to ensure that you understand how the warehouse data presentation differs from nonwarehouse data presentation, consider how up-to-date data is presented to users in two different decision support environments: a simple extract processing environment and a data warehouse environment.

Extract Processing Environment
A snapshot of operational data is taken at regular time intervals: T1, T2, and T3. At each interval a new snapshot of the database is created and presented to the user; the old snapshot is purged.

Warehouse Environment
An initial snapshot is taken and the database is loaded with data. At regular time intervals, T1, T2, and T3, a delta database or file is created and the warehouse is refreshed. A delta contains only the changes made to operational data that need to be reflected in the data warehouse.
• The warehouse fact data is refreshed according to the refresh cycle determined by user requirements analysis.
• The warehouse dimension data is updated to reflect the current state of the business, only when changes are detected in the source systems.
• The older snapshot of data is not removed, ensuring that the warehouse contains the historical data needed for analysis.
• The oldest snapshots are archived or purged only when the data is not required any longer.

First-Time Load
• Single event that populates the database with historical data
• Involves large volume of data
• Employs distinct ETT tasks
• Involves large amounts of processing after load
[Figure: Operational databases at intervals T1, T2, T3]

Refresh
• Performed according to a business cycle
• Simpler task
• Less data to load than first-time load
• Less-complex ETT
• Smaller amounts of postload processing
[Figure: Operational databases at intervals T1, T2, T3]

First-Time Load and Refresh
First-Time Load
The first-time load (sometimes called an initial load) is a single event that occurs prior to implementation. It populates the data warehouse database with as much data as needed or available. The first-time load moves data in the same way as the regular refresh.
However, the complexity of the task is made greater due to:
• Data volumes that may be very large. (Your company decides to load the last five years of data, which may comprise millions of rows. The time taken to load the data may be in days rather than hours.)
• Distinct extraction and transformation tasks that are applicable only to this older data
• The task of populating all fact tables, all dimension tables, and any other ancillary tables you may have created, such as reference tables
• Postprocessing of loaded data, with tasks that must work on the large data volumes, such as indexing and key generation
• Postload processing on large volumes of data, such as creating summary tables

With all the issues surrounding the first-time load, it is not a task to be taken lightly. You must plan, prepare, and have recovery capabilities built in to your processing routines to ensure success.

Refresh
After the first-time load, the refresh is performed on a regular basis according to a cycle determined by users. The cycle may be daily, weekly, monthly, quarterly, or any other business period. The refresh is a simpler task than the first-time load for these reasons:
• There is less fact data to load. You are moving a new snapshot of data but not all fact data into the data warehouse.
• There is no dimension data to load (unless your model has changed, which would be an exception). There may be some dimensional data changes to incorporate.
• Less-complex extraction and transformation processes may be needed. Additionally, because these processes are executed regularly, they can be monitored, tested, and improved for each refresh until they run as optimally as possible.
• Postload processing time is reduced because there is less new data to work with.
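The difference between the first-time load and a refresh can be sketched as follows. The sketch assumes that fact rows are insert-only and identified by a hypothetical "key" field, and that snapshot dates are ISO-formatted strings so they compare correctly as text; none of these details come from the course, and dimension updates are deliberately left out.

```python
def apply_delta(warehouse, delta):
    """Add only the delta rows not already present; existing history stays."""
    known_keys = {row["key"] for row in warehouse}
    new_rows = [row for row in delta if row["key"] not in known_keys]
    warehouse.extend(new_rows)
    return len(new_rows)


def purge_older_than(warehouse, cutoff):
    """Remove snapshots older than the cutoff and return them for archiving."""
    archived = [row for row in warehouse if row["snapshot"] < cutoff]
    warehouse[:] = [row for row in warehouse if row["snapshot"] >= cutoff]
    return archived


# First-time load: the warehouse starts with the historical data.
warehouse = [{"key": 1, "snapshot": "1998-01-01", "amount": 10}]

# Refreshes at T1 and T2 carry only the changes since the previous load.
added_t1 = apply_delta(warehouse, [{"key": 2, "snapshot": "1998-01-02", "amount": 15}])
added_t2 = apply_delta(warehouse, [
    {"key": 2, "snapshot": "1998-01-02", "amount": 15},  # already loaded, skipped
    {"key": 3, "snapshot": "1998-01-03", "amount": 11},
])

# The oldest data leaves the warehouse only when it is no longer required.
archived = purge_older_than(warehouse, "1998-01-02")
```

Each refresh touches far fewer rows than the initial load, which is why its load and postload processing are lighter.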
Building the Transportation Process
Specification
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth

Building the Transportation Process
• Test the proposed technique
• Document proposed load
• Gain agreement on the process
• Monitor
• Review
• Revise

Building the Transportation Process
Specifying the Process
You need to identify early on in the development process how you are going to move the data from the source systems into the data warehouse. You must identify:
• The data movement techniques and tools available
• File transfer methods and transfer models available
• The time available to load the data into the warehouse (the load window)
• Whether the time window is sufficient for other tasks such as backup, preventative maintenance, and recovery, given expected performance metrics
• The volumes of data involved in the first-time load and subsequent refreshes
• The frequency of the refresh cycle and the grain of the data
• Connectivity bandwidth

Testing the Process
You should test the proposed technique to ensure that volumes can be physically moved within the load window constraints and network capabilities.
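Before a physical test, a rough feasibility estimate can be made from the figures gathered during specification. The sketch below is a back-of-the-envelope calculation only: the function names and sample numbers are hypothetical, and it assumes throughput scales linearly with the number of parallel streams, which real systems rarely achieve exactly.

```python
def estimated_load_hours(volume_gb, throughput_gb_per_hour, parallel_streams=1):
    """Estimate load time, assuming linear scaling across parallel streams."""
    if throughput_gb_per_hour <= 0 or parallel_streams < 1:
        raise ValueError("throughput must be positive and streams at least 1")
    return volume_gb / (throughput_gb_per_hour * parallel_streams)


def fits_load_window(volume_gb, throughput_gb_per_hour, window_hours,
                     other_tasks_hours=0.0, parallel_streams=1):
    """Check that the load plus other tasks (backup, maintenance) fit the window."""
    load_hours = estimated_load_hours(volume_gb, throughput_gb_per_hour,
                                      parallel_streams)
    return load_hours + other_tasks_hours <= window_hours
```

For example, 120 GB at an assumed 20 GB per hour takes an estimated 6 hours, which fits an 8-hour window alongside 1.5 hours of backup but not alongside 3 hours. Such an estimate narrows the options; it does not replace the test with real volumes.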
Documenting the Process
You must document the proposed load and communicate it to the operations organization to ensure their agreement and commitment to this important process.

Monitoring, Reviewing, and Revising the Process
You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision.

[Slide: Granularity]
• Important design and operational issue
• Space requirements
  – Storage
  – Backup
  – Recovery
  – Partitioning
  – Load
• Low-level grain
  – Expensive, high level of processing, more disk, detail
• High-level grain
  – Cheaper, less processing, less disk, little detail

Granularity
You have seen that the grain of the data is important in the warehouse environment. The lower the level of granularity, the more data is loaded, and this affects the amount of time taken to load the data into the warehouse.

Low-Level Grain
Low-level grain data can be expensive to build and maintain. It requires a large amount of processing power to process the details and provide answers to business queries.
It takes up more disk space and could create response time problems. However, the detail provides the information needed at a low level to support sophisticated business analysis.

High-Level Grain
High-level grain data is easier to build and maintain than low-level grain data. It requires less processing power and disk space, allows a higher number of concurrent users to access data, and performs well. However, the lack of detail and drill-down capability hinders definitive answers to business questions.

Note: The level of granularity affects not only the amount of direct access storage devices (DASD) required for warehouse data, but also the amount of space required for backup, recovery, and partitioning.

[Slide: Transportation Techniques]
• Tools
• Utilities and 3GL
• Gateways
• Customized copy programs
• Replication
• FTP
• Manual

Transporting the Data
Now that you have seen how to capture the data needed for the refresh, consider how to physically move the data to the warehouse server.
Transportation Techniques
These common techniques are used to transport data into the warehouse:
• Purchased ETT tools
• Proprietary data movement utilities that use COBOL, C, or Oracle SQL*Loader, for example. The fastest way to load large amounts of data into the warehouse is to use utilities such as SQL*Loader that can access the database directly, use networks efficiently, and run in parallel environments.
• Gateways, which may be vendor-specific or programmable, such as the Oracle Transparent Gateways
• Customized copy programs, which may employ COBOL, C, PL/SQL, and FTP
To a lesser degree, these are also solutions:
• Replication (database)
• File Transfer Protocol (FTP) alone
• Manual shipping of the load medium to the data warehouse site

[Slide: Transportation Technique Considerations]
• Tools are comprehensive but costly.
• Data-movement utilities are fast and powerful.
• Gateways are not always the fastest method:
  – Access other databases
  – Supply dependent data marts
  – Support a distributed environment
  – Provide real-time access if needed
Transportation Technique Considerations
Purchased ETT Tools
If your IT group has decided to use a customized ETT tool, then it becomes the means by which your data is transported, as well as extracted and transformed. This is not the most common option, particularly for early implementations. Often, because of the cost, copy utilities are the logical alternative.

Data-Movement Utilities
Oracle implementations use SQL*Loader, which is capable of executing in parallel environments, running in a mode where server intervention is minimized, and performing limited transformations, such as merging rows and changing data types. SQL*Loader can load very large volumes of data in a relatively short time, and you can use it successfully for both first-time load and refreshes.

Gateways
A gateway is a middleware component that presents a unified view of data coming from different data sources. Of note are Oracle Transparent Gateways (or Procedural Gateways) and Open Database Connectivity (ODBC) tools, which present a uniform view of a database other than an Oracle database, or of a file on specific file systems. Some Oracle gateways are read-only, while others are read-write.

Access to Another Vendor’s Database
You should consider using gateway technology in specific instances only, and not on a regular basis. For example, gateway technology allows you to access a database that is not an Oracle database directly, without executing the usual extract programs. If the access is a simple SQL SELECT to retrieve data that is to be processed for the warehouse, this is faster than building a specific extract for the task.

Develop a Distributed Environment
Gateway technology also gives you the ability to develop warehouses on distributed environments, employing technologies (hardware and software) that are not Oracle-specific.

Real-Time Data Access
It is rare, but there are some data warehouse implementations that are updated in real time.
In this situation gateway technology is useful because of the ease of executing remote queries. Consider using gateway technology for this purpose only if it is specifically requested and you can justify it.

[Slide: Using SQL*Loader to Load Data]
Diagram labels: Input files, Control file, SQL*Loader, Log files, Bad files, Discard files
• Fastest load mechanism
• Direct path
• Parallel and unrecoverable
• Direct-load INSERT (Oracle8)
• Direct-path load API (Oracle8i)

Using SQL*Loader to Load Data
The fastest way to load data is to use SQL*Loader with the direct path, parallel, and unrecoverable options.

Direct Path Load
Direct path load is optimized for maximum data loading capability. Instead of filling a bind array buffer and creating INSERT commands, direct path loads create data blocks in Oracle database block format. The blocks are then written directly to the database. A direct path load still makes calls to Oracle, but they are quick and are handled at the start and end of the load process. One direct path load can occur on a table at any one time.

Direct-Path Load in Parallel
You can run direct path loads in parallel. Parallel loading can load massive amounts of data in short time frames. Use the PARALLEL parameter.
Note that conventional path load has the ability to perform parallel loads on the same table, just like any other program or utility that uses SQL INSERT statements.

Direct-Path Load in Parallel and Unrecoverable
To avoid bottlenecks on redo logs, switch on the UNRECOVERABLE option of SQL*Loader. There is no need to write changes to redo logs in this environment.

Direct-Load INSERT
In Oracle8, direct-load INSERT enhances performance during insert operations by formatting and writing data directly into Oracle data files without using the buffer cache. It has benefits over direct path load:
• A single failure in one of the parallel load streams does not flag the process to stop.
• The data is in Oracle format, so the load does not have to convert data.
• It does not log redo information and can work in parallel.

Direct-Path Load API
Oracle8i provides an application programming interface (API) to the direct-path load mechanism in the Oracle Server. This API is described below.

[Slide: Direct-Path Load API in Oracle8i]
Diagram label: Load utility
• Allows ETT and other tools to load Oracle databases efficiently
• Permits load behavior to be customized
• Gives direct-path load performance
• Provides complete access to all direct-load functionality using OCI
Using Direct-Path Load API in Oracle8i
Oracle8i provides an application programming interface to the direct path load mechanism in the Oracle server. This provides a way for independent software vendors and system management tool partners to create easy-to-use and high-performance customized data-loading tools. Access to all load functionality is available through the API. Performance of any third-party data loading tool can therefore be comparable to SQL*Loader.

[Slide: More Transportation Technique Considerations]
• Use customized programs as a last resort
• Replication is limited by data-transfer rates

More Transportation Technique Considerations
Customized Programs
If you are employing Oracle for your warehousing environment, SQL*Loader is recommended. Use customized programs only as a last resort.

Replication
Replication is rarely used in a data warehouse environment, because of the limitations of data-transfer rates.
It is normal to use SQL*Loader or in-house-developed loading techniques. If replication is used, it is more likely to be used to feed data marts from a larger warehouse.

Note: Replication is not recommended for moving large volumes of data.

[Slide: Postprocessing of Loaded Data]
Diagram labels: Extract, Transform, Transport, Loaded data, Postprocessing of loaded data (Create indexes, Generate keys, Summarize, Filter)

Postprocessing of Loaded Data
You have now seen how to extract data to an intermediate file store or staging area, where it is:
• Transformed into acceptable warehouse data
• Transported to the warehouse server
You have also seen how the ETT process is slightly different for:
• First-time load, which requires all data to be loaded once
• Refreshing, which requires only changed data to be loaded
You now need to consider the different tasks that might take place once the data is loaded. There are various terms used for these tasks. In this course the chosen term is postprocessing.
The postprocessing tasks are not definitive; you may or may not have to perform them, depending upon the volumes of data moved, the complexity of transformations, and the transportation mechanism.

For example, it is possible to load data using SQL*Loader in a manner that excludes database trigger processing. However, at the warehouse server you do want to ensure the triggers are executed so that the integrity and validity of data are retained. This is referred to as postprocessing.

Four categories of postprocessing are explored in the following sections:
• Creating indexes
• Creating keys
• Creating summary tables
• Filtering

[Slide: Indexing Data]
• Before load: fast index reenablement
• During load: adds time to load window
• After load: adds time to load window
Diagram labels: Operational databases, Staging file, Index, Warehouse database

[Slide: Unique Indexes]
• Disable constraints to load
• Enable constraints to create index
Diagram steps: Disable constraints, Load data, Enable constraints, Create index, Catch errors, Reprocess

Indexing Data
Before
Indexing of data may occur prior to load.
You can index the data values for the warehouse after data cleansing and before transportation and load. You can retrieve the data from a presorted list of values much more rapidly by reading the index, rather than performing a full-table scan. This makes it easier to reenable indexes at the server level. However, this is not done very often.

During
It is possible to create the indexes at the same time as loading the data, using the usual techniques employed by the server. However, this is a row-by-row approach to index creation, which lengthens the time to load data. In most cases the time taken is too long, and for this reason the next option is preferable.

After
It is common to index after the data has been loaded into the warehouse. This adds time to the load window, but it is much faster than row-by-row processing, and in a parallel environment you can speed up index creation by indexing in parallel.

Unique Indexes
If the index you are creating enforces unique values in key columns through database constraints, then it is usual to load the data with the database constraints disabled, and then enable the constraints. Then you build the index, which may find duplicate values and fail. Ensure that the action catches the errors so that you can correct and reindex. Using SQL, you can employ the EXCEPTIONS INTO clause to catch errors.
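The idea of catching duplicate key values before a unique index is built can be illustrated outside the database. The following Python sketch is not the Oracle EXCEPTIONS INTO mechanism; it is a plain in-memory analogue, and the function name, the row layout, and the column name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_keys(rows, key_columns):
    """Return key values that occur more than once in the loaded rows.

    rows is a list of dictionaries, one per loaded record.
    key_columns names the columns that the unique index would cover.
    The result maps each offending key to the positions of its rows,
    which plays the role of an exceptions list to correct and reprocess.
    """
    positions_by_key = defaultdict(list)
    for position, row in enumerate(rows):
        key = tuple(row[column] for column in key_columns)
        positions_by_key[key].append(position)
    return {
        key: positions
        for key, positions in positions_by_key.items()
        if len(positions) > 1
    }
```

For rows with customer_id values 1, 2, 1, calling find_duplicate_keys(rows, ["customer_id"]) reports key (1,) at positions 0 and 2, so those records can be corrected before the index is rebuilt.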
[Slide: Creating Artificial Keys]
• Use generalized or derived keys
• Maintain the uniqueness of a row
• Use an administrative process to assign the key
• Concatenate operational key with number:
  – Easy to maintain
  – Cumbersome keys
  – No clean value for retrieval
Example: 109908 becomes 10990801

[Slide: Creating Unique Keys for Records]
• Assign a number from a list:
  – No semantic meaning
  – Extract operations must reference table to assign numbers
• Update metadata
• Verdict
Example: 109908 becomes 1

Creating Artificial Keys
An artificial (generalized or derived) key may be used to guarantee that every row in the table is unique. The warehouse data is likely to be a combination of many transformed records, for which there are no natural data keys to use as unique identifiers.

Concatenate Operational Key with a Number
Your postprocessing program executes the create index commands and allocates the key values, which may be a concatenation of the primary key and version digits or characters. For example, if a customer record key value contains six digits, such as 109908, the derived key may be 10990801. The last two digits are the sequential number generated automatically.

Advantage
The advantage of this method is that it is relatively easy to set up and maintain the necessary programs to manage number allocation.

Disadvantage
The disadvantages of this method are that:
• The keys may become long and cumbersome.
• There is no clean key value for retrieval of a record, unless you have another copy of the key.
For example, if the operational Customer_Id is 109908 but the warehouse key is now 10990801, then extracting information about that customer from the warehouse using 109908 is impossible, unless the old value has been retained in another field such as:

Customer_key   Customer_id   Customer_Name
10990801       109908        Acme Inc.

Assign a Number from a List
You can also assign the key sequentially from a simple list of numbers. A disadvantage of this method is that the keys have no semantic or intuitive meaning.

Metadata
You must ensure the metadata is updated to register the latest key allocations.

Verdict
The option you choose depends upon the extract methods, the tools available, and the hardware and network capability and availability.

[Slide: Creating Summary Tables]
• CTAS
• pCTAS
Diagram label: Summary data

[Slide: Filtering Data]
From warehouse to data marts
• CTAS
• pCTAS
Diagram labels: Summary data, Warehouse, Data marts

Creating Summary Tables
This course has already discussed why summary tables are useful to the data warehouse.
• They provide immediate answers to queries, which improves query performance.
• They save disk space.
You can create summary data for old history for which detailed analysis is not required.

After you perform initial user requirements analysis, you determine the summaries needed by the user. However, you must constantly monitor access, from which you may be able to determine new summaries that should be created and summaries no longer needed. You can create summaries by using:
• CREATE TABLE AS SELECT (CTAS), or
• CREATE TABLE AS SELECT... PARALLEL (pCTAS)

Filtering Data
You may filter out specific information to supply subject-specific data for dependent data marts. The filtering uses simple SQL to create new objects using existing objects. The new objects are then moved into the data mart, similar to the way data is moved into the warehouse. You can perform this filtering task using:
• CREATE TABLE AS SELECT (CTAS), or
• CREATE TABLE AS SELECT... PARALLEL (pCTAS)

[Slide: Verifying Data Integrity]
• Load data into intermediate file
• Compare target flash totals with totals before load
Diagram: counts and amounts (flash totals) of the intermediate file are compared with those of the warehouse. If they are equal (=), load. If they are not equal (!=), preserve, inspect, fix, then load.
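The CTAS approach named under Creating Summary Tables and Filtering Data can be tried out in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse server, because SQLite also accepts CREATE TABLE ... AS SELECT. The PARALLEL clause is Oracle-specific and is not shown. The sales table, its columns, the target table names, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product_id INTEGER, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "1999-01-05", 10.0),
        (1, "1999-01-20", 15.0),
        (1, "1999-02-01", 5.0),
        (2, "1999-01-07", 8.0),
    ],
)

# Summary table: one row per product and month, built with CTAS.
conn.execute(
    """
    CREATE TABLE sales_by_month AS
    SELECT product_id,
           substr(sale_date, 1, 7) AS sale_month,
           COUNT(*) AS sale_count,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY product_id, substr(sale_date, 1, 7)
    """
)

# Filtering: a subject-specific subset for a hypothetical data mart.
conn.execute(
    "CREATE TABLE mart_product_1 AS SELECT * FROM sales WHERE product_id = 1"
)

summary = conn.execute(
    "SELECT product_id, sale_month, sale_count, total_amount "
    "FROM sales_by_month ORDER BY product_id, sale_month"
).fetchall()
mart_rows = conn.execute("SELECT COUNT(*) FROM mart_product_1").fetchone()[0]
```

The same statement shape serves both purposes: a GROUP BY in the SELECT produces a summary table, and a WHERE clause produces a filtered subset for a data mart.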
Verifying Data Integrity
It is important at all stages of ETT that errors be detected, flagged, and resolved. How you verify data integrity depends upon whether you have a customized approach to ETT or whether you employ an ETT tool, which will probably deal with these issues automatically and allow you to see the data only once it is available in the warehouse.

It is important to ensure that each load, whether first-time or a refresh, executes successfully. You need to create jobs that track:
• The status of the warehouse load: whether it has started, is in progress, or is complete
• When the process completes
• Statistics to show load start and completion times, and records processed, in order to monitor and ensure continuing efficiency
• Comparison of load control counts and amounts:
  – You must be aware of the amounts of data that are to be loaded, so that you can perform an accurate validation of completeness.
  – You can load the detail and summary records into intermediate files, to compare counts and amounts created before loading with counts and amounts (flash totals) derived on the target data warehouse.
• Data reconciliation issues
• Referential integrity violations
• Any failures that require reprocessing

[Slide: Steps for Verifying Data Integrity]
Diagram labels: Source files, Control, Extract, SQL*Loader, .log, .bad (steps numbered 1 to 7)
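The comparison of load control counts and amounts (flash totals) described in the tracking list above can be sketched in Python as follows. The function names and the record layout are assumptions for illustration. Decimal is used so that monetary amounts are compared exactly rather than as binary floating-point values.

```python
from decimal import Decimal


def flash_totals(records, amount_field):
    """Return (record count, sum of amounts) for a set of records.

    records is an iterable of dictionaries; amount_field names the
    column whose values are totalled.
    """
    count = 0
    total = Decimal("0")
    for record in records:
        count += 1
        total += Decimal(str(record[amount_field]))
    return count, total


def totals_match(staged_records, loaded_records, amount_field):
    """Compare flash totals taken before the load with those on the target."""
    before = flash_totals(staged_records, amount_field)
    after = flash_totals(loaded_records, amount_field)
    return before == after
```

If totals_match returns False, the intermediate files should be preserved and inspected before any further loading.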
Steps for Verifying Data Integrity
You may find it useful to load the detail and summary records into intermediate files, so that you can compare record counts and sample totals before loading on the target data warehouse. If the counts and totals do not match, you must preserve and inspect the intermediate files without loading and compromising data warehouse data integrity.

Example
In the diagram, you see that the source data is coming in from a number of files.
1 The control and extract process queries and downloads the data, and appends a row (either a row count value or a phony row of unique data).
2 The process generates a report indicating the data extraction information, such as the number of rows downloaded, the number of bytes in the file, and the query statement.
3 The process puts the extracted data into a flat file.
4 SQL*Loader loads the data into a database table.
5 The conversion and loading process generates a loader log to track the same type of information as the extract report: the number of rows downloaded, the number of bytes contained in the file, and conversion details.
6 At the end of the load process, the SQL*Loader script removes the last record of the flat file and puts it into a filename.bad file, which contains the row count or phony record of data that was added by the extraction process.
7 A UNIX script compares the mainframe report and the loader log to see if they contain the same information. The script may also look at the .bad file to determine if the correct last row of data was removed from the loading process. If the reports match and the data in the .bad file is correct, then the loading process is deemed successful.
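The trailer-row check in steps 1, 6, and 7 can be sketched in Python. This is a simplified illustration and not the UNIX script the example refers to: the TRAILER| marker, the function names, and the in-memory lists standing in for the flat file and the .bad file are all assumptions.

```python
TRAILER_PREFIX = "TRAILER|"  # hypothetical marker for the appended row


def append_trailer(data_lines):
    """Step 1: append a row-count trailer to the extracted data lines."""
    lines = list(data_lines)
    return lines + [f"{TRAILER_PREFIX}{len(lines)}"]


def verify_load(loaded_row_count, bad_file_lines):
    """Steps 6 and 7: check the rejected trailer against the loaded count.

    The load is accepted only if exactly one trailer row was rejected
    and its row count equals the number of rows actually loaded.
    """
    trailers = [
        line for line in bad_file_lines if line.startswith(TRAILER_PREFIX)
    ]
    if len(trailers) != 1:
        return False
    try:
        expected = int(trailers[0][len(TRAILER_PREFIX):])
    except ValueError:
        return False
    return expected == loaded_row_count
```

For an extract of three data lines, the trailer is TRAILER|3, and verify_load(3, ["TRAILER|3"]) accepts the load while verify_load(2, ["TRAILER|3"]) rejects it.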
If you are writing a custom mechanism, embed a set of rows into the data so that verification is easier. You can query for the embedded data to see that all rows are loaded. Your routine may also display messages, which are embedded in the load routine, or send an e-mail.

[Slide: Standard Quality Assurance Checks]
• Load status
• Completion of the process
• Completeness of the data
• Data reconciliation
• Violations
• Reprocessing
• Comparison of counts and amounts

Standard Quality Assurance Checks
The following tasks are standard quality assurance checks for the data loaded into the warehouse:
• Status of the warehouse load
• Completion of the load process
• Completeness of the data
• Data reconciliation
• Referential integrity violations and reprocessing
• Comparison of load control counts and amounts
[Slide: Summary]
This lesson discussed the following topics:
• First-time load considerations
• Techniques for transporting data
• Tasks involved in the postload processing stage

Summary
This lesson discussed the following topics:
• Tasks involved with first-time loading of data into the warehouse
• Techniques for transporting data
• Tasks involved in the postload processing stage

[Slide: Practice 12-1 Overview]
This practice covers the following topics:
• Identifying a series of statements as true or false
• Answering a series of questions
Practice 12-1
1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will determine the load window where you work. Consider user requirements, operational constraints, and staffing issues.
2 Identify whether the following statements are true or false.
Question (mark True or False)
• Transportation of data involves moving the data into the data warehouse database.
• An example of high level grain data is summarized data.
• SQL*Loader is the fastest way to move data into the data warehouse database.
• Gateways are useful for moving large amounts of data into the warehouse.
• Data for the data warehouse is always indexed after it is loaded.
• The quickest way to create unique indexes on warehouse data is to leave database constraints enabled on load.
• Summary tables are created on the warehouse server.
• Filtering removes unwanted records from staging files.
3 Name the two different types of data loading.
_____________________
_____________________
4 Name four methods of moving data to the warehouse server.
_____________________
_____________________
_____________________
_____________________
5 What SQL command is used to create summary tables on the data warehouse server?
________________________________________________________________
6 What server technique can be used to prevent and allow access to data in the warehouse after refresh?
________________________________________________________________
Lesson 13: Transportation: Refreshing Warehouse Data

[Slide: Overview. Course roadmap blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Choosing a Computing Architecture; Meeting a Business Need; Modeling the Data Warehouse; Analyzing User Query Needs; Planning Warehouse Storage; ETT (Building the Warehouse); Managing the Data Warehouse; Supporting End User Access; Project Management (Methodology, Maintaining Metadata)]

[Slide: Objectives]

Overview
In the last lesson, you examined the first-time load of the warehouse. In this lesson, you examine methods for updating the warehouse with changed data after the first-time load.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.
Objectives
After completing this lesson, you should be able to do the following:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Discuss techniques for purging and archiving data
• Outline final tasks, such as publishing the data, controlling access, and automating processes
• List tools for transporting data into the warehouse

[Slide: Developing a Refresh Strategy for Capturing Changed Data (diagram: operational databases at times T1, T2, T3)]
• Consider load window
• Identify data volumes
• Identify cycle
• Know the technical infrastructure
• Plan a staging area
• Determine how to detect changes

[Slide: User Requirements and Assistance (diagram: operational databases at times T1, T2, T3)]
• Users define the refresh cycle
• IT balances requirements against technical issues
• Document all tasks and processes
• Employ user skills

Capturing Changed Data
You must have a strategy for maintaining changes to the data warehouse, including changes to facts, dimension data, and summary data.
There are no concrete rules about when the data warehouse should be refreshed, but there are several factors to consider:
• The total load window available
• The volume of data to be transferred
• The refresh cycle: How often does the warehouse data need to be updated? When are you going to move the data? Will you refresh monthly, weekly, or at another time interval? Will you use continuous refresh for nearly real-time data?
• The technical infrastructure: What connectivity is available for moving the data into the data warehouse? How are you going to move the data? Will you move data in batch mode, which is feasible for less time-critical applications?
• The staging area: Are you going to move data from operational systems to an intermediate area? Is this area an operational data store? Is it a flat file? Is it an Oracle database? Or is it something completely unique to your implementation?
• Change detection: How are changes in data to be detected? Are you going to push the changes through when detected? Are you going to pull the changes in? Where are you going to store the changes? Could you use triggers to force changes into an alternative store?

User Requirements and Assistance
The strategy is primarily defined by user requirements, but these must be balanced against the available technology and load windows. All tasks and processes must be documented and understood by everyone involved in the project. The users can also provide expertise for load verification, validation, run-to-run controls, and load controls.
[Slide: Load Window (timeline: a 24-hour day divided into a load window and a user access period)]
• Time available for entire ETT process
• Plan
• Test
• Prove
• Monitor

[Slide: Load Window (timeline: user access period within a 24-hour day)]
• Plan and build processes according to a strategy.
• Consider volumes of data.
• Identify technical infrastructure.
• Ensure currency of data.
• Consider user access requirements first.
• High availability requirements may mean a small load window.

Load Window
The load window is the amount of time you have available to extract, transform, load, and postload process the data, and to make the data warehouse available to the user. The load consists of many sequential tasks that take time to execute. You must ensure that every event that occurs during the load window is planned, tested, proven, and constantly monitored. Poor load performance extends the load time and prevents users from accessing the data when it is needed. Careful planning, defining, testing, and scheduling are critical.

Load Window Strategy
The load time depends on a number of factors, such as data volumes, network capacity, and load utility capabilities. Remember that the aim is to ensure the currency of data for the users, who require access to the data for analysis. To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point.
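Working the schedule backward from the user access time is simple date arithmetic. The sketch below assumes strictly sequential tasks with known durations; the task names, the durations, and the function name `schedule_backward` are illustrative assumptions, and a real load cycle may overlap some processes.

```python
from datetime import datetime, timedelta


def schedule_backward(access_start, tasks):
    """Work backward from the time users need the data.

    tasks is a list of (name, minutes) in execution order.
    Returns a list of (name, start, end) in execution order.
    """
    schedule = []
    end = access_start
    for name, minutes in reversed(tasks):
        start = end - timedelta(minutes=minutes)
        schedule.append((name, start, end))
        end = start
    schedule.reverse()
    return schedule


# Hypothetical requirement: users need access from 9 a.m.
access_start = datetime(1999, 5, 3, 9, 0)
# Hypothetical task durations in minutes.
tasks = [
    ("extract", 60),
    ("transform", 90),
    ("load", 120),
    ("postload processing", 60),
]

plan = schedule_backward(access_start, tasks)
for name, start, end in plan:
    print(f"{start:%H:%M} to {end:%H:%M}  {name}")
```

With these assumed durations the first task must start at 03:30. If the computed start falls before the earliest time the source data is available, the load window is too small for a sequential run.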
Determining the Load Window
It is usual to define the user access requirements first and work the load schedule backward from that point. Once the user access time is defined, you can establish the load cycles. Some of the processes overlap to enable all processes to run within the window.
More realistically, almost twenty-four-hour access is required. This means the load window is significantly smaller than the example shown here. In that event, you need to consider how to process the load while still presenting users with current, realistic data. This is where you can use partitioning strategies.

[Slide: Scheduling the Load Window. Steps 1 to 4 on a timeline from 0 to 3 a.m.: 1 Requirements; 2 Load cycle (receive data: File 1 and File 2 by FTP); 3 Control file (file names, file types, number of files, number of loads, first-time load or refresh, date of file, date range, record counts, amount totals); 4 Control process (open and read files to verify and analyze)]

Scheduling the Load Window
From the example you can see that the transportation of data (that is, moving the data to the server and loading it into the warehouse tables) is a complex task involving many steps.
To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point.

Example of Scheduling the Load Window
1 Determine when the users require the data. If the working hours are between 9 a.m. and 5 p.m., you allow them access during that period.
2 Once the user data-access time is defined, you can establish the load cycle. The load cycle may need to access different extract files, or a different number of extract files, each time the load is performed. You may need to split the cycle into a series of loads using one file at a time.
3 Create a control file to manage every load, or series of loads. Remember that the first-time load is different from refreshes, and that for each refresh the files and number of files may differ. The control file contains information such as the:
– File name and type
– Date of the file
– Number of records in the file
– Date range for the data in the file
– Counts of records and totals so that the data load can be verified
4 The control process is an active process that waits for the files named in the control file to be received. The number and names of these files vary among loads. Files are usually transferred using File Transfer Protocol (FTP). The control process does not pass control to any other process until all files are received and it has opened the files and read the count and amount data to be used for load verification and analysis.
Note: The time 0 identified on the slides denotes 00:00 Zulu, which is midnight.
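Steps 3 and 4 can be sketched as a control structure plus a check that the control process would run repeatedly until every file has arrived and verified. This is only an illustration: the control file layout, the CSV extract format, the `amount` column, and the function names are assumptions, not a format defined by the course or by any Oracle tool.

```python
import csv
import os
import tempfile
from decimal import Decimal


def missing_files(control, directory):
    """Names of files listed in the control structure that have not arrived."""
    return [
        entry["name"]
        for entry in control["files"]
        if not os.path.exists(os.path.join(directory, entry["name"]))
    ]


def verify_file(entry, directory):
    """Open and read one file; compare its count and amount with the control entry."""
    count = 0
    total = Decimal("0")
    with open(os.path.join(directory, entry["name"]), newline="") as fh:
        for row in csv.DictReader(fh):
            count += 1
            total += Decimal(row["amount"])
    return count == entry["record_count"] and total == entry["total_amount"]


def control_process(control, directory):
    """One pass of the control process: wait, fail, or release the load."""
    missing = missing_files(control, directory)
    if missing:
        return ("waiting", missing)
    failed = [e["name"] for e in control["files"] if not verify_file(e, directory)]
    return ("failed", failed) if failed else ("ready", [])


# Hypothetical control information for a refresh made up of two extract files.
control = {
    "load_type": "refresh",
    "files": [
        {"name": "sales_1.csv", "record_count": 2, "total_amount": Decimal("25.50")},
        {"name": "sales_2.csv", "record_count": 1, "total_amount": Decimal("4.50")},
    ],
}

with tempfile.TemporaryDirectory() as landing:
    with open(os.path.join(landing, "sales_1.csv"), "w", newline="") as fh:
        fh.write("order_id,amount\n1,10.00\n2,15.50\n")
    first_check = control_process(control, landing)   # second file not yet received
    with open(os.path.join(landing, "sales_2.csv"), "w", newline="") as fh:
        fh.write("order_id,amount\n3,4.50\n")
    second_check = control_process(control, landing)  # all files received and verified

print(first_check)
print(second_check)
```

In practice the check would be polled on a timer or driven by file-arrival events, and only a "ready" result would allow the load step to start.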
[Slide: Scheduling the Load Window. Steps 5 to 9 on a timeline from 3 a.m. to 9 a.m.: 5 Load into warehouse (File 1 and File 2, parallel load); 6 Verify, analyze, reapply; 7 Index data; 8 Create summaries; 9 Update metadata]

[Slide: Scheduling the Load Window. Steps 10 to 13 on a timeline from 6 a.m. to 9 a.m.: 10 Back up warehouse; 11 Create views for specialized tools; 12 Users access summary data; 13 Publish; user access begins]

Example of Scheduling the Load Window (continued)
5 The data is then loaded into the warehouse.
6 Each load requires verification and analysis (and possibly reanalysis once any load exceptions are reapplied). You need to ensure that the data is successfully loaded by performing checks against the row counts and amounts available in the control files. Any data rejected because of loading errors needs to be reapplied. This adds time to the load, and contingency should be built into the cycle to cope with this. If you are using SQL*Loader to move the data, the bad data resides in a file called <filename>.bad.
7 Indexes are constructed.
8 Summarization takes place.
9 Metadata is updated to ensure it contains information about the current load.
10 The warehouse is backed up. With many database servers today, there are typically two mechanisms for backup: hot, with users online, and cold, with users offline. You should consider cold backups before user access. The backup should include:
– All warehouse data
– Summary tables
– Database schema
– Metadata
Note: If the information is supplied to the warehouse on tape, a full cold backup may not be necessary.
The summaries created at the target server may be all that you need to back up.
11 Create the views required by specialized user tools, such as Oracle Express RAM/RAA.
12 Give users access to the summary data.
13 Publish information to the users, specifying the changes to the data warehouse and allowing them access.
Note: These steps identify one solution and assume that summarization and indexing occur after load, and that the job is executed from a batch file.

[Slide: Capturing Changed Data for Refresh]
• Capture new fact data
• Capture changed dimension data
• Determine method for capture of each
• Methods:
– Wholesale data replacement
– Comparison of database instances
– Time stamping
– Database triggers
– Database log
• Hybrid techniques

Capturing Changed Data for Refresh
There are two major categories of changed data:
• New fact data
• Changed dimension data
For each, a different capture mechanism will be discussed. In addition, consider how you will process the load. The fact data might easily be loaded by adding another partition of data, a relatively straightforward process (for a database administrator). Changes to dimension data require a more selective update.
You need to evaluate whether the change is to replace or add to an existing record, or whether you want to maintain history (keeping old and new records).
• For example, the description of a product may change over its lifetime, even if its primary (and unique) part number remains the same. It is important to see that change reflected.
• Another common example is the sales districts of a sales organization that reorganizes.

Methods
There are a number of ways to capture changes to data. Consider which is the most efficient for your individual circumstances:
• Wholesale data replacement
• Comparison of database instances
• Time and date stamping
• Database triggers
• Database log
Note: All methods identified here are possible with Oracle server and associated facilities and utilities.

[Slide: Wholesale Data Replacement (diagram: operational databases at times T1, T2, T3)]
• Expensive
• Limited historical data, if any
• Data mart implementations
• Time period replacement

[Slide: Comparison of Database Instances (diagram: yesterday’s and today’s operational databases feed a database comparison; a delta file holds the changed data)]
• Simple to perform, but expensive in time and processing
• Delta file:
– Changes to operational data since last refresh
– Used by various techniques
Wholesale Data Replacement
This method refreshes the entire warehouse in every business cycle. It is understandably very expensive, because every refresh needs to extract, transform, and transport the entire warehouse. In fact, this method is similar to using a first-time load on a regular basis.
Some data mart and online analytical processing server implementations use this method because they hold less data (a subset of the data warehouse), and wholesale replacement is less complex and less expensive than programming mirroring and update procedures.

Issues
The time window required for wholesale replacement can often exceed the time that the data is contracted to be offline (and unavailable to the users). However, with mirroring strategies users can be directed to an image copy of the data warehouse while maintenance is being performed. The changes that occur during the maintenance cycle must be applied to the current online image (production version). The production version should then be backed up or mirrored.
Historical data analysis is limited, because you are restricted by the sheer volume of data loaded each time.

Comparison of Database Instances
In this method, you capture the differences between two instances of the same database to find the changes that have occurred since the last time the data warehouse was refreshed. The changes are held in an intermediate (or delta) file and are used to update the warehouse.

Issues
It is a simple but expensive way to determine changes. As with wholesale replacement, it works more efficiently and effectively if the volumes of data are small.

Delta File or Database
The delta database (or file) contains only the changes that have been made to the operational system since the last refresh.
An operational application may need to be modified to create the delta file structure and to contain the new logic that captures changes and adds the rows to the delta file.

[Slide: Time and Date Stamping (diagram: operational data feeds a delta file that holds changed data)]
• Fast scanning for records changed since last extraction
• Date Updated field
• No detection of deleted data

[Slide: Database Triggers (diagram: triggers on the operational server (DBMS) write changes from the operational data to a delta file that holds changed data)]
• Changed data captured at the server level
• Extra I/O required
• Maintenance overhead

Time and Date Stamping
A time and date stamp on changed data quickly shows you the data that has been changed since the last refresh cycle. The time and date stamp is normally part of a key value, making it an efficient way to search and find changed data.
The advantage of this approach is that the process that creates the delta database only needs to look at the time key and identify the records with the required time and date stamp.
Depending upon the frequency of refresh and the mechanism chosen for time and date stamping, the search for the time value may be for a specific date (for example, Time_Key = '01-JAN-97'), for a date range (such as Time_Key BETWEEN '01-JAN-97' AND '07-JAN-97'), or for a pattern (such as Time_Key LIKE '%JAN-97').

Issues
You can use this method only if the database contains a Date Updated field, which may not be the case in many operational systems. This issue may be resolved by reengineering source system applications or database server code. You might add database triggers to perform the updates.
Note: Time and date stamps do not catch deleted data.

Database Triggers
Procedural code in database triggers captures and identifies changed data at the database level. Extra I/O is required while the system is online to track changes as they occur and to maintain a delta file if needed.

Issues
You must modify the database to add server (DBMS) triggers that capture before and after images of the records. The triggers and associated code (PL/SQL, if using Oracle) write the changes to a delta database or file. To use this method, the server must support database triggers.

[Slide: Using a Database Log (diagram: on the operational server (DBMS), the operational data writes to a log; log analysis and data extraction produce a delta file that holds changed data)]
• Contains before and after images
• Requires system checkpoint
• Common technique

[Slide: Verdict]
• Consider each method on merit.
• Consider current technical, existing operational, and current application issues.
• Consider a hybrid approach if one approach is not suitable.

Using a Database Log
A log file contains information from which you can extract changed data; it logs “before” and “after” images of the data. You may analyze the log file in batch mode to identify the differences that become the delta file.

Issues
• The format of the log file may be difficult to interpret and use.
• The log file is not really intended for use by the warehouse, and often contains a lot of data not required by the warehouse.
• The system must wait for a checkpoint in order to get a stable log.
This is a process that many ETT tools use, but it can be done only on databases that provide a log, such as Oracle and DB2.
Note: Oracle snapshot and replication facilities log changes into another table.

Verdict
Each of these mechanisms has its good and bad points. In reality, your data warehousing environment might use a combination of these mechanisms. For example, you might:
• Time-stamp changed dimension data
• Extract the new fact data that exists within a database partition
• Use wholesale replacement to supply your dependent data marts with updated data
The choice you make is based on the many factors identified earlier in this lesson.
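One reason a combination of mechanisms may be needed is that time and date stamps do not catch deleted data, whereas a comparison of database instances does. The sketch below contrasts the two on a tiny example using Python's built-in sqlite3 module. The table name, column names, ISO-format date strings, and sample rows are all assumptions made for the illustration; they are not taken from the course.

```python
import sqlite3

# Yesterday's copy of the operational data, keyed by customer id (hypothetical rows).
yesterday = {
    1: ("John Doe", "Single"),
    2: ("Jane Roe", "Married"),
    3: ("Sam Poe", "Single"),
}

# Today's operational data: customer 1 changed, customer 3 was deleted,
# customer 4 is new. date_updated plays the role of the Date Updated field.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, status TEXT, date_updated TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", "Married", "1997-01-05"),
        (2, "Jane Roe", "Married", "1996-11-20"),
        (4, "Ann Lee", "Single", "1997-01-06"),
    ],
)

# Method 1: time and date stamping. ISO date strings sort chronologically,
# so a plain comparison finds rows stamped after the last refresh.
last_refresh = "1997-01-01"
stamped = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM customer WHERE date_updated > ? ORDER BY id", (last_refresh,)
    )
]

# Method 2: comparison of database instances. Every row of both copies is read.
today = {
    row[0]: (row[1], row[2])
    for row in conn.execute("SELECT id, name, status FROM customer")
}
delta = {
    "inserted": sorted(today.keys() - yesterday.keys()),
    "deleted": sorted(yesterday.keys() - today.keys()),
    "updated": sorted(
        k for k in today.keys() & yesterday.keys() if today[k] != yesterday[k]
    ),
}

print(stamped)  # ids found by the time stamp search
print(delta)    # full delta found by comparing the two copies
conn.close()
```

The time stamp search touches only recently stamped rows but never sees customer 3, which no longer exists in the source. The comparison finds the deletion at the cost of reading both copies in full.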
[Slide: Applying the Changes to Data]

[Slide: Overwriting a Record (example: the record “Customer Id, John Doe, Single” becomes “Customer Id, John Doe, Married”)]
• Easy to implement
• Loses all history
• Not recommended

Applying the Changes to Data
There are a number of methods for managing changes to existing data in dimension tables:
• Overwrite a record
• Add a new record
• Add a current field
• Maintain history records
• Versioning of records

Overwriting a Record
This method is easy to implement, but it is useful only if you are not interested in maintaining the history of data. If the data you are changing is critical to the context of information and analysis of the business, then overwriting a record is to be avoided at all costs.
For example, by overwriting dimension data, you lose all track of history: you can never see that John Doe was single if the value “Single” is overwritten with the value “Married” from the operational system.
The Customer_Id for John Doe remains constant throughout the life of the warehouse, because only one record for John Doe is stored.

[Slide: Adding a New Record (example: the record with key 1, “John Doe, Single”, is kept, and a record with key 1A, “John Doe, Married”, is added)]
• History is preserved; dimensions grow.
• Time constraints are not required.
• Generalized key is created.
• Metadata tracks usage of keys.

[Slide: Adding a Current Field (example: the record “Customer Id, John Doe, Single” becomes “Customer Id, John Doe, Single, Married, 01-JAN-96”)]
• Maintains some history
• Loses intermediate values
• Is enhanced by adding an Effective Date field

Adding a New Record
Using this method, you add another dimension record for John Doe. One record shows that he was “single” until December 31, 1995, another that he was “married” from January 1, 1996. Using this method, history is accurately preserved, but the dimension tables get bigger.
• A generalized (or artificial) key is created for the second John Doe record.
• The generalized key is a derived value that ensures that a record remains unique. However, you now have more than one key to manage.
• You also need to ensure that the record keeps track of when the change occurred.
The Customer_Id for John Doe does not remain constant throughout the life of the warehouse, because each record added for John Doe contains a unique key. The key value is usually a combination of the operational system identifier with characters or digits appended to it. Consider using real data keys. The example here shows a method that is commonly identified in warehouse reference material.

Adding a Current Field
In this method, you add a new field to the dimension table to hold the current value of the attribute.
Using this method, you can keep some track of history. You know that John Doe was “single” before he was “married”. Each time John’s marital status changes, the two status attributes are updated and a new Effective Date is entered. However, what you cannot see from this method is what changes have taken place between the two values you are storing for John Doe: intermediate values are lost.
• Consider using an Effective Date attribute to show when the status changed.
• Partitioning of data can then be performed by effective date.
The method you choose is again determined by the business requirements. If you want to maintain history, this method is a logical choice that can be enhanced by using a generalized key.
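The three techniques described so far (overwriting a record, adding a new record, and adding a current field) can be contrasted on the John Doe example. The sketch below uses plain Python dictionaries in place of dimension rows. The field names, the two-digit key suffix, and the second status change are hypothetical choices made for the illustration.

```python
def overwrite(row, new_status):
    """Overwrite a record: the old value is lost."""
    row["status"] = new_status


def add_record(dimension, customer_id, new_status, effective_date):
    """Add a new record under a generalized key; earlier rows are kept."""
    versions = [r for r in dimension if r["customer_id"] == customer_id]
    new_row = dict(versions[-1])  # unchanged fields are copied into the new row
    # Hypothetical key scheme: operational id plus a two-digit suffix.
    new_row["key"] = f"{customer_id}{len(versions):02d}"
    new_row["status"] = new_status
    new_row["effective_date"] = effective_date
    dimension.append(new_row)
    return new_row


def update_current_field(row, new_status, effective_date):
    """Add a current field: only the previous and current values are kept."""
    row["previous_status"] = row["current_status"]
    row["current_status"] = new_status
    row["effective_date"] = effective_date


# 1. Overwriting: no trace remains that John Doe was single.
overwritten = {"customer_id": 1, "name": "John Doe", "status": "Single"}
overwrite(overwritten, "Married")

# 2. Adding a record: history is preserved, but the dimension grows.
dimension = [
    {"key": "1", "customer_id": 1, "name": "John Doe",
     "status": "Single", "effective_date": None}
]
add_record(dimension, 1, "Married", "1996-01-01")

# 3. Adding a current field: a second (hypothetical) change pushes out "Single".
current = {"customer_id": 1, "name": "John Doe",
           "previous_status": None, "current_status": "Single",
           "effective_date": None}
update_current_field(current, "Married", "1996-01-01")
update_current_field(current, "Divorced", "1998-06-01")

print(overwritten)
print(dimension)
print(current)
```

After the second change, the current-field row holds only “Married” and “Divorced”, which is the loss of intermediate values described above, while the added-record dimension still holds every version.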
Slide: Limitations of Methods for Applying Changes
• Complete history impossible
• Dimensions may grow large
• Maintenance overhead

Custid   Name    Address         Phone      Effective Date
1234     Comer   1 Main Street   555-6789   01-Apr-93
123401   Comer   200 First Ave   222-3211   01-Jun-97

Limitations of Methods for Applying Changes
Assume a customer record as follows:

Custid   Name    Address         Phone
1234     Comer   1 Main Street   555-6789

If you overwrite the record, history is lost, and there is no record of this company ever existing at 1 Main Street.

Custid   Name    Address         Phone
1234     Comer   200 First Ave   222-3211

You may add a record and create a generalized key to identify the row uniquely. However, this method may make the dimension large and unmanageable, and you have lost that customer’s unique identifier.

Custid   Name    Address         Phone
1234     Comer   1 Main Street   555-6789
123401   Comer   200 First Ave   222-3211

You also have to duplicate the fields for this customer that have not changed into the record with the new generated key, which adds to the maintenance burden.

You may add a current field and create a generalized key to uniquely identify the row:

Custid   Name    Address         Phone      Effective Date
123401   Comer   200 First Ave   555-6789   01-Jun-97

In this situation, you know that 200 First Ave. is the current address, but you have no way of knowing the previous address details.
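The limitations above can be made concrete with a small Python sketch. It is illustrative only: the function names and the `previous_address` field are assumptions, the "current field" behavior follows one reading of the notes (both the previous and the current value are updated on each change), and the third address is invented purely to show a second change.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def overwrite(record: Record, address: str) -> Record:
    """Overwrite in place: the old address is no longer stored anywhere."""
    updated = dict(record)
    updated["address"] = address
    return updated


def add_current_field(record: Record, address: str,
                      effective_date: str) -> Record:
    """Keep a previous and a current value; older values drop out."""
    updated = dict(record)
    updated["previous_address"] = record["address"]
    updated["address"] = address
    updated["effective_date"] = effective_date
    return updated


comer: Record = {"custid": "1234", "name": "Comer", "address": "1 Main Street"}

overwritten = overwrite(comer, "200 First Ave")

# The third address and its date are hypothetical, not from the source.
once = add_current_field(comer, "200 First Ave", "01-Jun-97")
twice = add_current_field(once, "9 Example Road", "01-Jan-99")
print(overwritten)
print(twice)
```

After the second change, 1 Main Street no longer appears in the record at all, which is the "complete history impossible" limitation in miniature.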
Slide: Maintaining History
(Diagram: the HIST_CUST, CUSTOMER, Time, Sales, and Product tables, with a one-to-many relationship between CUSTOMER and HIST_CUST.)
• Always retain current record
• Consistently able to refer to record history

Maintaining History
Another alternative is to use history tables, which involve normalizing the dimensions to hold current and historical data. Oracle consultants engaged in data warehouse implementations have found this method to be a more comprehensive, effective, and easily managed solution.

One-to-Many Relationship
Using this method, you keep one current record of the customer and many history records in the customer history table (a one-to-many relationship between the tables), thus maintaining history in a more normalized data model. The table below shows you how the data might appear. In the CUSTOMER table, the customer operational unique identifier is retained in the CUSTOMER.Id column. In the HIST_CUST table, the operational key is maintained in the HIST_CUST.Id column and the generalized key in the HIST_CUST.G_id column. This enables you to keep all the keys needed and multiple records for the customer.

CUSTOMER.Id   HIST_CUST.Id   HIST_CUST.G_id
1234          1234           1234
              1234           1234A
              1234           1234B
4567          4567           4567
5678          5678           5678
              5678           5678A
              5678           5678B

The CUSTOMER table may contain full details for each customer; however, it could contain only the key values, leaving the full details (including text descriptions) in the HIST_CUST table.

Slide: History Preserved
• History enables realistic analysis.
• History retains context of data.
• History provides for realistic historical analysis.
• Model must be able to:
  – Reflect business changes
  – Maintain context between fact and dimension data
  – Retain sufficient data to relate old to new

History Preserved
This method completely preserves history and is therefore very effective for performing analysis over time where data has changed substantially. The context of information is still preserved.

A good example of where this applies is in a sales organization. Assume that you have a model containing a sales fact and dimensions such as Customer, Sales Region, and Product. Your warehouse contains sales figures for sales region Europe for the years 1992 and 1993. In 1994, the European region reorganizes and splits into East Europe and West Europe. The warehouse is now maintaining data for each region from 1994 onward.
In 1997, users are asked to put together some projections based on the last five years’ sales in Europe. The data for 1992 and 1993 is not split into East Europe and West Europe. That is not an issue, because you can still roll up the East and West regions into a total for Europe and perform analysis over the five-year period. If the scenario is reversed and two regions become one, the solution is the same.

The issue with retaining history and context is building a model that is able to:
• Reflect changes as the business changes
• Keep the context of information accurate between dimension and fact data
• Retain sufficient data to be able to relate old and new records where needed

Slide: Version Numbering
(Diagram: the Sales fact table with the Customer, Time, and Product dimensions.)
• Avoid double counting
• Facts hold version number

Customer.CustId   Version   Customer Name
1234              1         Comer
1234              2         Comer

Sales.CustId   Sales Facts   Version
1234           11,000        1
1234           12,000        2
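The idea behind version numbering, that each fact row carries the customer version so that every sale is counted once, can be sketched in Python. The fact rows are the ones tabulated in the Version Numbering notes; the function names and the tuple layout are illustrative assumptions rather than part of the course material.

```python
from collections import defaultdict

# (custid, version, address) rows of the Customer dimension.
CUSTOMER_DIM = [
    (1234, 1, "1 Main Street"),
    (1234, 2, "200 First Ave"),
]

# (custid, version, sales) rows of the fact table.
SALES_FACTS = [
    (1234, 1, 11_000),
    (1234, 2, 12_000),
    (1234, 1, 5_000),
    (1234, 1, 10_000),
    (1234, 2, 45_000),
    (1234, 2, 30_000),
    (1234, 1, 10_000),
]


def sales_by_version(facts):
    """Total the sales for each (custid, version) pair."""
    totals = defaultdict(int)
    for custid, version, sales in facts:
        totals[(custid, version)] += sales
    return dict(totals)


def joined_sales_total(dim, facts, match_version):
    """Sum sales over a dimension-to-fact join.

    Joining on custid alone pairs every fact with every version of the
    customer, so each sale is counted once per version.
    """
    total = 0
    for d_custid, d_version, _address in dim:
        for f_custid, f_version, sales in facts:
            if d_custid == f_custid and (not match_version
                                         or d_version == f_version):
                total += sales
    return total


print(sales_by_version(SALES_FACTS))
print(joined_sales_total(CUSTOMER_DIM, SALES_FACTS, match_version=True))
print(joined_sales_total(CUSTOMER_DIM, SALES_FACTS, match_version=False))
```

With the version included in the join, the grand total equals the sum of the two per-version totals; without it, every sale is counted twice because the customer has two dimension rows.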
Version Numbering
You can also maintain a version number for the customer in the Customer dimension:

Custid   Name    Address         Version
1234     Comer   1 Main Street   1
1234     Comer   200 First Ave   2

You must ensure that the measures in the fact table, such as sales figures, also contain the customer version number to avoid any double counting:

Custid   Version   Sales $
1234     1         11,000
1234     2         12,000
1234     1         5,000
1234     1         10,000
1234     2         45,000
1234     2         30,000
1234     1         10,000

For Comer Version 1, the sales total is $36,000. For Comer Version 2, the sales total is $87,000.

Slide: Purging and Archiving Data
• As data ages, its value depreciates.
• Remove old data from the warehouse:
  – Archive for later use
  – Purge without copy

Purging and Archiving Data
Data may reside in the warehouse for many more years than it would in an operational system; however, it does not remain forever. The value of data to the business diminishes over time. During analysis, the analysts determine the useful life span of the data. In addition, old data may simply be summarized when the detail is no longer needed.

What Is Purge?
If there is no chance of ever needing the data again, even for summaries, then you can purge it.
This removes the data entirely; no copy is retained.

What Is Archive?
If you feel you may need the data in the future (to build summaries, for example), then archive the data to low-cost storage devices that are not associated with the data warehouse.

Your Role
You need to ensure that you have strategies in place that meet the determined business requirements for purge and archive.

Slide: Techniques for Purging Data
• TRUNCATE: Retains no rollback
• DELETE: Retains redo and rollback
• ALTER TABLE: Removes a partition
• PL/SQL: Uses database triggers

Techniques for Purging Data

TRUNCATE Command
The SQL TRUNCATE command is the quickest way to purge data. It retains no rollback information, so a rollback is impossible. It is also useful for emptying a temporary table that is used repeatedly as part of a regular load or summary process. Indexes on the table are also truncated.

DELETE Command
The SQL DELETE command is used if the data has not been partitioned. DELETE retains redo and rollback information, so you need to size the rollback segments carefully. NOLOGGING does not apply to DELETE or UPDATE. DELETE runs in parallel only on partitioned tables.
Oracle8 syntax enables you to delete rows from a partition. When you delete rows from a table, the corresponding entries in every index on the table must also be deleted. This has a performance impact.

ALTER TABLE Command
Given that your warehouse data is commonly partitioned by time, you can simply remove a partition containing old data.

PL/SQL Triggers
Where there are special requirements and low volumes of data, you can use PL/SQL and the ON DELETE database trigger. This is, however, an expensive option.

Slide: Techniques for Archiving Data
(Diagram: Database, EXP, .dmp file, IMP.)
• Export to dump file from tables
• Import to tables from dump file
• ALTER TABLE EXCHANGE partitions

Slide: Verdict
• Defined by business requirements
• Must be managed

Techniques for Archiving Data

Import and Export Utilities
The export utility enables you to move data from tables to a dump file (called filename.dmp). The import utility can then read that dump file and load the data back into the same or another user’s schema.
You can export in two ways:
• A conventional path export uses a conventional SELECT statement to extract table data, which is held for a short time in an evaluation buffer. Once evaluated, it is transferred to the dump file.
• A direct path export does not use the evaluation buffer.

ALTER TABLE
You can also switch a partition of data with an empty table, drop the empty partition, and export the table. Archive the exported table when you have time.

Verdict
The method you employ depends upon your individual business requirement, although the history model is a popular choice in the current warehousing environment. You must ensure that someone in the data warehouse administration is responsible for managing and tracking these changes.

Slide: Final Tasks
(Diagram labels: Sources, Stage, Rules, Extract, Transform, Load, Publish, Query.)
• Update metadata
  – ETT
  – User
• Publish data
  – Availability
  – Changes
  – Subject area basis
• Use database roles to prevent and allow access

Final Tasks

Update the Metadata
Once your data has been loaded successfully, ensure that the metadata is updated.
You need to consider many aspects, including information about the processes themselves. The most important aspect at this time is to ensure that the metadata reflects the new information available. Users must be made aware of changes such as the validity of data, the date of data, any new data available, revised summaries, removed summaries, new algorithms, and new meanings of values.

Publish Data
So that users are presented with a consistent view of the data, ensure that user access is denied while the ETT processes are executing. You should allow access only when all tasks are complete, validation has occurred, and the metadata has been updated. You may choose to do this on a subject area basis, on a user basis, or for the entire warehouse. Again, like many other tasks, this depends upon your individual data warehouse or data mart implementation.

Accessing the Refreshed Warehouse
With Oracle, using roles and granting and revoking privileges is the simplest method of preventing and allowing access. You may advise the users that the warehouse is available by internal e-mail mechanisms. Alternatively, if you have strict service level agreements (SLAs) that state users must have access from, say, 8:00 a.m. every working day, then advice may not be needed. You could e-mail or advise users only if the warehouse is not available because of some unforeseen problem.
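The publishing rule above (allow access only when all tasks are complete, validation has occurred, and the metadata has been updated) can be sketched as a small gate that builds the role statements to run. This is a hedged illustration: the class, the function names, the role `dw_query_role`, and the user names are all hypothetical, and the sketch only builds statement text in the `GRANT role TO user` and `REVOKE role FROM user` forms; it does not connect to a database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RefreshStatus:
    tasks_complete: bool
    validated: bool
    metadata_updated: bool


def ready_to_publish(status: RefreshStatus) -> bool:
    """Access is allowed only when every refresh condition holds."""
    return (status.tasks_complete and status.validated
            and status.metadata_updated)


def access_statements(status: RefreshStatus, role: str,
                      users: List[str]) -> List[str]:
    """Build GRANT statements when ready to publish, REVOKE otherwise."""
    if ready_to_publish(status):
        return [f"GRANT {role} TO {user};" for user in users]
    return [f"REVOKE {role} FROM {user};" for user in users]


# Role and user names are hypothetical placeholders.
users = ["analyst_a", "analyst_b"]
loading = RefreshStatus(tasks_complete=True, validated=True,
                        metadata_updated=False)
done = RefreshStatus(tasks_complete=True, validated=True,
                     metadata_updated=True)

for statement in access_statements(loading, "dw_query_role", users):
    print(statement)
for statement in access_statements(done, "dw_query_role", users):
    print(statement)
```

Granting the role to users (rather than granting table privileges one by one) keeps the switch to a single statement per user; the same gate could be applied per subject area by keeping one role per subject area.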
Slide: Publishing Data
• Control access using database roles
• 24-hour operation may be requested
• Compromise between load and access
• Consider:
  – Staggering updates
  – Using temporary tables
  – Using separate tables

Publishing Data
The term “publishing data” describes the point at which the data is loaded and made available to the users. As a rule, you prevent access to the data while the load process is active, to ensure that the users are presented with an accurate view of data and summaries.

If service level agreements state that users require access virtually 24 hours a day, then revoking and granting access as discussed is not appropriate. You need to consider how you can perform the load while still allowing access and keeping the data as consistent as possible. There are different techniques, depending upon the availability needs of the users.
• Stagger the updates to the different subject areas. Update on different nights of the week (say Tuesday and Wednesday) even though the revised source data might be made available days earlier.
• Use temporary tables (that the users cannot access) for loading, filtering, and summarizing. Make the database unavailable only for the short time it takes to instantiate these as permanent objects.
• Load the data into a separate table and perform all the processing required. These actions are invisible to the user. Then, when all tasks are complete, swap the contents of the temporary table into a database partition. The same technique is employed for the indexes.

Note: With Oracle7, the partition is a view.
In Oracle8, this is a partitioned table.

Slide: ETT Tool Selection Criteria
• Overlap with existing tools
• Availability of meta model
• Supported data sources
• Ease of modification and maintenance
• Required fine tuning of code
• Ease of change control
• Power of transformation logic
• Level of modularization
• Power of error, exception, resubmission features
• Intuitive documentation
• Performance of code

Selecting ETT Tools
Consultants in the field suggest that the selection criteria for ETT tools include the following considerations:
• The overlap with existing tools used in the warehouse development, such as Oracle Designer or other modeling tools
• The availability of the metamodel to other tools or the use of the metamodel from other tools
• The breadth of data sources supported and target data coverage, such as flat files, character formats, and database types
• The mechanism for and ease of defining and altering rules when there is possibly a mixed set of users managing ETT, such as analysts and end users
• The requirement to maintain generated code manually
  Some vendors advise there is no need to modify the generated code; however, you may need to fine-tune it.
  Do you have the in-house expertise to modify the generated code, for example, C or COBOL?
• The control of changes to transformation rule definitions and the ability to handle development and production versions of transformation rules
• The depth, power, and ease of use of the transformation logic; for example, conditional logic, data value filters, row and set-oriented processing, local variables, and input parameters
• The reuse and modularization of existing transforms and filters
• Error reporting, rejected records, and resubmission capabilities
  You need to be able to trap and correct bad data before it is loaded into the warehouse and report corrections to the source system afterward.
• The self-documenting ability
  If the tool is text-based, and not intuitive to navigate, you are going to find it difficult to get the entire picture of the processing performed within the warehouse. A graphical design tool is desirable.
• The performance of generated code

Slide: ETT Tool Selection Criteria (continued)
• Activity scheduling and sophistication
• Metadata generation
• Learning curve
• Flexibility
• Supported operating systems
• Cost
Selecting ETT Tools (continued)
• Activity scheduling
  Can the tool schedule actions to happen and retry if the source or target is not available? Can it report what it has done?
• Scheduling sophistication
  Can it schedule based on time of day, time since last try, time since last success, and time period regardless of last attempt?
• Metadata generation by the transformation tool
  Generated metadata should be intuitive and easily understood by the business user.
• The learning curve of the tool
• The flexibility of the tool
• The operating system under which the tool runs
  Is it supported on all the platforms that you will use for the ETT process?
• Cost

Transportation Tools

WTI Partner                 Product
Informatica Corp.           OpenBridge
Oracle                      SQL*Loader (Direct Path, Direct Path in Parallel)
                            Transparent and Procedural Gateways
                            PL/SQL
                            Precompilers
Platinum Technology, Inc.   InfoPump
                            Platinum Info Transport

Replication Server Utilities

WTI Partner   Product
Oracle        Symmetric and Asymmetric Replication
              Heterogeneous Replication

Gateways and Middleware

WTI Partner                    Product
Brio Technology                DataPrism
Informatica Corp.              OpenBridge
Information Builders, Inc.     EDA/SQL
Oracle                         Oracle Open Gateways
                               Procedural Gateways
                               SQL*Loader
Platinum Technology, Inc.      InfoHub
Prism                          Prism Manager
Software AG of North America   Entire Transaction Propagator
Summary
This lesson discussed the following topics:
• Capturing changed data
• Applying the changes
• Purging and archiving data
• Publishing the data, controlling access, and automating processes
• Identifying tools for transporting data into the warehouse

Practice 13-1 Overview
This practice covers the following topics:
• Identifying a series of statements as true or false
• Answering a series of questions
Practice 13-1
1 Identify whether the following statements are true or false.
  • The data refresh cycle is determined primarily by information technology staff input. (True / False)
  • The load window is the time that the IT group has dictated the data warehouse is available to the users for access. (True / False)
  • Fact data frequently changes. (True / False)
  • Dimension data infrequently changes. (True / False)
2 Name four different techniques for capturing the changes to operational data that is to be loaded into the warehouse.
  _____________________
  _____________________
  _____________________
  _____________________
3 Answer the following questions about updating dimension data.
  a What method of updating dimension data would you employ if you wanted to keep old and new records?
  b What relationship would that map to in an entity relationship model?
4 What server technique can be used to prevent and allow access to data in the warehouse after refresh?

Lesson 14: Leaving a Metadata Trail
Slide: Overview
• Defining DW Concepts & Terminology
• Planning for a Successful Warehouse
• Meeting a Business Need
• Choosing a Computing Architecture
• Planning Warehouse Storage
• Modeling the Data Warehouse
• ETT (Building the Warehouse)
• Analyzing User Query Needs
• Managing the Data Warehouse
• Supporting End User Access
• Project Management (Methodology, Maintaining Metadata)

Slide: Objectives
After completing this lesson, you should be able to do the following:
• Define warehouse metadata, its types, and its roles in a warehouse environment
• Develop a metadata strategy
• Describe in detail each type of warehouse metadata
• List tools for managing metadata
• Describe the Oracle Common Warehouse Metadata architecture (CWM)

Overview
Metadata has already been referenced a number of times in this course. It is critical to every phase of warehouse design and development. This lesson examines the role of warehouse metadata in greater detail.

Note that the “Project Management (Methodology, Maintaining Metadata)” block is highlighted in the overview slide.
Objectives
After completing this lesson, you should be able to do the following:
• Define metadata, its types, and the main roles of metadata in a warehouse environment
• Describe the challenges of managing warehouse metadata
• List tools for managing metadata
• Describe the Oracle Common Warehouse Metadata architecture (CWM)

Defining Warehouse Metadata
• Data about warehouse data and processing
• Vital to the warehouse
• Used by everyone

Defining Warehouse Metadata
Data About Data
Metadata is “data about data.” Warehouse metadata is descriptive data about warehouse data and the processes used in creating the warehouse. Warehouse metadata contains detailed descriptions of the location, structure, and meaning of data. It describes keys and indexes of the data. It contains mapping information, and it documents the algorithms and business rules used to transform and summarize data. Metadata is used throughout the warehouse, from the extraction stage through the access stage.
Vital to the Warehouse
A warehouse with poor metadata is analogous to a filing cabinet filled with folders stored in no particular order. It is very difficult to find your information in the cabinet.
Used by Everyone
Warehouse metadata is used directly or indirectly by everyone involved in creating, maintaining, or using the warehouse: database administrators, analysts, designers, and users. Warehouse metadata answers the following types of questions:
• What information is available, by subject area, and when did we start collecting that data?
• How was this summarization created?
• What queries are available to access the data?
• What business assumptions have been made?
• How do I find the data I need?
• How old is the data?
• What does that value mean?

The Key to Understanding Warehouse Information
• Specifies data location
• Manages data
• Aids use of information
• Describes the data
• Documents the development process
• Provides a record of changes
• Records enhancements over time

Key to Understanding Warehouse Information
Metadata is the component that holds all the information about the data in the warehouse, and presents it as information to the user.
Data becomes information if, and only if, you:
• Have the data
• Know you have it
• Know where it is
• Can access the data
• Can trust the data
Metadata is the key to understanding the warehouse. Metadata helps you locate, manage, and use warehouse information by:
• Specifying the location of data
• Managing data
• Aiding the use of information
• Describing the data
• Documenting the development process
• Providing a record of changes
• Recording enhancements over time

Metadata Users
[Figure labels: IT developers; Metadata repository (ETT, End user, Operational); Warehouse; End users]

Types of Metadata
• End user:
– Key to a good warehouse
– Navigation aid
– Information provider
• ETT:
– Maps structure
– Source and target information
– Transformations
– Context
• Operational:
– Load, management, scheduling processes
– Performance

Metadata Users
In the warehouse, metadata is employed directly or indirectly by all warehouse users for many different tasks.
End Users
The decision support analyst (or user) uses metadata directly.
The user does not have the high degree of knowledge that the IT professional has, so metadata serves as the map to the warehouse information. One measure of a successful warehouse is the strength and ease of use of its end-user metadata.
Developers
For the developer, metadata contains information on the location, structure, and meaning of data, information on mappings, and a guide to the algorithms used for summarization between detail and summary data.

Types of Metadata
End-User Metadata
End-user metadata describes the location and structure of data for user access. It describes data volumes and algorithms. Essentially, this is the floor plan that the knowledge worker uses to navigate through and around the data.
ETT Metadata
Extraction, transformation, and transportation metadata (sometimes called warehouse metadata or ETT metadata) maps the structure of the source systems and describes how the data is to be transformed into its new format for the warehouse. It contains all the rules for extracting, scrubbing, summarizing, and transporting data. This is often the most difficult metadata model to construct.
Operational Metadata
Operational metadata is used by the load, management, and access processes for scheduling data loads or end-user access. It contains information about housekeeping activities, statistics of table usage, and information about every aspect of performance.
Note: The Oracle Method has a specific process for metadata management. In the Oracle Method, end-user metadata is referred to as business metadata.

Developing a Metadata Strategy
• Define a strategy to ensure high-quality metadata useful to users and developers.
• Primary strategy considerations:
– Define goals and intended use.
– Identify target users.
– Choose tools and techniques.
– Choose the metadata location.
– Manage the metadata.
– Manage access to the metadata.
– Integrate metadata from multiple tools.
– Manage change.

Defining Metadata Goals and Intended Usage
• Define clear goals.
• Identify requirements.
• Identify intended usage.

Developing a Metadata Strategy
Like every other aspect of the data warehouse implementation, metadata should be the subject of a well-considered, well-planned strategy. You must ensure that the metadata is of a high quality, provides the right information to users and developers, and is able to take into account the various tools that employ metalayers. Integrating these layers is critical.
Primary Considerations
Among many other considerations, you need to resolve these key issues for the strategy:
• Define the goals and intended use of the warehouse metadata.
• Identify the target users of warehouse metadata.
• Choose tools and techniques for creating and managing metadata.
• Choose the metadata location.
• Manage the metadata.
• Manage access to the metadata.
• Integrate multiple sets of metadata from different tools.
• Manage changes to metadata.
Defining Metadata Goals and Intended Usage
Identify the intention of the metadata you develop. Outline main requirements such as maintaining history, context, and algorithms.
Identifying Target Metadata Users
• Who are the metadata users?
– Developers
– End users
• What information do they need?
• How will they access the metadata?

Choosing Metadata Tools and Techniques
• Tools
– Data modeling
– ETT
– End-user query and analysis
• Database schema definitions
• COBOL copybooks
• Middleware tools

Identifying Target Metadata Users
Consider who, among both developers and end users, is to access metadata. What information do they need? Determine how they will access the metadata.
Choosing Tools and Techniques
Data Modeling Tools
These tools are also known as computer-aided software engineering (CASE) tools. Oracle’s data modeling tool is Designer. Some of these tools are better than others at physically modeling metadata. Consider using a tool that either is specifically designed to model warehouse features or is extensible. For example, can the tool model a star or a snowflake schema?
ETT Tools
Tools for extracting, transforming, and transporting data into a warehouse also generate metadata. These tools are expensive purchases, and may not be employed for the first iteration during development.
However, these tools have the advantage of being able to create and maintain a metadata layer. The tool itself must have all the information needed to take source data to the warehouse, so it is logical that the tool itself contains this layer.
End User Tools
Some tools for query and analysis allow the administrator to create a metadata layer, which describes the structure and content of the data warehouse. An administrator must consider a maintenance issue with tool metadata: each query tool needs its own separately created layer.
Database Schema Definitions
Database schema definitions in a relational database management system offer another potential source of metadata. In an Oracle environment this is the Data Dictionary, which can be extended and enhanced. Most dictionaries of database contents, including the Oracle Data Dictionary, are limited in their immediate value as a metadata tool. Check how far these dictionaries can be extended and enhanced.
Other Techniques
Less-common sources of metadata include:
• File definitions stored in COBOL copybooks
• Middleware tools

Choosing the Metadata Location
• Usually the warehouse server
• Possibly on operational platforms
• Desktop tool with metalayer
[Figure labels: External sources; Operational data sources; Metadata repository; Warehouse]

Managing the Metadata
• Managed by the metadata manager
• Maintained by the metadata architect
• Standards produced by the metadata architect
[Figure labels: External sources; Operational data sources; Metadata repository; Warehouse]
Choosing the Metadata Location
For every process and product employed in the data warehouse environment, metadata exists. Where it is stored is product-specific. The decision about where to place the metadata is often determined by the tool you use to create it. If you are using a relational database management system, then by default the metadata resides in the database and usually on the warehouse server. This is the preferred method. You may locate the metadata on a separate database on another machine. Some ETT and query tools have their own metalayer. Where this is the case you need to ensure that each metalayer can communicate with the others.
Managing the Metadata
Management
Given the critical importance of metadata within the warehouse environment, it must be subject to strict control and management. Metadata is such a vital component in your warehouse implementation that someone should be responsible for managing and maintaining it. It is also important to ensure that creation of or changes to metadata are agreed upon with formal sign-off.
Maintenance
A metadata architect is usually responsible for defining the strategy and implementing metadata. This person is primarily responsible for ensuring that metadata remains up-to-date and consistently reflects any changes within the business infrastructure. If there are different metalayers, the architect must control integration of the metadata among products and tools.
Standards
As with any development project, standards are critical.
Determine standards for every aspect of metadata, from simple naming conventions, to versioning requirements, to documenting complex algorithms. Standards for metadata are emerging within the industry. It is worth monitoring the changes that vendors are considering, as well as the collaborative exercises between large computing companies who are looking to define standards.

Integrating Multiple Sets of Metadata
• Multiple tools may generate their own metadata.
• There are many metalayer integration issues.
• Metadata exchangeability is desirable.

Managing Changes to Metadata
• Different types of metadata have different rates of change.
• Consider metadata changes resulting from refresh cycles.

Integrating Multiple Sets of Metadata
Each of the tools you use in your warehouse environment might generate its own set of metadata. One of the biggest problems with metadata is integrating all of the different layers. Some vendors provide tools that can exchange metadata. For example, you can take metadata from Oracle Designer, populate it using Prism Directory Manager, and use it directly in Oracle Discoverer.
Later in this lesson, we examine how Oracle Common Warehouse Metadata (CWM) addresses the sharing of metadata among Oracle tools.
Managing Changes to Metadata
Metadata changes at different rates according to the type of data stored. For example, models of operational and warehouse databases might remain static for a substantial period of time; however, metadata that maintains information about the warehouse data changes frequently. Each refresh cycle brings in more data, and with it summaries, dimensions, and other structures may change.

Examining Types of Metadata
• ETT metadata
• End user metadata
[Figure labels: Metadata repository (End user, ETT); Warehouse]

ETT Metadata
• Business rules
• Source tables, fields, and key values
• Ownership
• Field conversions
• Encoding and reference table
• Name changes
• Key value changes
• Default values
• Logic to handle multiple sources
• Algorithms
• Time stamp
[Figure labels: External sources; Operational data sources; Extraction; Staging file]

Examining Types of Metadata
Now we will examine more closely the different types of warehouse metadata.
This includes ETT metadata generated during warehouse development, as well as end-user metadata.
ETT Metadata
ETT metadata defines how data from the physical level in the source system maps to the physical level in the data warehouse. ETT metadata also holds:
• The business rules that are applied to the warehouse data
• Names of the source tables, source fields, and source key values
• Information about the owner of the source data
• The rules that are applied to field conversions on a field-by-field basis
• Encoding and reference table conversions
• Field name and key value changes
• Default values assigned to NULL fields
• Logic to extract records from multiple source systems and create records (or a single record) for the load process
• Algorithms that create derived data: Total_Sales / Unit_Sold = Selling_Price
• Time-stamp details
You have seen how complex the ETT process is, and you can now appreciate the importance of keeping a record of exactly what is happening, to which data and when, what the grain is, what is derived, how data is summarized, where it is sourced, and what its target is.

Extraction Metadata
• Space and storage requirements
• Source location information
• Diverse source data
• Access information
• Security
• Contacts
• Program names
• Frequency details
• Failure procedures
• Validity checking information
[Figure labels: External sources; Operational data sources; Extraction]
Extraction Metadata
Extraction metadata contains:
• Space requirement information
• Storage frequency and duration details
• Source location information such as hardware platform information, gateway information, operating system, file system, database, origin and destination information, and loading rules
• Diverse system information with details of the source type such as whether the data is production, internal, external, or archive; structure information such as file type, name, field type, and data granularity
• Access information such as alias names, versions, relationships, data volatility
• Security information, table owners, data owners, authorization levels, audit trail information
• Source data contact and owner details; for example, their names, telephone numbers, e-mail identifiers
• Extraction program names
• Temporary storage details, name of storage file, procedure for removing storage files
• Extraction frequency details
• Extraction failure procedures, with contingency plans and mechanism for handling failed extract
• Extraction validity check information including the procedures to implement, expected results, procedures to follow should the validity check fail, names of the people to contact if the check fails
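As an illustration only, the extraction metadata items listed above could be held as one record per extract program. The sketch below is not part of the Oracle course material: the class name `ExtractionMetadata`, its field names, and the sample values are all hypothetical, and a real repository would store far more detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractionMetadata:
    """One hypothetical repository entry describing a single extract program."""
    program_name: str       # extraction program name
    source_system: str      # platform, database, or file system of the source
    source_type: str        # production, internal, external, or archive
    frequency: str          # extraction frequency, for example "daily"
    staging_file: str       # temporary storage used by the extract
    owner_contact: str      # who to ask about the source data
    failure_contacts: List[str] = field(default_factory=list)
    validity_checks: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of list-valued items that were left empty."""
        gaps = []
        if not self.failure_contacts:
            gaps.append("failure_contacts")
        if not self.validity_checks:
            gaps.append("validity_checks")
        return gaps


# Sample entry with made-up values; no failure contacts were recorded.
entry = ExtractionMetadata(
    program_name="extract_orders",
    source_system="orders database",
    source_type="production",
    frequency="daily",
    staging_file="orders_stage.dat",
    owner_contact="source data owner",
    validity_checks=["row count matches source"],
)
print(entry.missing_items())
```

A check such as `missing_items` is one way to notice, before an extract fails, that nobody has been named to handle the failure.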
Transformation Metadata
• Duplication routines
• Exception handling
• Key restructuring
• Grain conversions
• Program names
• Frequency
• Summarization
[Figure labels: External sources; Operational data sources; Transform; Staging file]

Transportation Metadata
• Method of transfer
• Frequency
• Validation procedures
• Failure procedures
• Deployment rules
• Contact information
[Figure labels: External sources; Operational data sources; Transform; Staging file; Transport; Warehouse; Metadata repository (ETT)]

Transformation Metadata
Transformation metadata contains:
• Duplication routines for elimination, consolidation, ordering, and summarization of data
• Exception handling and validation procedures
• Key restructuring rules
• Granularity conversions
• Transformation program names and locations
• Frequency of the transformation
• Summarization procedures
Transportation Metadata
Transportation metadata contains:
• Data-transfer methods
• Frequency of transportation
• Validation procedures
• Failure procedures
• Rules for deployment
• Contact information, in case of any issue with the data or the movement of data
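The ETT metadata described in this section (source-to-target field names, default values assigned to NULL fields, and algorithms for derived data) can be sketched as a small rule table that a transformation step reads. This is a minimal sketch under stated assumptions, not the method of any particular ETT tool: the rule layout, the source column names `UNITS` and `SALES_AMT`, and the function `transform` are hypothetical, and the derived value assumes that selling price is total sales divided by units sold.

```python
# Hypothetical ETT metadata: one rule per target column.
# "source" is the source field name; "default" replaces a NULL (None) value.
FIELD_RULES = {
    "product_id": {"source": "PRODID", "default": None},
    "units_sold": {"source": "UNITS", "default": 0},
    "total_sales": {"source": "SALES_AMT", "default": 0.0},
}


def transform(source_row):
    """Apply the field rules to one source record and add the derived column."""
    target = {}
    for target_col, rule in FIELD_RULES.items():
        value = source_row.get(rule["source"])
        target[target_col] = rule["default"] if value is None else value
    # Derived data (assumed algorithm): selling price = total sales / units sold.
    units = target["units_sold"]
    target["selling_price"] = target["total_sales"] / units if units else None
    return target


row = transform({"PRODID": 739516, "UNITS": 4, "SALES_AMT": 100.0})
print(row["selling_price"])
```

Because the rules live in data rather than in the program logic, a change to a default value or a field name becomes a change to the metadata, which can then be recorded and dated.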
End-User Metadata
739516 1816 666 15 17.62
• Need to know the context of the table queried
• Associate the metadata description
• Analogous to Oracle Data Dictionary views
[Figure labels: Metadata repository (End user); Warehouse]

Example of End User Metadata
Table Name   Column Name   Data     Meaning
Product      Prodid        739516   Unique identifier for the product
Product      Valid_date    01/97    Last refresh date
Product      Ware_loc      1816     Warehouse location number
Product      Ware_bin      666      Warehouse bin number
Product      Code          15       The color of the product; please refer to table COL_REF for details
Product      Weight        17.62    Packed shipping weight in kilograms

End-User Metadata
If the following data is warehouse data, how much can you deduce?
739516 0197 1816 666 15 17.62
You can deduce nothing tangible from this data other than that it is a series of numbers. It could represent product codes, map coordinates, or employee salaries. The only way to deduce information from this data is to know the context of the table you are querying.
For example, if you are querying the PRODUCT table, metadata may describe its columns as follows:
Table Name   Column Name   Data     Meaning
Product      Prodid        739516   Unique identifier for the product
Product      Valid_date    01/97    Last refresh date
Product      Ware_loc      1816     Warehouse location number
Product      Ware_bin      666      Warehouse bin number
Product      Color_code    15       The color of the product; please refer to table COL_REF for details
Product      Weight        17.62    Packed shipping weight in kilograms
When you associate the data with its metadata description, the data becomes information.

More End-User Metadata Information
• Location of fact and dimensions
• Availability
• Description of contents
• Algorithms for derived and summary data
• Owners of data and telephone number
[Figure labels: Metadata repository (End user); Warehouse]

More End-User Metadata Information
The user never accesses end-user metadata directly. This metadata is viewed from the end user’s tool and is used to navigate around the data. Using this metadata, users can see the data available in the warehouse environment and establish the meaning of elements within the warehouse.
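The PRODUCT example above can be sketched as a simple lookup that attaches a description to each raw value. This is an illustrative sketch only: the dictionary `COLUMN_METADATA` and the function `describe` are hypothetical names, and an end-user tool would normally read such descriptions from its own metadata layer rather than from a hard-coded dictionary.

```python
# Hypothetical end-user metadata keyed by (table name, column name).
COLUMN_METADATA = {
    ("PRODUCT", "PRODID"): "Unique identifier for the product",
    ("PRODUCT", "VALID_DATE"): "Last refresh date",
    ("PRODUCT", "WARE_LOC"): "Warehouse location number",
    ("PRODUCT", "WARE_BIN"): "Warehouse bin number",
    ("PRODUCT", "COLOR_CODE"): "The color of the product; refer to table COL_REF",
    ("PRODUCT", "WEIGHT"): "Packed shipping weight in kilograms",
}


def describe(table, row):
    """Pair every column value in a row with its recorded meaning."""
    return [
        (col, value, COLUMN_METADATA.get((table, col), "no description recorded"))
        for col, value in row.items()
    ]


raw = {
    "PRODID": 739516,
    "VALID_DATE": "01/97",
    "WARE_LOC": 1816,
    "WARE_BIN": 666,
    "COLOR_CODE": 15,
    "WEIGHT": 17.62,
}
for col, value, meaning in describe("PRODUCT", raw):
    print(f"{col:<11}{value!s:<8}{meaning}")
```

A column with no entry is reported as "no description recorded" instead of being silently skipped, so gaps in the metadata are visible to whoever maintains it.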
User metadata describes:
• The physical location of fact and dimension data.
• The availability of the data. Not all data components of the warehouse are available to every user. Some facts may be sensitive to specific user groups.
• The exact description of the contents and business algorithms used to create summary data. Users should never be in a position where they are guessing how a summary has been calculated.
• How derived data has been created, the source data, and any algorithms used.
• Data ownership details, so that if there are any problems with the data content, the user can ask the appropriate person questions about the data and identify or rectify the problems found. This information must supply telephone number, fax number, or e-mail address.
Data ownership details are possibly the most important aspect of end-user metadata. If there is an issue with the data, it must be resolved quickly and appropriately.

Historic Context of Data
• Supports change history
• Maintains the context of information
[Figure labels: Operational; Warehouse; Metadata repository; Structure; Content; years 94 to 98]

Types of Context
• Simple:
– Data structures
– Naming conventions
– Metrics
• Complex:
– Product definitions
– Markets
– Pricing
• External:
– Economic
– Political
Historic Context of Data
Historic data often has business rules and algorithms applied that are different from those applied to current data. In the operational environment, there is only one definition of the database structure at any time. In the warehouse environment, data definitions change over a period of time. It is important to record the dates on which data definitions, names, key values, default values, and algorithms change, so that knowledge workers can analyze the data in the correct context. This ensures you can understand and identify the differences in the context of the data in historical files.
For example, you may store data for 1994–1996 offline. Suppose you want to store 1997 data online. The default value for an amount field changed from a series of 9s to 0s in 1995. You can run a query to identify amounts between 1994 and 1997, but if you do not understand when and how default amounts were recorded, you may not be able to explain or understand why both 9s and 0s are stored, or realize the impact that the change has on calculations or reports.
Another example arises with products such as personal computers that had very few components when they were first available. Consider the changes they have gone through and the many components they contain today. There is a rapid and voluminous history of change.
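The default-value example above can be sketched as date-effective metadata that a query consults before treating an amount as real. The sketch makes several assumptions that are not in the source: the series of 9s is taken to be the six-digit value 999999, the 0s rule is taken to apply from the start of 1995, and the names `AMOUNT_DEFAULT_HISTORY`, `default_for`, and `real_amounts` are hypothetical.

```python
# Hypothetical change history for the amount field's default value,
# oldest first: (first year the rule applies, default value in force).
AMOUNT_DEFAULT_HISTORY = [(1994, 999999), (1995, 0)]


def default_for(year):
    """Return the default value that was in force for the given year."""
    current = None
    for start_year, default in AMOUNT_DEFAULT_HISTORY:
        if year >= start_year:
            current = default
    if current is None:
        raise ValueError(f"no default recorded for {year}")
    return current


def real_amounts(rows):
    """Keep only amounts that are not the default for their own year."""
    return [amount for year, amount in rows if amount != default_for(year)]


# (year, amount) pairs with made-up values.
rows = [(1994, 999999), (1994, 120), (1996, 0), (1996, 999999), (1997, 75)]
print(real_amounts(rows))
```

Without the change history, a report would either count the 1994 value of 999999 as a real amount or discard the 1996 value of 999999, which under the later rule is a genuine figure.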
Types of Context
The context of data in the warehouse may be:
• Simple contextual information such as data structures, data coding, naming conventions, and data metrics
• Complex contextual information such as product definitions, market territories, pricing, packaging, and rule changes
• External contextual information such as economic forecasts, political information, and competitive information

Additional Metadata Content and Considerations
• Summarization algorithms
• Relationships
• Stewardship
• Permissions
• Pattern analysis
• Reference tables

Additional Metadata Content and Considerations
Some of these points may have already been mentioned.
Summarization Algorithms
You have seen that the warehouse contains fully detailed fact records and summary records that are created according to predefined algorithms. The meaning of the summaries is maintained in the metadata.
Relationships
Relationships show how tables are related, their constraints and rules, and the cardinality of data. This relationship information is maintained in the metadata. This information is documented along with ownership information and text descriptions of tables and keys.
Stewardship
Metadata must identify the originator of data. Bear in mind that the data in the warehouse has come from many different source systems, with different suppliers, different owners, and different transformation issues.

Permissions
Metadata should maintain, for each record, information about who can access the record and who is authorized to grant permissions on it.

Access Pattern Analysis
Metadata should be able to record frequently accessed data, in order to tune and optimize performance accordingly. In turn, this may identify data accessed infrequently or not at all. You should remove data and summaries that are not accessed.

Reference Tables
The contents of these tables must be monitored and maintained with information that relates to their effective date.

Metadata Management Tools

WTI Partner                  Product
Carleton Corp.               Carleton Passport-Metadata
Evolutionary Technologies    ETI Repository (ObjectStore)
Hewlett Packard              IW Guide
Informatica                  Informatica Repository
Information Advantage        Meta Agent
Oracle                       Designer, Data Mart Suite, OADW/Warehouse Builder
Platinum Technology, Inc.    Data Dictionary/Solution Repository (DD/S), Data Shopper, DB Excel, Platinum Repository
Prism Solutions              Prism Directory Manager
Sagent

There are two categories of metadata management tools:
• Generic repository tools, for managing enterprisewide metadata, such as:
  – Data Shopper from Platinum Technology
  – Data Dictionary from Brownstone/Platinum
  – Manager Link from Manager Software Products
• Tools specifically for data warehouses and data marts, such as:
  – Prism Directory Manager from Prism Solutions
  – Meta Agent from Information Advantage
  – Passport from Carleton Corporation
  – SmartData Warehouse from Intersolv

[Figure: Common Warehouse Metadata. Labels: Operational data, ERP data, External data; Data integration; Warehouse, Marts; Information delivery; Report, Query, Analyze, Mine; Analytic applications; Metadata; Design and Administration.]

[Figure: Common Warehouse Metadata, Future. Labels: Warehouse Builder, Discoverer, Oracle8i Server, Express Server, Common metadata.]
Common Warehouse Metadata

Common Warehouse Metadata (CWM) is Oracle’s open standard for data warehousing metadata. CWM incorporates both technical and business metadata and covers all aspects of warehousing. CWM will enable tighter integration of metadata among Oracle’s products as well as across industry-leading tools from Oracle partners, resulting in reduced implementation complexity and greater user productivity.

To enable truly open data warehouse functionality, Oracle submitted a Request for Proposal for a Common Warehouse Metadata Interchange standard to the Object Management Group (OMG). The Common Warehouse Metadata Interchange (CWMI) standard will enable the interchange of warehouse metadata among data management and analysis tools, and among warehouse metadata repositories.

One Meaning
Oracle acquired One Meaning, a company specializing in metadata. One Meaning’s metadata technology provides the means for metadata interoperability and transfer, reduces the cost of managing information resources, and enhances the value of stored proprietary information. Oracle’s metadata strategy will provide essential integration and continuity, and add ongoing value to data warehousing implementations.

Summary

This lesson discussed the following topics:
• Definitions
• Integration
• Contents
• Storage
• Creation
• Selection
• Tools
Summary

This lesson discussed the following topics:
• The definitions of the two main types of metadata
• The problems associated with metadata in the warehouse
• Metadata contents
• How metadata might be created
• Where metadata may be stored in a warehouse environment
• Selection criteria for metadata management tools
• A list of metadata management tools available from WTI partners and Oracle

Practice 14-1 Overview

This practice covers the following topic: Answering a series of short questions

Practice 14-1

1 Why is metadata important to the following people?
  a Users who are accessing the data warehouse
  b IT staff developing ETT routines
2 Name two techniques you might employ to create metadata.
3 Name two roles within the data warehouse development team who have responsibility for metadata.
4 What is the issue with integration and metadata?
5 What is important about the context of data?
6 Name the Oracle tools you may use to develop metadata.

Lesson 15: Supporting End-User Access
Overview

[Figure: Course road map. Blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Supporting End User Access (highlighted); Managing the Data Warehouse; Project Management (Methodology, Maintaining Metadata).]

The previous lesson covered leaving a metadata trail. This lesson discusses supporting end-user access. Note that the “Supporting End User Access” block is highlighted in the course road map above.

Specifically, this lesson introduces the concept of business intelligence. The lesson discusses the discovery model used by mining tools, and the reasons enterprises are looking at data mining solutions for discovery of information.
Objectives

After completing this lesson, you should be able to do the following:
• Describe the importance of business intelligence
• Identify multidimensional query techniques
• Identify where data mining might be employed in a warehouse environment
• Identify data mining tools

What Is Business Intelligence?

“Business Intelligence is the process of transforming data into information and through discovery transforming that information into knowledge.” (Gartner Group)

The purpose of business intelligence is to convert the volume of data into value for the end users.

[Figure: Stages (4): Data, Information, Knowledge, Decision, plotted against Volume and Value.]

Business Intelligence

Companies require business intelligence to direct business process improvement and monitor time, cost, quality, and control.

Definition
Howard Dresner, analyst with the Gartner Group, defines business intelligence as a process of turning data into information and, through iterative discoveries, turning that information into business intelligence.
The key is that business intelligence is a process: cross-functional, in line with current management thinking, and not presented in IT terms.

Purpose of Business Intelligence

The purpose of business intelligence is to convert large volumes of data into information, linking bits of information together within a decision context that turns it into knowledge that can be used to aid decision making. This can be accomplished through the use of data access tools and techniques that use organized collections of data, systems, and applications by which organizations gather and interpret relevant information about the business and turn it into highly quantifiable plans, policies, procedures, and metrics.

The value chain begins with the data resource. Data is defined as facts and figures. Information is data processed and interpreted into a meaningful framework. It is a set of data in context that is relevant to one or more people at a point in time or for a period of time. Knowledge refers to the meaning and understanding that result from users processing information.

For knowledge to be useful in the decision-making process, successful business intelligence requires a high-quality integrated resource, high-quality information preparation and sharing, and a high-quality human resource to discover and accumulate knowledge.

Multidimensional Query Techniques

[Figure: Multidimensional Query Techniques. Dimensions: Product, Time, Geography. Techniques shown: slicing, dicing, drilling down, drilling up, drilling across, pivoting.]

These techniques are standard in modern query tools that present data in a multidimensional manner. The following defines some of the common multidimensional query techniques.

Slicing
Slicing means limiting the view of data to a selection along one dimension, such as a single consultant, region, or cost center. An example of a slice of data is a view of the data for a regional manager across all products and time periods.

Dicing
Dicing is slicing in multiple directions. You are making the selection along more than one dimension. In dicing, you can refine the selection by adding or removing more of the data cube.

Drilling
Drilling is being able to open up a subset of data that corresponds with a particular value of a dimension. It is a term used to describe the action of moving down to further levels of data detail or up to higher levels of summary data.

Drilling Down
Drilling down is a mechanism that enables the user to examine the detail for a summary value. The user may examine where rackets were sold, to what companies, and how many items any individual purchased.

Drilling Up
Drilling up is the ability to query detail records and navigate up to higher-level summary records.

Drilling Across
Drilling across is the ability to query from one fact table to another in a single report.

Pivoting
Pivoting data is changing the axes along which you orient your data. It also refers to the ability to change the organization of rows and columns in a tabular report.
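The techniques defined above can be sketched with a few lines of Python over an in-memory list of fact rows. This is only an illustration under stated assumptions: the sample data, the column names, and the helper functions select, roll_up, and pivot are hypothetical and do not correspond to any particular query tool.

```python
from collections import defaultdict

# Hypothetical fact rows: product, region, quarter, month, sales.
COLUMNS = ("product", "region", "quarter", "month", "sales")
FACTS = [
    ("rackets", "East", "Q1", "Jan", 100),
    ("rackets", "East", "Q1", "Feb", 120),
    ("rackets", "West", "Q1", "Jan", 80),
    ("balls", "East", "Q1", "Jan", 40),
    ("balls", "West", "Q2", "Apr", 60),
    ("rackets", "West", "Q2", "Apr", 90),
]
ROWS = [dict(zip(COLUMNS, fact)) for fact in FACTS]


def select(rows, **criteria):
    """Slice (one dimension fixed) or dice (several dimensions fixed)."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def roll_up(rows, *dims):
    """Total sales grouped by the given dimensions.

    Grouping by a coarser level (quarter instead of month) is drilling up;
    regrouping a selection by a finer level is drilling down.
    """
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Arrange totals as a table; swapping the arguments pivots the axes."""
    table = defaultdict(dict)
    for (row_value, col_value), total in roll_up(rows, row_dim, col_dim).items():
        table[row_value][col_value] = total
    return dict(table)


east_slice = select(ROWS, region="East")                      # slicing
east_rackets = select(ROWS, region="East", product="rackets")  # dicing
by_quarter = roll_up(ROWS, "quarter")                          # summary level
q1_by_month = roll_up(select(ROWS, quarter="Q1"), "month")     # drilling down
product_by_region = pivot(ROWS, "product", "region")
region_by_product = pivot(ROWS, "region", "product")           # pivoted view
```

Both pivoted tables are built from the same rows already held in memory; only the orientation of the result changes.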
Pivoting enables the user to view the data along different dimensions without requerying the database itself.

OLAP has other associated query techniques, some of which are vendor dependent. For example, top/bottom analysis selects the top or bottom ranges of data based on criteria to perform exception reporting.

Categories of Business Intelligence Tools
• Reporting tools
• Query tools (data access)
• Online analytical processing (OLAP) tools
• Analytical suites
• Data mining tools
• Analytical applications

Evolution of Reporting

Mainframe            Client-server               Multitier enterprise reporting
Batch oriented       End user empowered          Easy to use
IS controlled        Reduced IS manageability    Manageable
3GL-based            Expensive                   Scalable
Not user-specific    Localized                   Accessible
Inflexible           IS intensive
Categories of Business Intelligence Tools

According to Wayne Eckerson from the Data Warehousing Institute (Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, Volume 4, Number 1, Spring 1999), the categories of business intelligence tools are:
• Reporting tools
• Query tools (data access)
• Online analytical processing (OLAP) tools
• Analytical suites
• Data mining tools
• Analytical applications

Reporting Tools
These tools allow users to produce canned, graphic-intensive, sophisticated reports based on the warehouse data. The evolution of reporting is described below.

Mainframe
In the mainframe era, batch reporting generated large, cumbersome reports. These reports were constructed in time-consuming, difficult-to-use 3GL programming environments.

Client/Server
The advent of the PC brought rich graphical user interfaces, leading to the introduction of much more productive 4GL reporting tools. This, combined with the advent of client-server computing, began to deliver much more user-friendly and flexible reports.

Enterprise Reporting
We are now in the enterprise reporting era. This new reporting architecture delivers the combined benefits of mainframe and client-server environments.

Oracle Reports is an enterprise reporting tool for developers to build and disseminate sophisticated, high-quality reports. Users view reports dynamically generated by the application server reporting engine. Users can access reports from anywhere in the enterprise using a Web browser. Oracle Reports takes advantage of the scalability of the Internet computing model. The powerful reports server helps you to easily deploy your applications in a multitier environment that uses an advanced caching technology to provide dynamic load balancing.
[Figure: Oracle Discoverer 3.1. Components: User Edition, Viewer Edition, Administration Edition, and the End User Layer, over a transaction database or data warehouse/mart.]

Discoverer for the Web
• View workbooks using a Web browser
• Business intelligence tool that provides information anywhere and at any time
• Cost-effective

Query Tools
These tools enable users to explore a data source using intuitive ad hoc queries. The tools provide the means for pulling the desired information from a database. They are typically SQL-based tools and allow a user to define data in end-user language.

Oracle Discoverer is Oracle’s award-winning ad hoc query, reporting, and analysis tool designed by end users for end users. Oracle Discoverer for the Web makes it easy for any user to leverage information in data warehouses, data marts, and relational databases using a Web browser. It features industry-leading ease of use and performance features, such as query prediction and automatic summary management, which provide time and cost savings for the enterprise. The components of Oracle Discoverer 3.1 are described below.

Discoverer User Edition
As an end user, you use this component to perform ad hoc queries, generate reports, and publish information stored in the online dictionary.
Discoverer Administration Edition
Business and information technology (IT) data administrators use this component to create, maintain, and administer data and the users’ interaction with that data.

End-User Layer
This component, a server-based meta layer, hides the complexity of the underlying relational database so that you can interact with the online dictionary using ordinary business terms.

Discoverer Viewer Edition
As an end user, you use this component to view your data using a Web browser. Using the Discoverer Viewer, you can view the workbooks that you have created in the User Edition, through the Internet. You can use Internet Explorer 4.0 or Netscape 4.05 or higher browsers to access Discoverer Viewer. It takes advantage of the existing Discoverer installations, thus providing easy access at any time to the workbooks stored in the database. Because of the consistent user interface between the User Edition and the Viewer Edition, users can easily work with their stored workbooks in Discoverer Viewer without any additional training.

The following features are available in Discoverer Viewer:
• View workbooks stored in the database
• Use drilling
• Refresh data
• Print reports
• Provide parameters to view specific data
• Customize the execution of queries

[Figure: Online Analytical Processing (OLAP). Sales data by Prod, Market, and Time, shown as a product manager view, regional manager view, financial manager view, and ad hoc view.]
Advanced Analytical Tasks
• Comparative and relative analysis
• Exception and trend analysis
• Time series analysis
• Forecasting
• What-if analysis
• Modeling
• Simultaneous equations

Online Analytical Processing (OLAP) Tools
OLAP tools provide a multidimensional view of data, allowing users to easily navigate through multiple dimensions (such as customer, organization, and time) and hierarchies within dimensions (such as year, quarter, and month). The different types of tools in this category are multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP). They were discussed in Lesson 6.

Oracle Express
Oracle Express provides sophisticated online analytical processing (OLAP) analysis through its advanced calculation engine and multidimensional data cache. The Express multidimensional data model is optimized for the query and analysis of corporate data, such as sales, marketing, financial, manufacturing, or human resource data.

Oracle Express provides a native multidimensional data model for optimal OLAP power and performance. The multidimensional model:
• Is specifically designed for analysis
• Inherently reflects the way users think about their businesses
• Ensures that end users can efficiently analyze data in a structured or ad hoc fashion, without requesting special programs from IS personnel

Through its built-in analytic functions, Oracle Express provides the answers to a range of complex analytic questions.
Oracle Express enables users to perform advanced analytical tasks, such as:
• Comparative and relative analysis
• Exception and trend analysis
• Modeling
• Forecasting
• Time-series analysis
• What-if analysis

It delivers powerful analytical capabilities to any Web browser, enabling sophisticated analysis over corporate intranets and the Internet.

Analytical Suites
• Enterprise business intelligence (EBI) toolsets:
  – Web-enabled query, reporting, and analysis tool that runs on a robust application server
  – EBI toolset tightly integrates query, reporting, and analysis capabilities within a single tool
  – Shares a common look and feel
• Business portals:
  – EBI toolset with a Yahoo!-like user interface
  – Flexible repository handles structured and unstructured data objects
Source: Data Warehousing Institute

Data Mining Tools
• Identify patterns and relationships in data that are often useful for building models that aid decision making or predict behavior
• Data mining uses technologies such as neural networks, rule induction, and clustering to discover relationships in data that are hidden, not apparent, or too complex to be extracted using statistical techniques, and to make predictions.
Analytical Suites
According to Wayne Eckerson from the Data Warehousing Institute, the tools in the analytical suites are as follows.

Enterprise Business Intelligence (EBI) Toolsets
An EBI toolset is a Web-enabled query, reporting, and analysis tool that runs on a robust application server instead of a desktop machine. An EBI toolset tightly integrates query, reporting, and analysis capabilities within the context of a single tool as opposed to a suite of tools. Each analytical “modality” shares a common look and feel and passes data seamlessly to each of the other modalities, as required. Web and client-server versions offer equivalent functionality.

Business Portals
A business portal is an EBI toolset with a Yahoo!-like user interface. This tool has a flexible repository that handles structured and unstructured data objects, and a publish or subscribe engine that delivers reports to users on a customizable basis. (Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)

Data Mining Tools
Data mining tools identify patterns and relationships in the data that are often useful for building models that aid decision making or predict behavior. Data mining uses technologies such as neural networks, rule induction, and clustering to discover relationships in data that are hidden, not apparent, or too complex to be extracted using statistical techniques, and to make predictions.

Note: Data mining will be covered in the next section.
Analytical Applications
• Packaged analytical applications have predefined:
  – Extraction feeds and transformation routines for a specific data source
  – Data model, application-specific report templates, and a custom end-user interface
• Custom analytic applications are workbenches that enable developers to quickly create analytic applications from coarse-grained components, including user interface widgets, data access and analysis components, and report layouts.
Source: Data Warehousing Institute

Analytical Applications
According to Wayne Eckerson from the Data Warehousing Institute, “Analytical applications incorporate business intelligence tools and a data warehouse or data mart to deliver analytical capabilities within a well-defined business process. An analytical application uses a custom interface to step users through a set of data collection and analysis tasks that lead up to a decision. The analytical application also provides the context for users to act on their business decisions, whether it involves emailing a document, updating a database, or initiating a workflow.” (Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)

The tools in the analytical applications category are described below.
Packaged Analytic Application
Packaged analytic applications come with predefined extraction feeds and transformation routines for a specific data source, a predefined data model, application-specific report templates, and a custom end-user interface.

Custom Analytic Application
Custom analytic applications are workbenches that enable developers to quickly create analytic applications from coarse-grained components, including user interface widgets, data access and analysis components, and report layouts. (Wayne Eckerson, Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)

Definition of Data Mining

“Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns, trends, relationships, and rules.”

Data mining is also known as:
• Knowledge discovery
• Data surfing
• Data harvesting

Uses of Data Mining
• Customer profiling
• Market segmentation
• Buying pattern affinities
• Database marketing
• Credit scoring and risk analysis
Data Mining in a Warehouse Environment

Definition of Data Mining
Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns, trends, relationships, and rules.

The purpose of data mining is to enable proactive business decisions. Data mining tools empower the user to search for patterns of information in data. Data mining is far less user-directed and relies upon specialized algorithms, such as fuzzy logic, neural networks, genetic algorithms, and induction, that correlate information from the data warehouse and assist in trend analysis.

Data mining also refers to a process rather than a technology. The goal of that process is to explore large amounts of data to discover new trends, relationships, and categories in that data. Data mining is also referred to as knowledge discovery, data surfing, or data harvesting.

Uses of Data Mining
Data mining has many applications:
• Store owners can use it to determine and market products according to user classification.
– Affinities
– Purchasing patterns
– Goods purchased (basket analysis)
• Business analysts can use it to determine patterns of product purchases.
– Fraud detection
– Profile buying patterns
– Determining high-risk and low-risk customers
• Credit card suppliers can use it to target an audience for a new card service.
• Financial institutions can use it for credit scoring and risk analysis.
Data mining techniques can be used by anyone who needs to:
• Develop strategies for marketing
• Target mail lists
• Adjust inventory levels
• Minimize operational and financial risks to the business
• Keep costs to a minimum
• Find out something new and never before considered

Functions of Data Mining
• Discovers facts and data relationships
• Finds patterns
• Determines rules
• Retains and reuses rules
• Presents information to users
• May take many hours
• Requires knowledgeable people to analyze the results

Functions of Data Mining

Discovery
Data mining queries discover facts and data relationships using techniques such as association, frequency of occurrence, and sequential patterns.

Rule Retention
Data mining techniques learn patterns and create rules to describe the patterns; the rules are retained for reuse against larger sets of data for further analysis.

Self-Motivating
Some data mining queries require little human intervention, but do need guidance. Certain data mining models, such as cluster analysis, do not require any guidance at all.
On the whole, data mining tasks are a guided discovery of data; that is, you have a notion of what it is you are trying to find out (information about debtors or selling patterns, for example).

Expert Analysis
The results of a query, once presented, need knowledgeable people to analyze and use them correctly.

Comparing DSS and Data Mining Queries
• DSS queries:
– Based on prior knowledge and assumptions
– User-driven
• Data mining queries:
– Require domain-specific knowledge to interpret data
– User-guided

Comparing DSS and Data Mining Queries
Decision support queries are driven by a user who knows how to pose a question in order to achieve specific results. The user knows what the question is and requires the DSS application only to supply the answer. Therefore, the user applies known parameters to the query prior to execution, in order to achieve a result based on those known parameters.

Data mining queries differ in that the user provides only some initial guidance. They require users to have the domain-specific knowledge to interpret the data. Data mining can find answers to problems and information you have not considered before.
Artificial Neural Networks
• Predictive model that learns
• Developed from understanding of the human brain
• Multiple regression and other statistical techniques
[Figure: neural network with numbered nodes arranged as inputs, a hidden layer, and outputs]

Decision Trees
• Represent decisions
• Generate rules
• Classify
[Figure: decision tree with nodes labeled Annual salary 100,000, Annual outgoing <10,000, and Annual credit > 50,000, and leaves labeled Good and Bad]

Data Mining Techniques

Artificial Neural Networks
Neural networks are nonlinear predictive models that learn through training. They resemble biological neural networks in structure.

A neural network is a network of processors, each of which contains an amount of local memory. The units are connected by communication channels carrying numeric data, encoded by various means. The processors operate only on their local data and on the inputs they receive through the communication channels.

The field of neural networks arose from the development of artificial intelligence systems (among other technologies) capable of sophisticated computations similar to those performed by the human brain.
Many of the improvements in neural network technology have followed from a much improved understanding of how the human brain functions.

Most neural networks have a training rule whereby the weights of connections are adjusted based on the data; that is, they learn from examples.

Neural networks are employed by statisticians, engineers, scientists, and neurophysiologists to explore brain function. Neural networks can be used for classification, clustering, modeling, determining sequences, and multiple regression and other statistical techniques.

Decision Trees
These are tree-shaped structures that show the route taken by a certain decision, or a series of decisions. Each decision generates a rule to classify the data that it returns. A bank may use a decision tree to determine the worthiness of a customer requesting a loan (is the customer a good or a bad risk?). This is classification. Some tools that support decision tree technology (rather than data mining technology) can display decision tree results graphically.

Other Techniques
• Genetic algorithms based on evolution theory
• Statistics such as averages and totals
• Nearest neighbor to find associations
• Rule induction applying IF-THEN logic
• Experiment with different techniques
Data Mining Techniques (continued)

Genetic Algorithms
These are essentially optimization techniques using processes such as natural selection and genetic combination. The design is based on the concepts of evolution (Darwin’s theory of the survival of the fittest) and mutation theories.

Statistics and Quantitative Analysis
Data mining uses statistics based on linear models that may be quite complex, such as averages, distributions, ranking, regression, clustering, and other statistical techniques. There is an overlap between the fields of neural networks and statistics.

Nearest Neighbor
This technique is used for finding associated records or clusters of records. It classifies each record in a selected set of data, based on a combination of the classes of the K records most similar to it, where K is greater than or equal to one.

Rule Induction
Data mining can extract useful IF-THEN rules based on the statistical significance of the data. Rule induction allows you to find data associations and sequences, and employs decision tree techniques for prediction and analysis.

No single mining technique can be recommended in isolation. The data to be analyzed varies between businesses; the hypotheses tested are diverse. You should consider employing as many techniques as the tool allows; you must experiment.

Note: There are many other techniques used in data mining. This is just a sample selection.
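The nearest neighbor description above can be illustrated with a minimal Python sketch. It is not taken from any particular data mining tool: the function name `knn_classify`, the Euclidean distance measure, the majority-vote rule, and the customer figures are all assumptions chosen for illustration.

```python
from collections import Counter
from math import dist


def knn_classify(record, labeled_records, k=3):
    """Classify `record` by majority vote among its k most similar labeled records."""
    if k < 1:
        raise ValueError("k must be greater than or equal to one")
    # Similarity is assumed to be Euclidean distance: smaller distance, more similar.
    nearest = sorted(labeled_records, key=lambda item: dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customers: (annual salary, annual outgoings) in thousands, with a risk class.
customers = [
    ((120, 8), "good"),
    ((110, 9), "good"),
    ((95, 12), "good"),
    ((40, 30), "bad"),
    ((35, 28), "bad"),
    ((50, 45), "bad"),
]

print(knn_classify((100, 10), customers, k=3))
print(knn_classify((45, 35), customers, k=3))
```

In practice the attributes would usually be scaled to comparable ranges first, because an attribute with large values can otherwise dominate the distance.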
Associations
Which items are purchased in a retail store at the same time?

Sequential Patterns
What is the likelihood that a customer will buy a product next month, if he buys a related item today?

Typical Data Mining Results

Associations
Data mining can discover associations between items, that is, how items relate to each other. It answers questions such as, “Which items are purchased in a retail store at the same time?” For example, shirts and ties, eyeliner and mascara, or cameras and televisions. However, this result does not determine the rationale behind the association.

Sequential Patterns
Data mining can describe associations over some period of time. It can answer questions such as, “What is the likelihood that a customer will buy a product in the future, if he buys a related item today?” For example, personal computer today, printer next month; or a set of tools today and the toolbox to put them in tomorrow.

Patterns involving time emerge. For example, if a customer buys a set of tools today, there may be a pattern that shows the percentage likelihood of the toolbox being purchased tomorrow, within one week, or within two weeks. This is a good way for a retail store to determine a marketing campaign.
Classification results enable the store to target the correct customer at the same time.

Classifications
Determine customers’ buying patterns, and then find other customers with similar attributes that may be targeted for a marketing campaign.

Modeling
Use factors, such as location, number of bedrooms, and square footage, to determine the market value of a property.

Typical Data Mining Results (continued)

Classification
Data mining can divide items into groups. Determine customers’ buying patterns, and then find other customers with similar attributes that may be targeted for a marketing campaign, for example, credit card users with balances within 10% of their maximum credit limit, or people employed in the construction industry.

Modeling
Data mining can map a set of input values to a single output value. For example, you may use factors such as location, number of bedrooms, and square footage to determine the market value of a property.
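To make the association result described under Typical Data Mining Results more concrete, the following Python sketch counts how often pairs of items appear in the same basket. The terms support (share of all baskets containing both items) and confidence (share of baskets containing the first item that also contain the second) are standard association-rule measures rather than terms used in this lesson, and the function name `pair_rules` and the basket data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets):
    """Return {(a, b): (support, confidence)} for the rule 'a is bought, so b is bought'."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        # Count every unordered pair of distinct items bought together.
        pair_counts.update(combinations(items, 2))
    total = len(baskets)
    rules = {}
    for (a, b), together in pair_counts.items():
        rules[(a, b)] = (together / total, together / item_counts[a])
        rules[(b, a)] = (together / total, together / item_counts[b])
    return rules


# Hypothetical point-of-sale baskets.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"eyeliner", "mascara"},
    {"mascara"},
]

rules = pair_rules(baskets)
support, confidence = rules[("shirt", "tie")]
print(f"shirt -> tie: support={support:.2f}, confidence={confidence:.2f}")
```

As the lesson notes, a high confidence value shows only that the items are bought together; it does not explain why.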
Oracle Data Mining Partners
• Angoss International, Ltd.
• DataMind Corp.
• Datasage, Inc.
• Information Discovery, Inc.
• SPSS Inc.
• SRA International, Inc.
• Thinking Machines Corp.

Oracle Data Mining Partners

WTI Partner and Product

Angoss International, Ltd.
• KnowledgeSeeker IV is a data mining software tool that uses a unique cross-referencing process to enable businesses to analyze varied and disparate databases.

DataMind Corp.
• DataMind DataCruncher provides fast, accurate data mining capabilities for making sense of corporate data.

Datasage, Inc.
• DataSage Mining Manager provides a robust infrastructure to develop, deploy, and manage enterprise data mining applications ensuring a complete solution that will increase corporate profitability and reduce the time to ROI for data mining projects.

Information Discovery, Inc.
• Data Mining Suite is an integrated set of products providing powerful, complete, and comprehensive solutions for large-scale enterprisewide decision support and data mining.
• Rapid Pilot Data Mining is designed for Fortune 2000 companies wanting to accelerate the data-mining introduction process and quickly gain notable results.
• Knowledge Access Suite has delivered the first and only set of products ever to provide business users with a gateway to knowledge predistilled from raw data and stored in a pattern base.

SPSS Inc.
• SPSS is an open, best-of-breed data mining solution that delivers each of the four A’s of data mining: access, analysis, action, and automation.

SRA International, Inc.
• KDD Explorer is an easy-to-use data mining toolset that assists business analysts in the discovery and analysis of novel patterns in terabyte-sized databases.

Thinking Machines Corp.
• LoyaltyStream is a complete solution that includes specific applications, software, user training, and expert consulting services for understanding customer behavior, building mining marts, building predictive models, and deploying models throughout an enterprise.
Summary
This lesson covered the following topics:
• Describing the importance of business intelligence
• Identifying where data mining might be employed in a warehouse environment
• Identifying data mining tools

Practice 15-1 Overview
This practice covers the following topics:
• Identifying the type of analysis based on a description of a scenario
• Matching the category of information with a list of descriptions
• Identifying data mining techniques

Practice 15-1

1 In the following scenarios, choose the type of analysis that most accurately defines the scenario. The types of analysis from which you may choose are:
– Query and reporting
– Multidimensional/OLAP
– Data mining
– Drill-down and pivot
– Calculations and derived data
– Spreadsheet
– Modeling, time-series and financial
– What if

Scenario (supply the Type of Analysis for each):
a. Show start date and salary grade for all employees reporting to Clare Maury.
b. Highlight all orders above $30,000.00.
• Drill from product totals to individual orders
• Look at a copy of the invoice
c. Show product sales in each region as a percentage of the total sales in that region.
d. Did the $2 million promotion increase sales?
e. How many people to hire, when to hire them, and where to locate them.
f. If we lowered prices, would our overall revenue increase?
g. Find me the relationship between X and Y.
h. Show me all the products that are currently back-ordered.
i. What is the 13-week moving average of sales?
j. Projecting costs and allocating overhead based on head count, sales forecasts, and consumer price index (CPI).

2 For the following phrases and sentences, determine which category each of them belongs to. You may choose from the following list.
• Data
• Information
• Knowledge
• Decision

Description (supply the Category for each):
– Mary lives in Belmont Shores, California.
– Point of sale (POS)
– AppleTree juice is bought 45% of the time that Crystal Geyser juice is bought.
– Let us promote Crystal Geyser juice on the East Coast of the United States in stores.
– Demographic
– Customers of the upper middle class will use 10% of their annual income during the Christmas holiday season.

3 The diagram below illustrates an example of data mining. The technique that it uses is called _________________.
[Diagram labels: Age, Region, Loyal, Call Rate, Lost, Service]

4 The description below describes a data mining technique. What is the technique used?
1. If the vehicle has a 2-door frame AND
2. If the vehicle has at least six cylinders AND
3. If the buyer is less than 40 years old AND
4. If the cost of the vehicle is > $35,000 AND
5. If the vehicle color is red, THEN
6. The buyer is likely to be male.
Lesson 16: Web-Enabling the Warehouse

Overview
[Course road map: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Supporting End User Access (highlighted); Managing the Data Warehouse; Project Management (Methodology, Maintaining Metadata)]

Overview
The previous lesson covered supporting end-user access. This lesson discusses Web-enabling the warehouse, which is another aspect of supporting end-user access to the warehouse. Note that the “Supporting End User Access” block is highlighted in the course road map on the facing page.

Specifically, this lesson discusses how to take advantage of the Web to deploy data warehouse information. It addresses internal and external access, as well as the advantages of Web-enabling a data warehouse. The lesson outlines the steps involved in deploying a Web-enabled data warehouse.
Challenges in deploying a Web-enabled data warehouse are also discussed.

Objectives
After completing this lesson, you should be able to do the following:
• Explain how the Web can expand data warehouse usage
• Describe the issues involved in putting a data warehouse on the Web
• Outline the requirements for evaluating Web-based query and analysis tools

Benefits of Web-Enabling a Data Warehouse
• Better-informed decision making
• Lower costs of deployment and management
• Lower training costs
• Remote access
• Enhanced customer service and improved image as a technology leader
• Greater collaboration among users

Accessing the Warehouse Over the Web
A Web-enabled data warehouse is a means of providing access and query availability to your data warehouse by using a standard Web browser. It allows your users to perform ad hoc queries against the database using their choice of Web browsers.

The primary purpose of Web-enabling a data warehouse is to give remote offices and mobile professionals the information they need to make tactical business decisions.
Companies are increasingly aware that the Internet can help them reach out to new markets and increase their value to customers, particularly by offering individualized, one-to-one marketing.

Benefits of Web-Enabling a Data Warehouse
Deploying data warehouse applications on the Web is becoming increasingly popular. The benefits of a Web-enabled data warehouse are:
• Better-informed decision making: Users with access to more comprehensive information and analyses can make better decisions, with the results directly affecting the organization’s bottom line.
• Lower costs of deployment and management: Web-based deployment serves many clients from a single location, reducing the number of installations and upgrades needed, and reducing the cost of support.
• Lower training costs: After a user is trained in the use of a Web browser, the user is equipped to access and use most of the resources on the corporate intranet.
• Improved return on investment (ROI): Increasing the use of the data warehouse spreads its value among more users and shortens the time for data warehouse ROI.
• Remote access: The ability to put information to use out of the office is greatly expanded, because through the Web, users can access the information anytime and anywhere.
• Enhanced customer service and improved image as a technology leader: Up-to-date information can be made available immediately to a wide range of users, allowing them to help themselves and get an immediate response to their questions.
• Greater collaboration among users: Users can share information and analysis across organizations.
Challenges of Web-Enabling a Data Warehouse
• Security
• Business value
• Impact assessment
• Setup and management
• Tools and support for global requirements

Challenges of Web-Enabling a Data Warehouse
According to the Hurwitz Group, putting a data warehouse on the Web offers tremendous benefits, but it also presents some technical and organizational challenges.
• Security: The loss of data warehouse data to hostile parties can have extremely serious legal, financial, and competitive impacts on an organization. Make sure that your solution has strong encryption, authorization, and authentication services.
• Business value: In order to succeed in Web-enabling your data warehouse, you need to have a warehouse sponsor who will help to develop a clear business case for putting the warehouse on the Web. Some of the questions to answer include:
– What are users going to do with the Web-enabled data warehouse?
– Who will you allow to access the Web-enabled data warehouse?
– What will users be allowed to use the Web-enabled data warehouse for?
– How will this affect other departments, such as order processing, sales, indirect channels and other business partners, and customer support?
• Impact assessment: You need to assess the impact a Web-enabled warehouse will have on your IT organization and infrastructure.
This includes:
– Changes in utilization patterns and the number of active clients
– The need to learn new skills, such as integrating a warehouse database with a Web server
– Other areas of consideration: Networks, servers, failover and recovery procedures, development and testing tools, and training programmers as well as operators
• Setup and management: You need to consider how people will use the warehouse and what impact their behavior will have on performance, availability, throughput, and network bandwidth. You need to select among three basic query approaches:
– Static pages
– Dynamic pages
– Dynamic queries
• Tools and support for global requirements: Because putting your warehouse on the Web stresses its load and capacity, you will need good tools for managing the system, especially the network and various servers. You must ensure that your vendors’ support services will meet your global support requirements.
(Source: Robert Craig, Data Warehousing and the Web. Hurwitz Group. September/October, 1997)

Common Web Data Warehouse Architecture
[Figure: client browser, HTML, Web server, Common Gateway Interface, gateway program, warehouse database]

Common Web Data Warehouse Architecture
[Figure: World Wide Web client, client browser, and Windows clients; Web server; communication mechanisms Common Gateway Interface (CGI), Object Request Broker Cartridge, Servlets, Netscape Server API (NSAPI), and Internet Server API (ISAPI); OLAP server; warehouse server]
Common Web Data Warehouse Architecture
The warehouse may be accessed through a browser using a standard gateway interface. The requestor accesses the Web server, using the Uniform Resource Locator (URL) address. The protocol between the requestor and the server is Hypertext Transfer Protocol (HTTP). The text document that travels between the two parties (Web server and requestor) is written using Hypertext Markup Language (HTML).

Warehouses are concerned with real data, not text documents. The Common Gateway Interface (CGI) facility of the Web server software provides a way of executing server-resident software, such as a gateway program that issues a SELECT statement against a relational database.

Building secure applications for the Internet requires a well-thought-out security strategy as well as the appropriate application architecture. Most Web applications provide all users with the same access permissions. The information available is either not confidential or of a low level of confidentiality. The same security issue currently exists at the database level.

Note: As noted in the bottom slide for Common Web Data Warehouse Architecture, the communication mechanism between the OLAP server and Web server can be any one of the following mechanisms:
• Common Gateway Interface (CGI)
• Object Request Broker Cartridge
• Servlets
• Netscape Server API (NSAPI)
• Internet Server API (ISAPI)
• Other compatible mechanisms
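The CGI path described above (browser request, Web server, gateway program, SELECT against the database, HTML back to the browser) can be sketched as a small Python gateway program. This is an illustration under stated assumptions, not the architecture of any Oracle product: an in-memory SQLite table stands in for the warehouse database, and the table `sales`, the `region` parameter, and the function names are hypothetical. The only CGI conventions relied on are that the Web server passes the URL query string in the `QUERY_STRING` environment variable and expects a header, a blank line, and then the document on standard output.

```python
import html
import os
import sqlite3
from urllib.parse import parse_qs


def build_demo_warehouse():
    """Create a tiny in-memory table standing in for the warehouse database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("East", "Tools", 1200.0), ("East", "Toolbox", 300.0), ("West", "Tools", 900.0)],
    )
    return conn


def handle_request(query_string, conn):
    """Run a parameterized SELECT and return a CGI response: header, blank line, HTML."""
    params = parse_qs(query_string)
    region = params.get("region", [""])[0]
    # Bind the user-supplied value as a parameter instead of building SQL text from it.
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM sales "
        "WHERE region = ? GROUP BY product ORDER BY product",
        (region,),
    ).fetchall()
    cells = "".join(
        f"<tr><td>{html.escape(product)}</td><td>{total:.2f}</td></tr>"
        for product, total in rows
    )
    body = (
        f"<html><body><h1>Sales for {html.escape(region)}</h1>"
        f"<table>{cells}</table></body></html>"
    )
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The default query string is used only when the script is run outside a Web server.
    query = os.environ.get("QUERY_STRING", "region=East")
    print(handle_request(query, build_demo_warehouse()), end="")
```

The sketch gives every requestor the same access, which is exactly the situation the notes above warn about; a real deployment would add the authentication and authorization checks discussed later in this lesson.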
Issues in Deploying a Data Warehouse on the Web
• Security:
– Authentication and authorization
– Communication confidentiality
– Access and restriction management
• Scalability
• Availability

Security
The Computer Emergency Response Team (CERT), an Internet security watchdog organization, calculates that the number of security incidents reported to the center has grown dramatically, from fewer than 100 in 1988 to almost 2,500 in 1995.

The leakage of data warehouse information through unauthorized access by hostile parties can have extremely serious legal, financial, and competitive impacts on an organization. This is because the warehouse holds processed information, such as summarized data, trend analysis, and confidential reports, that is used to make business decisions. Such leakage may also go undetected.

Security is thus of utmost importance to the data warehouse manager. To address the security needs, the data warehouse manager needs to pay attention to authentication and authorization, communication confidentiality, and access and restriction management.
Authentication and Authorization
According to CERT:
“Authentication is proving that a user is who he or she claims to be. That proof may involve something the user knows (such as a password), something the user has (such as a smart card), or something about the user that proves the person’s identity (such as a fingerprint). Authorization is the act of determining whether a particular user (or computer system) has the right to carry out a certain activity, such as reading a file or running a program. Authentication and authorization go hand in hand. Users must be authenticated before carrying out the activity they are authorized to perform.” (CERT, Security of the Internet (Web version). February 1998.)

There are three means for a user to authenticate himself or herself:
• Something the user knows, such as a PIN or reusable password
• Something the user has, such as a smart card
• Something specific to the user, such as his or her palm print or voice

The three most widely used ways are:
• Password: A string of characters, and the most basic security measure. Unfortunately, the same password is often used to access different systems and can be captured or stolen. It is better to use one-time passwords.
• Digital certificates: An electronic certificate identifies users to ensure the successful and authorized transfer of information. The certificate identifies its owner to someone who needs proof of the bearer’s identity.
• Authentication tokens: These are small one-time password calculators with a display and sometimes a keypad. Some examples of authentication tokens are smart cards, thumbprint biometric scanning, and retinal pattern biometric scanning.

More advanced security technologies employ at least two of the three factors of user authentication and identification. For example, factor one is a memorized personal identification number; factor two is a smart card that displays a code generated at a programmed interval. The two factors combine to produce a one-time password.

Communication Confidentiality
Ensure that third parties cannot eavesdrop on communications or impersonate communicating parties. Data that is traversing the Internet should not be readable by unauthorized parties. Encryption, which is the transformation of data into a form unreadable by anyone without a suitable decryption key, is often used to protect data confidentiality.

The two most widely used types of encryption are symmetric key encryption and public key encryption. In symmetric encryption, the same key is used to encrypt and decrypt the message. Therefore both the sender and the receiver must somehow acquire the key before confidential communication can proceed. This distribution of the key is a point of vulnerability, and if it is done improperly, the communication can be compromised.

With public key encryption, one key is used to encrypt and a second, different key is used to decrypt. The first key cannot decrypt the message and can be sent from the recipient to the sender or even made public. The sender uses this key to encrypt the message for the recipient. This ensures confidentiality in communication but not authentication of the sender.

To provide both authentication and communication confidentiality, you can use digital certificates based on public key encryption. A trusted third party authenticates both parties by some reliable method and issues them digital certificates.

Access and Restriction Management
There should be some way to determine across the enterprise whether a particular party has certain privileges or access to valuable resources. When access and restriction management is not controlled in a unified manner, there is a possibility that certain parties may still have authorized access even though that is not desired. A directory server is often used as a single point of access and a single point of authentication.

Other access management tools are routers and firewalls. Routers can be configured to restrict the flow of network packets to selected portions of the network based on message origin and destination.
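The router behavior described above, restricting packets by message origin and destination, can be sketched in a few lines of Python. This is only an illustration of the idea, not a real router configuration: the rule table, the addresses, and the deny-by-default policy are all assumptions chosen for the example.

```python
import ipaddress

# Hypothetical rule table: (source network, destination network, action).
# The first matching rule wins. All addresses are invented for illustration.
RULES = [
    # Analysts on the internal LAN may reach the warehouse subnet.
    (ipaddress.ip_network("10.1.0.0/16"), ipaddress.ip_network("10.9.0.0/24"), "permit"),
    # Anyone may reach the public Web server.
    (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_network("192.0.2.10/32"), "permit"),
]


def decide(src, dst, rules=RULES, default="deny"):
    """Return the action of the first rule matching both origin and destination."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for src_net, dst_net, action in rules:
        if src_ip in src_net and dst_ip in dst_net:
            return action
    # Assumption: traffic that matches no rule is dropped.
    return default


if __name__ == "__main__":
    print(decide("10.1.4.7", "10.9.0.20"))     # internal analyst to warehouse
    print(decide("203.0.113.5", "10.9.0.20"))  # outside host to warehouse
```

In this sketch an outside host can reach the Web server but not the warehouse subnet directly, which mirrors the architecture in which browsers talk only to the Web server.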
Firewalls restrict the flow of traffic from one network to another based on protocols. Firewalls often include the capabilities of routers. In addition, a firewall can include the capabilities of a proxy server and make requests to external computers on behalf of internal network computers. This hides from outside users the configuration of the internal network, such as the names, IP addresses, and operating systems of internal computers.

Scalability
When enterprises that serve a large population offer service over the Internet, they face unpredictable demands. In particular, they may have to handle peak and special demand loads. With many business and government organizations there are potentially thousands, if not millions, of online users. Data warehouse demands also tend to grow rapidly over time. Web-based access to a data warehouse might therefore need a high order of scalability as well. This means that the system has to be parsimonious in its use of computing resources per user and should be incrementally extensible through the addition of computing resources.
The main concerns for data warehouse scalability over the Web are:
• The amount of data
• The complexity of queries
• The number of areas
• The number of users

The amount of data that is stored in a data warehouse is substantially greater than for most operational databases and continues to grow with time. Anthem’s data warehouse, for example, began with 1.3 TB of data and was anticipated to grow tenfold in three years. Because users are looking for trends and comparing data, it is typical for large amounts of data to be sent to the user per request.

The potential bottlenecks are in:
• Storage capacity
• Memory
• Computational cycles
• Limits on operating system resources such as file handles, ports, and locks
• Network bandwidth

Scalability issues should be considered from the beginning to handle both current needs and future growth. It may be difficult or impossible to make a nonscalable system scalable after implementation. However, it is more cost-effective if resources can be added incrementally, only as needed as growth occurs, rather than all at once.
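To make the growth figure quoted in the Scalability discussion concrete (1.3 TB growing tenfold in three years), the following Python sketch converts the overall factor into a yearly one and projects the size year by year. It assumes steady compound growth, which the source does not state; real growth is rarely that smooth.

```python
def annual_growth_factor(total_factor, years):
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def project_sizes(start_tb, total_factor, years):
    """Projected size in TB at the end of each year, starting with year 0."""
    rate = annual_growth_factor(total_factor, years)
    return [start_tb * rate ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Figures from the text: 1.3 TB at the start, tenfold growth in three years.
    for year, size in enumerate(project_sizes(1.3, 10, 3)):
        print(f"Year {year}: {size:.1f} TB")
```

Under this assumption the warehouse slightly more than doubles each year (a factor of about 2.15), passing roughly 2.8 TB and 6.0 TB on the way to 13 TB. This is the kind of estimate that supports adding storage incrementally rather than all at once.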
Availability
The Internet extends the reach of database applications throughout the enterprise, organizations, and communities. This reach further highlights the importance of high availability in data management solutions. Small businesses and global enterprises alike have customers all over the world requiring access to data 24 hours a day, 7 days a week. This is true of many large operational systems but is also becoming the case for data warehouses.

One consequence is that maintenance windows are shrinking or disappearing. Second, failure in one part of the system does not necessarily make the entire system unavailable.

Maintenance windows are typically used to batch extract, process, and refresh information for the data warehouse. In the future it will become important to be able to perform such maintenance operations on the data warehouse while it is online. This covers everything from adding disk packs, computers, and data files, to cleaning and refreshing the data from the operational system, to performing backup, archiving, and recovery operations.
Evaluating Web-Based Tools

Requirements
Wayne Eckerson, from the Patricia Seybold Group, outlined the following requirements for evaluating Web-based query and analysis tools.

Requirement: Interactivity
Specific questions to ask:
Does the tool provide interactivity that covers tables, charts, and quadrants?
Note: Most tools provide static viewing capabilities.

Requirement: Functionality
Specific questions to ask:
Compare the functionality of the Web-based tool to the functionality of its client-server-based version in the area of:
• Calculations
• SQL generation
• Formatting
• Navigation techniques
• Layout controls
Note: The Web-enabled tool must meet the requirements of your target audience.

Requirement: Architecture
Specific questions to ask:
It is important to consider what generation of Web architecture the tool requires. Specifically consider:
• Does it support a four-tier architecture using CGI interfaces or native Web server interfaces?
• Does it support a three-tier architecture using Java client and server and proprietary client-server protocols?
• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?
• How closely is the tool tracking emerging Internet and Web standards?

Requirement: Performance
Specific questions to ask:
A tool that uses native Web server interfaces will run faster in a multiuser environment than tools that use CGI. Consider the following:
• How quickly can users access the data they need?
• How long does it take to download dynamic client-side programs?
• What trade-off does the tool make between interactivity and performance?
Requirements (continued)

Requirement: Design
Specific questions to ask:
Does the tool require designers to do coding in HTML or CGI scripts to create sophisticated HTML reports with drill-down, pivots, and embedded links?
Note: Design is an important factor to consider. Most tools use their existing client-server tools to build reports, which are then published in HTML. However, it is important to know what gets lost in the translation.

Requirement: Administration
Specific questions to ask:
The tool must be able to control access to reports by user, group, and role. After users log on to the Web server, they should be presented with a custom menu that shows only those reports that they are authorized to access. Some of the questions to consider are:
• Does the tool have a utility for managing a great many report files on a Web server?
• How does it control user access to reports?
• Does it work with existing security features of application servers and database server?

Requirement: Output
Specific questions to ask:
A good tool will generate HTML for wide distribution as well as reports in native proprietary format for use with helper applications. Advanced tools should also generate Java for display within a Java window. Specifically consider:
• Can the tool output data in a variety of formats such as grid, crosstab, and chart and in a variety of languages such as HTML, Java, and Excel?
• Which release of HTML does the tool support?

Requirement: Scalability
Specific questions to ask:
• What platforms does the tool’s main execution engine run on?
• Does it support load-balancing?

Requirement: Databases
Specific questions to ask:
• What databases does the tool support?
• Does it support both relational and OLAP databases?
• Does it use native drivers such as ODBC and JDBC?
• Does it support text?

Requirement: Pricing
Specific questions to ask:
• How much does the tool cost?
• Does it support a Web pricing model?
Note: Many companies are starting to charge by concurrent user and the size of the server machine rather than by per-seat charges and flat-fee server pricing.

(Patricia Seybold Group, Wayne Eckerson, Web-Based Query Tools and Architecture. March 1997)

Summary
This lesson covered the following topics:
• Highlighting the main benefits of Web-enabling the data warehouse
• Discussing the main issues in deploying a data warehouse on the Web
• Specifying the requirements for Web-based tools

Practice 16-1 Overview
This practice covers the following topics:
• Completing the Web-based tool requirement checklist
• Justifying each response
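One possible way to record the responses that Practice 16-1 asks for is a small table of requirements, each with an importance rating and a justification. The Python sketch below is only an illustration: the 1-to-5 rating scale, the field names, and the sample ratings are assumptions, not part of the course material. The requirement names are the ten from the evaluation tables in this lesson.

```python
from dataclasses import dataclass

# The ten requirements named in the lesson's evaluation tables.
REQUIREMENTS = [
    "Interactivity", "Functionality", "Architecture", "Performance",
    "Design", "Administration", "Output", "Scalability",
    "Databases", "Pricing",
]


@dataclass
class Response:
    requirement: str
    importance: int  # hypothetical scale: 1 (low) to 5 (high)
    reason: str      # the justification the practice asks for


def validate(responses):
    """Return a list of problems: missing items, bad ratings, empty reasons."""
    problems = []
    answered = {r.requirement for r in responses}
    for name in REQUIREMENTS:
        if name not in answered:
            problems.append(f"{name}: no response")
    for r in responses:
        if r.requirement not in REQUIREMENTS:
            problems.append(f"{r.requirement}: not a listed requirement")
        if not 1 <= r.importance <= 5:
            problems.append(f"{r.requirement}: rating outside 1-5")
        if not r.reason.strip():
            problems.append(f"{r.requirement}: no justification")
    return problems


def ranked(responses):
    """Order responses from most to least important."""
    return sorted(responses, key=lambda r: r.importance, reverse=True)


if __name__ == "__main__":
    sample = [Response(name, 3, "sample reason") for name in REQUIREMENTS]
    sample[3] = Response("Performance", 5, "Many concurrent browser users")
    print(validate(sample))
    print([r.requirement for r in ranked(sample)][:3])
```

Ranking the completed responses gives a short list of the requirements to weigh most heavily when comparing tools.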
Practice 16-1

Web-Based Tool Requirement Checklist
For each item in the following list of Web-based tool requirements, rate your own organization’s needs and requirements. Rate each item’s relative importance in measuring your organization’s needs and requirements.

Requirement: Interactivity
Specific questions to ask:
Does the tool provide interactivity that covers tables, charts, and quadrants?
Is this important to you? Why?

Requirement: Functionality
Specific questions to ask:
Compare the functionality of the Web-based tool to the functionality of its client-server-based version in the area of:
• Calculations
• SQL generation
• Formatting
• Navigation techniques
• Layout controls
Is this important to you? Why?

Requirement: Architecture
Specific questions to ask:
• Does it support a four-tier architecture using CGI interfaces or native Web server interfaces?
• Does it support a three-tier architecture using Java client and server and proprietary client-server protocols?
• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?
• How closely is the tool tracking emerging Internet and Web standards?
Is this important to you? Why?

Requirement: Performance
Specific questions to ask:
• How quickly can users access the data they need?
• How long does it take to download dynamic client-side programs?
• What trade-off does the tool make between interactivity and performance?
Is this important to you? Why?

Web-Based Tool Requirement Checklist (continued)

Requirement: Design
Specific questions to ask:
Does the tool require designers to do coding in HTML or CGI scripts to create sophisticated HTML reports with drill-down, pivots, and embedded links?
Is this important to you? Why?

Requirement: Administration
Specific questions to ask:
• Does the tool have a utility for managing a great many report files on a Web server?
• How does it control user access to reports?
• Does it work with existing security features of application servers and database server?
Is this important to you? Why?

Requirement: Output
Specific questions to ask:
• Can the tool output data in a variety of formats, such as grid, crosstab, and chart, and in a variety of languages, such as HTML, Java, and Excel?
• Which release of HTML does the tool support?
Is this important to you? Why?

Requirement: Scalability
Specific questions to ask:
• What platforms does the tool’s main execution engine run on?
• Does it support load-balancing?
Is this important to you? Why?

Requirement: Databases
Specific questions to ask:
• What databases does the tool support?
• Does it support both relational and OLAP databases?
• Does it use native drivers such as ODBC and JDBC?
• Does it support text?
Is this important to you? Why?

Requirement: Pricing
Specific questions to ask:
• How much does the tool cost?
• Does it support a Web pricing model?
Is this important to you? Why?

Lesson 17: Managing the Data Warehouse

[Overview slide: course map with the following blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Managing the Data Warehouse; Supporting End User Access; Project Management (Methodology, Maintaining Metadata).]
Overview
This lesson explores the management issues, critical success factors, and challenges to successful data warehouse implementation. The lesson addresses issues pertaining to the management of the entire warehouse life cycle.

Note that the “Managing the Data Warehouse” block is highlighted in the overview slide.

Objectives
After completing this lesson, you should be able to do the following:
• Develop a plan for managing the transition from development to implementation
• Identify challenges pertaining to the growth of the data warehouse
• Describe backup and archive mechanisms
• Identify data warehouse performance issues

Managing the Transition to Production
Another set of key management issues surrounds the transition from warehouse development to production. These issues include:
• Promoting the support of management, developers, and end users for the changes accompanying the warehouse
• Choosing between a manageable pilot and large-scale implementation
• Documentation
• Testing
• Training
• Postimplementation support
• Maintaining the warehouse
Promoting Support for Change
Unfortunately, not everyone easily tolerates or accepts the introduction of new systems and associated technologies. End users and information systems personnel, particularly, are often bombarded with new systems and technology. There are reasons why staff may be either for or against supporting the warehouse.

Management
Reasons to support:
• Competitive advantage
• Benefit from the investment
Reasons not to support:
• Fear of change
• Risk avoidance

Developers
Reasons to support:
• Opportunity to learn new and valuable skills
• Leading-edge technology
Reasons not to support:
• Fear of obsoleting old skill set

End Users
Reasons to support:
• Faster and more flexible systems
• Improved and more powerful query tools
Reasons not to support:
• Disruption of routine
• Change of toolset
• Increased workload

Methods for Promoting Support
Given the fears identified, there are ways you can control the transition to this new and exciting but challenging environment. Some of these may be obvious; however, they are worth stating.
• Ensure that everyone is aware of the benefit the warehouse is going to bring to the business. A profitable organization is able to grow, compete, adapt, and keep staff.
• Ensure that all staff involved in the warehouse project are aware of what is happening at each stage. Provide constant and consistent feedback on status, including problems and successes.
• Ensure that the IT staff are trained with the skills they need (old and new).
• Provide users with the training necessary to use the query tools effectively and imaginatively.
• Keep the project on course. Do not let any phase of development slip without understanding why, and apply what you learn to the next increment. Monitor progress constantly.
..................................................................................................................................................... Data Warehousing Fundamentals 17-7 Lesson 17: Managing the Data Warehouse ..................................................................................................................................................... Choosing Between Pilot and Large-Scale Implementation Large-Scale Implementation Pilot Copyright Oracle Corporation, 1999. All rights reserved. The Warehouse Pilot • Demonstrates benefits to: – Management – Users – IT staff • • • • • • Relevant to the business Low technical risk Small and feasible Anticipates increased use Focused on an initial business issue Remains in context Copyright Oracle Corporation, 1999. All rights reserved. ..................................................................................................................................................... 17-8 Data Warehousing Fundamentals Managing the Transition to Production ..................................................................................................................................................... Choosing Between Pilot and Large-Scale Production This choice should have been already made at an earlier planning stage. The preferable choice is a pilot, the success of which can be leveraged into further incremental rollouts. The Warehouse Pilot The pilot demonstrates benefits to management, end users, and IT staff. Management the business. The warehouse can provide current and ongoing financial benefits to End Users The types of information available, the flexibility of the tools, and the type of analysis possible. IT Staff Whether their strategy and development plans were appropriate. Changes can be made prior to developing the next increment. Essential considerations for the pilot are to: • Ensure that the subject matter chosen is relevant to the business. 
Thus, the pilot may focus on an initial business issue such as sales or marketing.
• Have a low technical risk by starting small and feasible. The pilot data may come from a single relational source, which makes the pilot more likely to succeed as a proof of concept. Further iterations may extract data from diverse sources.
• Anticipate significant use.
• Ensure that the pilot, however small, remains within the context of the larger vision.

Piloting the Warehouse
• Designers: prove model, data, and access tools
• Users
  – Prove ease of use of tool
  – Check data and query performance
  – Identify training requirements
• Developers
  – Resolve ETT and metadata issues
  – Determine users' data and training requirements
  – Test security and access levels, monitor performance

Piloting the Warehouse
You can position the pilot, or prototype, as the starting point of the iterative warehouse process mentioned earlier. It is a vital part of the implementation. The pilot must cover all aspects of implementation and ensure user involvement at every step in the process or phase of the life cycle.
A specific subject area of the warehouse is targeted for the pilot, and the selected query tools should be available to the users for data access. The pilot fulfills a number of tasks, including the following:
• It enables the designers to prove the model, the data, and the access tools.
• It enables the users to:
  – See how easy the access tool is to use
  – Enhance their data requirements
  – Identify their training requirements
  – Measure query performance
• It allows the developers to:
  – Determine whether the ETT process is adequate and modify it accordingly
  – Identify any issues with the metadata presented to the users or used by ETT
  – Determine the users' near-future and possibly even long-term requirements
  – Identify and define the users' training needs
  – Test access levels and security of the systems and data
  – Monitor performance

Several things must be agreed upon before piloting:
• You must ensure that acceptance criteria are documented and agreed upon.
• You need to identify volume and scalability tests and develop a test plan with test cycles.
• Once the tests are executed, you can gather statistics on performance and optimize where necessary.
You must test the entire process of refreshing the data, and produce a report that contains a complete and detailed evaluation of this proof of concept.

Documentation
Produces textual deliverables:
• Glossary
• User and technical documentation
• Online help
• Metadata reference guide
• Warehouse management reference
• New features guide
Testing the Warehouse
• Test every stage
• Use a realistic test database and environment

Documentation
This process centers on producing all user and technical documentation for the data warehouse, including references, user and system operations guides, and online help.

Metadata Reference Guide
To ensure active and successful use of the warehouse, the metadata reference guide describes the contents of the data warehouse in business terms and provides a navigational road map to those contents.

Warehouse Management Reference
The warehouse management documentation outlines the workflow and procedures (both manual and automated).

New Features Guide
The new features guide highlights any enhancements to warehouse functionality that result from the implementation of the solution.

Testing the Warehouse
Do not assume, "No problem, it will work." Always test components.

Test Database
Testing is required at every stage of development and must involve every component. Ideally, test on a test database, using a machine and network setup as close as possible to the planned production environment. If you are using Oracle Data Warehouse Method, testing is a specific requirement during most phases and for many tasks.
Training
• Users:
  – Metadata
  – DSS tools
  – Ad hoc queries
  – Getting help
  – Registration of enhancement requests
• Information systems developers:
  – Analysis techniques
  – Hardware technicalities
  – Networking
  – Implementing, building, and supporting DSS

Training
During project planning, allocate time and resources to educating key information technology staff, end users, and management personnel about data warehousing and its benefits. Education begins at the start of development and continues right through to the end and on to further iterations.

Educating Users
Educating users on how to access data is one of the most critical areas of warehouse training. Always ensure that representatives from each user group are invited to courses and workshop sessions.
Users need to know how:
• The metadata represents the business data
• To use the decision support tools to answer business questions
• To create ad hoc queries and save data results
• To contact the help desk or support group for assistance
• To register requests for enhancements through a formal change management process

Educating IT Staff
Information systems staff need education in the following areas:
• How to communicate and understand people issues
• Business analysis techniques
• Technical aspects of the hardware architecture
• The network environment
• Decision support and OLAP tools: implementing, building, and supporting
Educating everyone involved with the warehouse is most critical for the first implementation. Everyone must be made aware of what the warehouse is, even if they are not directly involved with the project.

Postimplementation Support
• Evaluate and review the implementation
• Monitor the warehouse:
  – Respond to problems
  – Conduct performance tuning
  – Roll out metadata, queries, reports, filters, and conditions
  – Implement security
  – Incorporate new users
  – Distribute data marts and catalogs
  – Transfer ownership from IT
Postimplementation Support
This process provides an opportunity to evaluate and review the implementation. You access metadata and evaluate queries and reports run against the warehouse. This information helps you manage standard queries, reports, and the user layer, and identify required indexes.

Monitoring the Warehouse
After implementation, you need to monitor the warehouse continuously. Ongoing tasks include:
• Monitoring and responding to system problems
• Conducting performance and tuning activities for all components of the data warehouse
• Rolling out metadata, queries, reports, filters, and conditions
• Implementing security
• Incorporating new users
• Distributing data marts and catalogs
• Transferring ownership (responsibility for the data warehouse may be transferred from IT personnel to the owning organization)

Managing Growth
[Figure: Expanding user numbers. Number of users (0 to 300) plotted against period after implementation (initial, 3 months, 6 months, 12 months, 24 months). Source: Data Warehouse Institute Flash Report, January 1996]

Types of Growth
• Increasing number of users
• Broader usage
• Growth of data volumes
Managing Growth
The table below is the result of a survey showing that the number of users accessing a successful warehouse grows substantially during the first two years. The rise in use between 12 and 24 months is especially large.

Number of Users Actively Querying the Warehouse
Period             Large and small sites    Small sites
Initial number     16                       6
After 3 months     19                       12
After 6 months     44                       20
After 12 months    99                       28
After 24 months    255                      55

Once the benefits of the warehouse become tangible to the user community, demands on the warehouse increase dramatically. The table and chart are sourced from the Data Warehouse Institute Flash Report, January 1996.

Types of Growth
• Increasing number of users
• Broader, more varied usage
• Growth of data volumes
The database increases in size through the accumulation of historical data and the addition of new subject areas. Warehouse usage increases through the availability of new decision support functionality and the evolving empowerment of the user population.

Expansion and Adjustment
• Evaluate continually:
  – Changes
  – New increments
  – Unnecessary components
  – Strategies
• Ensure open environment
• Document development processes for the future:
  – Planning
  – Cost analysis
  – Problem assessment and correction
  – Performance assessment
Controlling Expansion
Control by:
• Ensuring the continuity of staff
• Documenting processes, solutions, and metrics
• Establishing working test and production architecture for further increments
• Creating a strategy for maintaining changes to data

Expansion and Adjustment
Continually evaluate the warehouse to identify:
• Changes that can be made
• Additional increments (although these are usually identified in the primary strategy phase of development)
• Components that may be removed (for example, unused summaries)
• Optimal indexing and performance strategies

Openness for the Future
An open architecture and toolset is required to suit current and future requirements.

Document for the Future
You should document the process used in developing the data warehouse solution and collect metrics, as an aid to:
• Future planning
• Further and future cost analysis of current or new projects
• Identification of errors and inadequacies that can be eliminated for the next project
• Assessing tool performance
Note: The DWM Transition to Production Phase creates tasks for these postimplementation issues. The Discovery Phase evaluates all warehouse components.

Controlling Expansion
To control the expansion and adjustment process, and to promote its success, you should:
• Ensure the continuity of staff on warehouse projects
• Document the process used in developing the warehouse solution and metrics
• Establish a working test and production architecture that can be used for further increments
As organizational structures change, the historical data reflects a different story.
Determine a strategy for managing changes to the data.

Sizing Storage
• Consider different methods
• Determine the best for your needs
• Know the business requirements
• Do not underestimate requirements
• Plan for growth
• Consider space for unwanted data

Sizing Storage
Sizing data storage (or capacity planning) takes place at a number of stages, for each increment of the data warehouse solution. It is often revised before being finalized. Sizing must take into account all the object space needed, not just the database itself with the warehouse data.

Do Not Underestimate
Capacity planning is an art in itself. There are many objects for which space must be accurately estimated, such as tables, indexes, logs, sort areas, and temporary space. You may think that this is not much different from the operational system; however, with the warehouse you are looking at very large databases with very large space requirements. It is all too common, when sizing, to forget these additional objects.

Planning for Growth
In addition, your early planning stages must consider the growth of these areas.
Once implemented, the data warehouse grows at every refresh cycle, often rapidly, and space must be available for that growth.

Removing Unwanted Data
When data is no longer needed, it is either purged (removed and never used again) or archived (for possible later use). Consider the space and location of archive data.
Pay careful attention to determining the storage requirements for the warehouse. This includes space for:
• Data: fact, dimension, reference, and summary
• The staging file store
• Indexes
• Backup and recovery strategies
• Temporary files

Estimating Storage
• Fact volumes
• Fact lifetime
• Technology availability
• Technology purchase
• Storing pre-summarized data
• Mirroring or other techniques requiring disk storage

Objects That Need Space
• ODS
• Indexes and metadata
• Summary data
• Redo logs
• Rollback information
• Sort areas
• Temporary space
• Workspace for backup and recovery

Estimating Storage
In any discussion on this subject, you find a vast array of different ideas, opinions, methods, and approaches. There is no one single recommendation.
You need to consider the different approaches that are possible, choose the best for you and your data warehouse, and keep it simple. You should never underestimate the amount of space needed in the data warehouse.
In order to estimate accurately, you need to answer some simple questions about your data:
• What is the expected volume of core fact data?
• What is the lifetime of core fact data?
• Do you have the technologies to support that volume?
• If not, do you need to purchase the technologies?
• How important is storing pre-summarized data?
• Does your recovery strategy involve mirroring or other techniques requiring disk storage?

Objects That Need Space
A detailed understanding of available data is essential for planning capacity at an early stage. Capacity planning is ongoing throughout the life of the warehouse. Consider disk requirements for:
• Intermediate data store (This is sometimes implemented as an Operational Data Store (ODS) and referred to as a staging area. It holds data that has been extracted from source systems, prior to being loaded into the warehouse.)
• Indexes, of which there may be many more than in normal operational systems
• Metadata that contains the map to the warehouse structure and content
• Summary data that comprises aggregated data
• Redo logs and rollback information
• Sort areas and temporary space
• Load files moved to the server
• Workspace for backup and recovery
Test Load Sampling
• Analyze statistically significant data samples
• Use test loads for different periods
• Reflect day-to-day operations
• Include seasonal data and worst case scenarios
  – Calculate number of transactions
  – Employ average sales price approach

Test Load Sampling
You have to decide on the capacity planning technique that suits you best. You may already have a method that is successful for your operational environment and can be enhanced for VLDBs and other warehouse objects, such as ODSs.
A good approach to sizing is based on the analysis of a statistically significant sample of the data. Test loads can be performed on data from a day, a week, a month, or any other period of time. Take care that the sample periods reflect the true day-to-day operations of your company, and that the results account for seasonality and other factors, such as worst-case scenarios, that may otherwise prejudice the results.
Once you have determined the number of transactions based on the sample, you calculate the size by using the average sales price approach.

Average Sales Price
Assume transaction-level grain.
Total company revenue             $20 billion
Avg sale price per line item      $5
Number of line items per year     $20 billion / $5 = 4 billion
Number of base fact records       4 billion x 3 yrs = 12 billion
Key fields                        4 (x 4 bytes)
Fact fields                       4 (x 4 bytes)
Base fact table size              12 billion x 8 fields x 4 bytes = 384 GB

Average Sales Price
Use other methods:
• It is difficult to obtain an accurate average
• The calculations can be inaccurate
Do not use this approach on its own

Average Sales Price
The following calculation shows how to estimate the amount of direct-access storage device (DASD) needed for three years' worth of data, using an average sales price algorithm.

Total company revenue                                  $20 billion
Average sales price per line item on an
individual customer receipt                            $5
Number of line items per year for total business       $20 billion / $5 = 4 billion
Number of base fact records                            4 billion × 3 (years) = 12 billion
Number of key fields                                   4 (assume 4 bytes per field)
Number of fact fields                                  4 (assume 4 bytes per field)
Total fields                                           8
Base fact table size                                   12 billion × 8 fields × 4 bytes = 384 GB

If you take your company's annual gross revenues and divide by the average revenue per transaction, then multiply this figure by the length of the row (key columns and data columns) in your fact table, you have the amount of DASD needed for a year's worth of data.
You should never use this approach on its own; it is simplistic. The problem is that it is difficult to get the average revenue per transaction.
It is unusual to have a set price point or even a relatively narrow price range for the products offered by any company. Many companies have products that sell in volume at relatively low prices, say $5, and they may have low-volume big-ticket items as well, all of which distort the average. For example, if the average used is $5, you need 384 GB of DASD, but if the average is in reality $10, you need only 192 GB of DASD.
Note: This approach is one that is recommended by Ralph Kimball, and takes a business view rather than a technical view of sizing.

Other Techniques and Considerations
• Queuing models
• Rule of thumb: total database size is three to four times the size of the base fact tables
• Consider:
  – Sparseness
  – Dimensions
  – Indexes
  – Summaries
  – Sort operational space

Space Management
• Monitor
• Avoid fragmentation
• Test load data
• Plan for growth
• Know business patterns
• Never let space become an issue

Other Techniques and Considerations
Queuing Models
Mathematical models can be used to predict response time and throughput.
Rule of Thumb
This rule is often quoted within Oracle.
Depending upon the database server and end-user tools, the total database size is three to four times the size of your base fact tables.

Other Considerations
• Sparseness of data in fact tables. Fact table data is generally sparse; relatively few of all the possible key value combinations are present. Summary tables are not considered sparse. That is, they contain values for every possible key value combination.
• Large dimension tables
• Significant increase in size of database caused by indexes
• Large summary tables. Sometimes they occupy as much space as the base fact table. There may be hundreds of summary tables for a warehouse implementation.
• The need for sort operational space for sorting and loading
Note: You may consider using leasing and chargeback strategies for any excess storage capacity, especially in a massively parallel processor (MPP) configuration.

Space Management
Once you have determined a technique for planning capacity and are aware of the numerous objects that need space, you need to consider management of this space:
• Space usage must be monitored and any fragmentation noted and resolved.
• You should load test sets of data and analyze them carefully (use the ANALYZE command) to estimate average row length and rows per block, to predict whether you have sufficient capacity.
• You need to consider how the database is going to grow, and plan for additional storage accordingly. Fact data grows rapidly, depending upon the refresh cycle frequency; it grows every time a refresh occurs.
• Knowing the patterns within your business is key to planning these requirements.
Never allow space to become an issue in a warehousing environment; all of the operations discussed depend on it.
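The average sales price calculation and the three-to-four-times rule of thumb described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the Oracle Data Warehouse Method: the function names are hypothetical, gigabytes are taken as decimal (10^9 bytes), and the inputs are the figures from the worked example ($20 billion revenue, three years of history, four key fields and four fact fields of 4 bytes each).

```python
# Hypothetical helper names; the inputs are the worked example's figures.
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def base_fact_table_gb(annual_revenue, avg_sale_price, years,
                       key_fields=4, fact_fields=4, bytes_per_field=4):
    """Estimate base fact table size (GB) with the average sales price approach."""
    line_items_per_year = annual_revenue / avg_sale_price
    fact_rows = line_items_per_year * years
    row_bytes = (key_fields + fact_fields) * bytes_per_field
    return fact_rows * row_bytes / BYTES_PER_GB


def total_database_gb(base_fact_gb, low_factor=3, high_factor=4):
    """Rule of thumb: total size is three to four times the base fact tables."""
    return base_fact_gb * low_factor, base_fact_gb * high_factor


if __name__ == "__main__":
    # Compare the two average prices discussed in the text.
    for price in (5, 10):
        fact_gb = base_fact_table_gb(20_000_000_000, price, years=3)
        low, high = total_database_gb(fact_gb)
        print(f"avg price ${price}: fact table {fact_gb:.0f} GB, "
              f"total roughly {low:.0f} to {high:.0f} GB")
```

Doubling the assumed average price halves the estimate, which illustrates why this approach should not be used on its own.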
Managing Backup and Recovery
• Business requirements for availability
• Fast recovery essential
• Strategy:
  – Defined
  – Tested
  – Proven
  – Evolving

Backup Strategy
• Is based on the business requirements and the cost benefit
• Involves large volumes of data:
  – All objects except temporary tablespaces
  – Incremental
• Includes first-time load and refreshes

Managing Backup and Recovery
Availability
Availability is the key requirement of mission-critical data warehouses; recovery after any type of failure must happen fast. Some companies demand round-the-clock availability, making partial recovery mechanisms imperative.

Backup Strategy
Recovering the database quickly requires a well-defined, tested, and proven backup strategy, as well as a disaster recovery strategy in case of fire, flood, or infestation.

Evolving Strategies
Ensure that as the warehouse evolves, the backup and recovery strategies evolve with it. Test the backup and recovery procedures constantly to ensure that they are relevant to your current environment.
The strategy you deploy is based on the business requirements and the cost benefit. The strategy is not just when and what to back up, but what tools and utilities you are going to use.
Backing up data is different in the data warehouse environment. You are dealing with much larger volumes of data than in operational systems, and with higher availability requirements.

What to Back Up
Everything in the data warehouse must be backed up, except temporary tablespaces; that is, the data and tables, metadata, indexes, constraints, stored procedures, and triggers.

When to Back Up
A critical part of your overall strategy is to determine when the database needs to be backed up. This is no different from an operational environment, except that the frequency of changes to data is unlikely to be as great as that in the operational environment.
You should back up after the first-time load, after incremental refreshes, and after any changes to the database structure, such as adding fact or summary tables. Incremental backups are used, because the data is static between loads. You need to outline the strategy to include full and incremental backups as you would in an operational environment.

Defining the Strategy
• Mission-critical systems
• SLAs:
  – Defined downtime
  – Acceptable MTBF
• Efficient backup and recovery
• Evaluation of different technologies

Planning for Backup
• Plan at the design stage
• Use hot backups for VLDBs
• Back up necessary components:
  – Fact and dimension data
  – Warehouse schema
  – Metadata schema
  – Metadata
• Export/Import utility:
  – Disk space
  – Time
Defining the Strategy
All backup and recovery strategies and tasks must reflect the fact that the data warehouse contains valuable mission-critical data.
A service level agreement (SLA) is drawn up between you and the customer (the users) in the early stages. The SLA should at least define what downtime means (each user may have a different perspective on this) and the acceptable mean-time-between-failure (MTBF) figures.
The backup hardware environment must be as efficient as possible, considering the implications and technicalities of deploying RAID, striping, mirroring (some parts of the database need to be mirrored, others can employ RAID), or partitioning (backing up partitions of data rather than an entire database).

Planning for Backup
The backup and recovery strategy for a warehouse needs to be considered at the design stage. Details such as how the data is partitioned greatly affect the strategy. For small and medium databases, daily cold backups (taken while all instances of the database are shut down) and export/import are viable backup tools.
However, once you move to VLDBs, complete cold backups become difficult to fit into an overnight window. In addition, the disk space required for a complete export of a large database becomes an issue. You need to consider other strategies such as using tape or other devices.
The defined backup strategy for the warehouse should allow for hot backups, where you can back up any part of the database at any time of the day, while the database instances are still active. With Oracle, this means backing up individual and active tablespaces.
You should back up every component that is essential to warehouse operations, that is, everything required to restore a working environment: fact data, dimension data, the data warehouse and metadata schemas, and the data warehouse metadata.

Export/Import
The export/import utility enables all or part of a database to be extracted into a dump file and then imported into another database (under another owner if required). Generally, export/import of a VLDB uses too much disk space. You could use named pipes to a disk on a UNIX system to overcome space problems. However, this technique would be very time-consuming.

[Slide: Backup Tools]
• Oracle7 Enterprise Backup Utility
• Oracle8 Recovery Manager
• Utilities:
  – Import and Export
  – Operating system
  – Third party

[Slide: Parallel Backup and Recovery]
• Parallel Backup
  Runs simultaneously from any node
  – Offline
  – Online
• Parallel Recovery
  Runs simultaneously from redo logs

Backup Tools
Oracle7 Enterprise Backup Utility (OEBU)
This provides a user-friendly interface, documentation, and the recording of backup details in a recovery catalog.
Oracle8 Recovery Manager (RMAN)
Oracle8 Recovery Manager creates image backups and incremental backups. RMAN stores the information from multiple data files (or archive logs, but not both) in a backup set, which is stored in a format that cannot be processed directly (similar in principle to the Export .dmp file). RMAN performs either cold or hot backups.

OEBU and RMAN are very useful in the VLDB environment to ensure that tasks occur without error.

Utilities
• Oracle Import and Export
• Operating system utilities, such as UNIX cpio or tar commands, VMS EXCHANGE, and Windows NT ocopy73.exe or ocopy80.exe
• Third-party utilities that provide a user-friendly layer over operating system backups

Parallel Backup and Recovery
Parallel Backup
With parallel operations, backups can be performed simultaneously from any node of a parallel server.
• Online backups enable the database to be backed up while active, allowing users continuous access.
• Offline backups enable the database to be backed up while shut down, preventing user access.

Parallel Recovery
The goal of parallel recovery is to employ I/O parallelism to reduce the elapsed time required to perform crash recovery, instance recovery, or media failure recovery. The server uses one process to read the log files sequentially and dispatch redo information to several recovery processes, which apply the changes from the log files to the data files.

[Slide: System Failures]
• Process
• Database instance
• Media
• Natural disaster
Failures are costly
System Failures
System failure can be very costly in a warehouse environment. The causes fall into four categories:

Process Failure
Strict and rigorous testing of your plans should prevent this from occurring on a regular basis; however, you cannot afford to ignore the fact that it may happen. Identify an approach to monitoring processes and detecting errors, and a mechanism for reapplying the failed processes.

Database Instance Failure
Instance failure occurs when the Oracle SGA and background processes cannot work. Failure is typically caused by:
• Hardware problems such as power failure
• Software problems such as an operating system crash (hanging)
In an instance failure, data in buffers not yet written to disk will be lost.

Media Failure
Media (disk) failure occurs when errors are detected writing data to or reading data from disk. It is often caused by a disk head crash and affects different types of files, such as data files, redo logs, and control files. Media failures mean that data in buffers not yet written to disk is lost.

Natural Disasters
Natural occurrences such as flood and fire may result in the system becoming unusable.
[Slide: Disaster Recovery Requirements]
• Replacement or standby machine
• Tape and disk capacity
• Communication links to users and data
• Copies of software
• Database backup
• Administration and operations staff
• Documentation

[Slide: Disaster Recovery Planning]
• Establish the strategy
• Prepare the strategy
• Maintain the strategy
• Audit the strategy
• Test recovery plan regularly
• Gain approval from users

Disaster Recovery
Protecting your investment is of the highest consideration. A disaster occurs when a major site loss takes place; usually the site has been destroyed or damaged beyond immediate repair.

Requirements
Recovering from disaster requires the following facilities:
• A replacement, or standby, machine
  It does not have to be as large as the main machine but must have sufficient capacity to run a minimal system and the power to allow the recovery to take place on a meaningful timescale.
• Sufficient tape and disk capacity to perform the recovery on a reasonable timescale
  Having sufficient disk space to run the minimum independent system is not always enough. You may need extra disk capacity to allow initial recovery to happen on a reasonable timescale.
• Communication links to and from users and data owners
• Communication links to and from data sources
  If the system is to be accessible to users, the communication links they need to access the machine must be in place. The links must have sufficient bandwidth and capacity.
  This is particularly important if the links are already in use by other systems. There is no point in putting a disaster system in place if the users cannot use it.
• Copies of all relevant pieces of software and licensing agreements
• Backup of database
• Application-knowledgeable systems administration and operations staff, along with current documentation in written or electronic format

Planning
You should thoroughly test the disaster recovery plan on a regular basis, for example every six months. New versions of systems, software, and data are constantly being added, and the frequency of the test must take these ongoing changes into account. The strategy is normally audited, and you need someone to establish, prepare, and maintain it. The plans must be approved by the business and information systems users.

[Slide: Archiving Data]
• Determine data life expectancy
• Identify archive frequency
• Use read-only tablespaces
• Plan and design into early specifications

[Slide: Purging Data]
• Reduce data volumes:
  – Create summaries.
  – Remove unwanted base data.
• Choose the most effective method.
Archiving Data
The warehouse design needs to estimate and accommodate the data life expectancy. Establish how long you want to hold data before removing it completely from the live database. You may be required to archive old data to tape or to another database. In small and medium warehouse databases, the amount of data involved is generally small; in larger databases the data volumes involved may be significant.

Read-only Tablespaces
Because of their size, data warehouse databases require a backup method that is as fast as possible and reduces the amount of data to be backed up. You should use partitioned read-only tablespaces, which enable you to archive a tablespace while it is in read-only mode.
• You do not have to back up a read-only tablespace again after making the first backup.
• Read-only tablespaces reduce the cost of archive storage. They can be stored on less expensive media such as a CD-ROM. Ensure that the device to which you are writing can be accessed quickly.
• As part of your archive strategy, you can use read-only tablespaces to hold infrequently accessed data.
Data archiving can impose an ongoing heavy load on the system; if you do not plan for this in the design and implementation, it can have a detrimental effect on performance.

Purging Data
You may be able to reduce the amount of data held by summarizing and aggregating older data. For example, you may be able to summarize data into monthly and weekly summaries at the end of each month, and then remove the detail fact data. The removed detail data should be stored offsite in case it is needed to re-create the summary files.

When you remove data, always choose the most cost-effective method in terms of CPU and database resources. For example, in the case of Oracle, use a DROP command (for example, dropping the partition that holds the old rows if the table is partitioned) rather than the DELETE command to remove the unwanted rows. Unlike DELETE, DROP generates virtually no rollback and redo information for the rows removed.
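The summarize-then-purge sequence described above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module in place of Oracle; the table names (sales_detail, sales_monthly), the column names, and the sample rows are all hypothetical. SQLite has no partitions, so the sketch falls back on DELETE; in Oracle, dropping a partition would be the cheaper option, as noted in the text. Archiving the detail rows offsite before removal is omitted here.

```python
import sqlite3


def summarize_and_purge(conn: sqlite3.Connection, cutoff: str) -> int:
    """Roll detail rows dated before cutoff into monthly totals, then delete them.

    Returns the number of detail rows removed.
    """
    with conn:  # one transaction: the summary and the purge commit together
        conn.execute(
            """
            INSERT INTO sales_monthly (month, product_id, total_amount)
            SELECT substr(sale_date, 1, 7), product_id, SUM(amount)
            FROM sales_detail
            WHERE sale_date < ?
            GROUP BY substr(sale_date, 1, 7), product_id
            """,
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM sales_detail WHERE sale_date < ?", (cutoff,))
        return cur.rowcount


def demo() -> tuple:
    """Build a tiny in-memory example and run the purge against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_detail (sale_date TEXT, product_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE sales_monthly (month TEXT, product_id INTEGER, total_amount REAL)")
    conn.executemany(
        "INSERT INTO sales_detail VALUES (?, ?, ?)",
        [
            ("1999-01-05", 1, 10.0),
            ("1999-01-20", 1, 15.0),
            ("1999-02-03", 1, 7.0),
            ("1999-03-01", 1, 4.0),  # not before the cutoff, so it stays
        ],
    )
    removed = summarize_and_purge(conn, "1999-03-01")
    summary = conn.execute(
        "SELECT month, product_id, total_amount FROM sales_monthly ORDER BY month"
    ).fetchall()
    remaining = conn.execute("SELECT COUNT(*) FROM sales_detail").fetchone()[0]
    return removed, summary, remaining
```

Running both statements in a single transaction is a deliberate choice: if the purge fails, the summary rows are rolled back too, so detail data is never lost without a matching summary.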
[Slide: Improving Query Efficiency]
• Improve database design
• Use indexes
• Use governors
• Use prepared and tested queries
• Run large jobs out of hours
• Oracle 8i Resource Manager can guarantee resource availability to specific groups
• Use data marts

[Slide: Network Performance]
• Provide sufficient bandwidth
• Provide optimal configuration for access
• Identify middleware requirements
• Know refresh volumes
• Consider interaction with job scheduling software
• Use client-side processing
• Deploy data marts
• Analyze traffic

Identifying Data Warehouse Performance Issues

Improving Query Efficiency
The basic design has many implications for performance. A poor design is never going to provide efficient access; if the design is poor, consider redesigning the database.
• Ensure that indexes exist on key values to minimize full-table scans.
• Write SELECT statements so that they retrieve only the minimum amount of data required.
• Administer resource governors (query blocking) on the server, and in the tools where they have governing capabilities.
• Make prepared and pretested queries available to users.
• Submit large jobs out of working hours, or when CPU usage and network and I/O contention are at a minimum.
• Oracle 8i Resource Manager can guarantee resource availability to specific groups.
In addition to the above considerations, you may also consider using a data mart strategy to offload query activity to a smaller subset of the warehouse data.

Network Performance
The data warehouse environment is commonly distributed (a data warehouse feeding data marts), using networks to provide data transfer mechanisms. The network must be planned and set up to meet data movement and access requirements. The network should not restrict users' access to data. You need to:
• Ensure that the network has appropriate bandwidth, particularly for load processing.
• Ensure that the configuration of the environment is optimal for user access to data.
• Identify whether any middleware is needed to convert data or read non-Oracle data.
• Identify update frequencies and ensure that the network is capable of handling the volumes.
• Consider how the job scheduling software interacts with the network setup.
• Decide where intensive processing activities (such as summarizing and sorting) take place: use tools that perform them on the client side, or let the server itself perform them.
• Deploy data marts at remote locations.

Analyzing Network Traffic
You should consider using tools to analyze current activity and aid in the preliminary planning of the requirements.

[Slide: Review and Revise]
Monitor the warehouse:
• Usage
• Access
• Accurate grain
• Detail data
• Periodicity
Review and Revise
Once the data warehouse is in use, you should monitor it to determine which data is being accessed and how frequently. You should also use this information to determine whether the grain of the data is right for the user requirements.

Often data may have to be stored at different levels of granularity to answer sophisticated user queries. This is referred to as multiple granularity. If a user often requests simple annual sales figures for a given product, this may be satisfied with a summary table. If the user requests sales figures for a product by month, then you can provide the same information from 12 time-series tables. Of course, this involves extra processing. You need to determine early on the levels of granularity and how long they are to remain in place in the warehouse.

You should balance the issues against your requirements and resources:
• How often is detail data access required? This determines the real need for details and their duration.
• What are the benefits of keeping detail for a specified period?
• Do the benefits outweigh the cost in machine resources?
These questions, and others, can be answered in part with stringent query monitoring to give you usage information. Use this to calculate benefits against costs.
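To make the idea of multiple granularity concrete, the following sketch rolls monthly sales figures up into an annual summary. The function name, the dictionary layout, and the sample figures are hypothetical and serve only to show the relationship between the two grains.

```python
from collections import defaultdict


def roll_up_to_year(monthly_sales: dict) -> dict:
    """Aggregate {(product, 'YYYY-MM'): amount} into {(product, 'YYYY'): amount}."""
    annual = defaultdict(float)
    for (product, month), amount in monthly_sales.items():
        annual[(product, month[:4])] += amount  # first four characters are the year
    return dict(annual)


# Hypothetical monthly-grain data for one product.
monthly = {
    ("widget", "1998-11"): 100.0,
    ("widget", "1998-12"): 150.0,
    ("widget", "1999-01"): 120.0,
}
annual = roll_up_to_year(monthly)
print(annual)
```

Storing the annual figures as a summary table answers the frequent annual query without repeating this aggregation each time; the cost is the extra storage and the processing needed to keep the summary current.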
[Slide: Secret of Success]
Think big, start small

Secret of Success
Your eventual goal may be the enterprisewide solution, but take small steps to achieve it. The enterprisewide warehouse is not a realistic objective for your first pass. Always use the proven low-risk incremental approach.

[Slide: Summary]
The successful warehouse:
• Is driven by the business
• Focuses on objectives
• Adds value to the business
• Can be understood and used
• Delivers good data
• Performs well
• Belongs to the users
Summary
Success is achieved if the data warehouse:
• Is driven by a business community with clearly identified requirements. Remember that this is the primary objective of the data warehouse, and the users must be responsible for driving the end result.
• Focuses on the objectives outlined in the early stages of development.
• Adds value to the decision-making process, and can be seen to provide value with better and proven results. It is important that you define how the success of the warehouse is measured. Without any measures, you cannot determine whether the warehouse has added value.
• Can be understood by the business community. The data in the warehouse must be understood to ensure that the users are capable of using it to full effect. The data must also mean the same to all users. For example, an algorithm that provides a statistic must be documented in a way that every user can understand.
• Is used by the business community because the value it delivers is tangible. If the data warehouse does not deliver quality information with integrity that adds value to the business, then it will not be used.
• Performs as defined by the users in any agreements outlined early in development.
• Belongs to the users and not the IT department.
Appendix A: Practice Solutions

Practice 2-1
Answer the following questions.

1 OLTP databases hold up-to-the-minute information and are most commonly designed as read-only databases. True or False?
The correct answer is False because OLTP databases are not read-only databases.

2 In the scenario below, state whether it refers to an operational system or an analytical processing system.
“Show me how a specific brand of printer is selling throughout different parts of the United States and how this specific brand of printer is selling since it was first introduced into my stores.”
This scenario refers to:
a An operational system
b An analytical processing system
The correct answer is B because comparing sales between the different territories within the United States can provide a certain type of analytical information.

3 Who is the target audience for the data warehouse?
a The business community in the organization
b IT professionals
c Data-entry clerks
d None of the above
e All of the above
The correct answer is A because the main reason for having a data warehouse is to aid the business community in making better decisions.

4 Are the following statements true or false?
a Operational systems display the following qualities:
  Good performance: T
  Static data contents: F
  High availability: T
  Unpredictable CPU use: F
b Identify the reasons why business analysis is not easy with operational systems.
  Data is not structured for drill-down capability: T
  The system is not designed for querying: F
  Data analysis can be CPU-intensive: T
  Data is not integrated between systems: T

5 In groups of three or four, discuss the questions below and present your points to the class at the end of the discussion.
a List some of the reasons that your company is considering implementing a data warehouse or data mart.
b What are some of the business problems that your company is trying to answer?
c Why is the business community in your organization unable to find the answers to their business questions based on the existing information systems?

General Answers
Why data warehousing? According to Aaron Zornes, from the Meta Group, “IT organizations are under tremendous pressure to provide better quality decision-making information in forms easy to access and manipulate. Business users are reacting to their own mission-critical needs for better information due to rapidly changing, increasingly volatile and competitive markets, as well as ever-shortening product life cycles.”
Enterprises must become more competitive and get closer to their customers to survive.
Some of the reasons why existing information systems are unable to provide the answers to business questions are:
– Much of the enterprise data is locked up in data “jailhouses.”
– Operational systems are unable to provide a consolidated view of data.
– Answering some of the business questions requires analyzing data patterns and trends over time. This often requires large volumes of historical data. Operational systems do not keep historical data, so this type of analysis cannot be done in an operational system.
Practice 3-1

1 Indicate which attributes belong to a data warehouse. Indicate whether the statements are true or false.
a Data is organized by time. True
  Data exists in the data warehouse specifically for analysis by time.
b Data is always stored in a relational database. False
  It is not imperative that the data be stored in a relational database, although it is more common.
c Data relates to business-specific areas. True
  The data warehouse may be enterprisewide but the way the data is organized within the database is by departmental need, subject need, and functional need.
d Data is sometimes integrated. False
  Data must always be cleaned and integrated into the warehouse.
e Data is replaced according to a refresh cycle. False
  Data is added to and not replaced.
f Data warehouses may contain any type of data. True
  If the database server supports any type of data, then the warehouse is capable of holding any type of data.

2 _______ is a set of rules or structures providing a framework for the overall design of a system or product.
a Technical infrastructure
b Data-access environment
c Architecture
The correct answer is C.

3 The ________ is closely related to the architecture and consists of the technologies, platforms, databases, gateways, and other components necessary to make the architecture functional within the corporation.
a Data access environment
b Technical infrastructure
c Data warehouse
The correct answer is B.

4 A telco company needs to understand their network traffic to better pinpoint frequent trouble spots and predict network expansion and usage.
Storing call detail records and summarizing them by switch and trunk groups, among other things, in another environment will satisfy this need. Which of the following are you going to design?
a Operational data store (ODS)
b Data warehouse
The correct answer is B because monitoring over a period of time is required.

5 An online bookstore has customers in their Sales Order System and in their Marketing System. These customers do not match between systems, because Marketing staff do not always update the Marketing System with current and complete customer data. The need here is for an integrated system that contains current customer data. Which of the following are you going to design?
a Operational data store (ODS)
b Data warehouse
The correct answer is A because the organization needs current and integrated customer data.

6 Below are some of the benefits of data warehousing.
a Business decisions:
  – Improves decision making process
  – Provides basis for strategic planning efforts
  – Improves business decisions (quality and quantity)
  – Improves sales metrics
  – Improves trend visibility
  – Improves cost analysis
  – Improves inventory and distribution channel management
  – Improves monitoring of business initiatives
b Data access:
  – Improves data availability and timeliness
  – Improves data quality
  – Improves data integration
  – Improves access to historical information
  – Provides easier data access
  – Allows high performance data mining
  – Allows access to data not previously available
  – Improves data availability for customers
c Costs:
  – Reduces staff
  – Identifies lost revenue
  – Optimizes space utilization
  – Reduces inventory
  – Reduces inventory replenish time
d Productivity:
  – Provides access to data without programmer intervention
  – Facilitates elimination of legacy system
  – Reduces analysis efforts
  – Reduces impact on operational systems
  – Reduces manual analysis and data consolidation efforts

Practice 4-1
Interview Questions
Ask the key persons the following questions. Possible responses from each of the candidates are shown below.

Role 2: CFO
1 What is the business vision?
  – We are the market leader with a long tradition of dealing with drinks and beverages.
  – We have survived by having a strong and focused management.
2 Why does the company need an enterprise data warehouse?
  The board thinks that it is required to help maintain our competitive edge and market leading position.
3 What do you expect the data warehouse to provide, or what will you get out of the warehouse?
  Directly nothing because our financial systems are fine, but it should keep the IT Director happy.
4 How soon do you need to have data loaded into the data warehouse and how up-to-date does the data need to be?
  If we were to do this properly, we will need all the information in the warehouse up-to-date all the time.

Role 3: COO
1 What is the business vision?
  – Need to reengineer our core processes to maintain our market position.
  – Overall goal is to give my group better control over the business.
2 Why does the company need an enterprise data warehouse?
  To integrate the information from our disparate legacy systems and new systems as they come online; this should allow us to quickly analyze any of the information we hold.
3 What do you expect the data warehouse to provide, or what will you get out of the warehouse?
  – Detailed customer information, such as who buys our products and where our products go, and tracking information in case there is a need to recall things.
  – Let us see demographics of beverage types from around the world.
  – Allow us to perform “what-if” analysis.
4 How soon do you need to have data loaded into the data warehouse and how up-to-date does the data need to be?
  – Daily for our top 50 customers and a weekly update for the rest.
  – We would probably also want to resegment our customers based on new transactions, for example, once per month.

Role 4: IT Director
1 What is the business vision?
  To support the mission statement, we need bigger and better systems to enable us to become more competitive.
2 Why does the company need an enterprise data warehouse?
  In the new and modern business world you need a warehouse. Our competitors have one and we must have one in order to compete with them.
3 What do you expect the data warehouse to provide, or what will you get out of the warehouse?
– Better information
– Better control of new products
– Take our disparate systems and help integrate them, which will bring real business benefit and control
4 How soon do you need to have data loaded into the data warehouse, and how up-to-date does the data need to be?
We will have daily loads for our top 50 customers, with a weekly catch-up for the rest.

Class Discussion
1 Identify the major challenges for a data warehousing implementation project, as shown in this exercise.
2 Give your suggestions on how to overcome these challenges.
3 If you applied the Oracle Data Warehouse Method to this implementation project, how would you apply it, and where do you see the benefits of using this method?

General Answers
This exercise has been designed to get you thinking about some of the many issues that face any DSS implementation, regardless of size or complexity. The following sections outline some of the issues.

Political Issues
• Conflict between different parts of the business. In many businesses, very high barriers have been constructed between departments; thus the DSS can be considered a threat, because it will remove these barriers.
• Resistance to free and open information.
• General resistance to change. DSS implementations by their nature invoke an emotional reaction to change, so change management should be considered carefully. Avoid making statements such as “the system will help you make better decisions” because statements like this are emotionally charged.
• IT will tend to control the project. IT will see the problem as technical architecture and will therefore seek to own it.
• The business may see it as an IT project. This follows on from the last point. The business has to step up to the project. There are difficult decisions to make, such as what data we place in the system, how that data is defined, how long to keep it, and how to represent it. These are decisions that the business, and not IT, must take.

Approach Issues
The approach to the project will have a significant impact on its overall success. Some of the issues typically associated with a “bottom-up” approach include:
• The data warehouse may end up as a complex repository for operational data rather than one that can support the business decision making required. If this is the case, the business will inevitably lose faith in the system.
• The business will eventually lose faith in the data warehouse, and so it will become another piece of legacy.
• Failure to address data quality. A “bottom-up” approach led by IT will typically avoid tough issues such as data quality, because IT typically lacks the influence to solve the problem. The solution lies with the business and not IT.
• Over- or underengineering of the solution will result, because it is difficult to hit a target when you don’t know what it looks like, especially if it is a moving target.
• If the solution is seen by the business as technology rather than as a business solution, they are unlikely to invest time and effort in it. We know that if the business linkage is not present, the solution is unlikely to succeed.

Sponsorship Issues
• Sponsorship is critical to a project’s success.
• Sponsorship must be effective. It is all well and good to have senior business sponsorship in the project, but this must be effective and active sponsorship; that is, involvement must be more than just attending regular meetings.
• The key sponsorship chain is linked to business rather than IT.
This is largely because many of the more difficult, softer issues revolve around the business and therefore need a business pull rather than an IT push to resolve.
• Communication to all stakeholders within the business is critical. The aims and aspirations for the project, as well as its progress, should be communicated to assist in overcoming eventual resistance to change.

Business Vision Issues
The business must be clear about a number of factors:
• How the warehouse will add value to the business
• Why the warehouse will result in business change
• How business change will impact on the warehouse
You may have noticed that the above issues constitute a circular argument, which is important for everyone concerned to understand fully. If the warehouse is not going to change your business, why build one?

General Information Issues
Were the right questions asked and were honest answers always given?
• You will need to ask different questions of different parts of the business.
• You may not get the answers you need, because of a number of organizational and technical issues. Because much of the information we need is both tacit and politically sensitive, you should not be afraid to ask follow-up questions.
Practice 5-1
1 There are no standard solutions to item 1, as the answers are subjective and unique to each student.
2 Similarly, there are no standard solutions to item 2, as the answers are subjective and unique to each student. The expectation is that students will utilize every strategy deliverable listed in the table, as each deliverable is considered essential for a successful warehouse implementation.
3 See answer to item 2.

Practice 6-1
1 Complete the user profile column in this exercise with one of the following user types:
– Executive
– Casual user or manager
– Business analyst or power user

Name: Brian O’Reilly
Access Needs:
• Need to develop simple forecasts, such as budgets
• Ease of use is important
Technology:
• Microsoft Office
• Internet browser
• Spreadsheets
• Email
User Profile: Casual user or manager

Name: Mary Ramos
Access Needs:
• One click access
• Only need highly summarized information
• Ease of use is very important
Technology:
• Email
• Microsoft Office
• Internet browser
User Profile: Executive

Name: Kim Seng
Access Needs:
• Constantly wants to “get more data”
• Understands the organization’s business processes
Technology:
• Spreadsheet
• Oracle Reports
• Oracle Discoverer
• Oracle Express Analyzer
User Profile: Business analyst or power user

Name: Amber Salinas
Access Needs:
• Lots of drilling
• Customize graphical user interface (GUI)
• Needs to know data structures
Technology:
• Extensive SQL programming
• Oracle7X, Oracle8X Server
• Oracle Express
User Profile: Business analyst or power user

2 Answer true or false to the following questions.
Question (True or False)
a Do not involve users in the early process of the data warehouse implementation because they are going to delay your delivery date.
False
b Choose the warehouse data access tools by involving only IT staff because they are the ones who know what the users need. False
c Prototype access methods with prospective users. True
3 Security Consideration exercise: There are no standard solutions to this question.

Practice 7-1
1 Identify whether the following statements are true or false.
Question (True or False)
The business model is a logical representation of selected business processes. True
The star model is normalized. False
The snowflake model is denormalized. False
All warehouses must have a time dimension. True
In a warehouse environment, data loading performance is less important than query performance. True
2 Complete these sentences.
a Access to data in a _________ table is faster than calculating aggregates at the time of query execution.
The correct answer is summary.
b The data warehouse model contains ____ tables that comprise the measures of the business.
The correct answer is fact.
c Dimensions are denormalized in a _______ model.
The correct answer is star.
d A common guideline is to define granularity at one level ________ than currently used by end users.
The correct answer is lower.
3 There are no standard solutions to item 3, as the answers are subjective and unique to each student.
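
As an illustration of the sentences completed in Practice 7-1, item 2, the following Python sketch builds a very small star model and a summary table. It is only a sketch under stated assumptions: it uses Python's standard sqlite3 module rather than an Oracle server, and every table name, column name, and sample value in it is hypothetical. The fact table holds the measures, the product dimension is denormalized (the category is stored on each product row), and the summary table is precalculated with a CREATE TABLE ... AS SELECT statement so that a query can read the stored total instead of aggregating the fact rows at query time.

```python
import sqlite3


def build_star_model(conn):
    """Create a tiny, hypothetical star model plus one summary table."""
    cur = conn.cursor()
    cur.executescript(
        """
        -- Dimension tables (denormalized: category lives on the product row).
        CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, month TEXT);
        CREATE TABLE product_dim (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        -- Fact table: one row per sale, holding the measure (amount).
        CREATE TABLE sales_fact (
            time_id INTEGER, product_id INTEGER, amount REAL);
        """
    )
    cur.executemany("INSERT INTO time_dim VALUES (?, ?)",
                    [(1, "1999-01"), (2, "1999-02")])
    cur.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                    [(10, "Cola", "Soft drinks"),
                     (11, "Lemonade", "Soft drinks"),
                     (12, "Lager", "Beer")])
    cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                    [(1, 10, 100.0), (1, 11, 50.0), (1, 12, 70.0),
                     (2, 10, 120.0), (2, 12, 30.0)])
    # Summary table: aggregates are calculated once and stored.
    cur.execute(
        """
        CREATE TABLE sales_by_month_category AS
        SELECT t.month AS month, p.category AS category,
               SUM(f.amount) AS total_amount
        FROM sales_fact f
        JOIN time_dim t ON t.time_id = f.time_id
        JOIN product_dim p ON p.product_id = f.product_id
        GROUP BY t.month, p.category
        """
    )
    conn.commit()


def query_time_total(conn, month, category):
    """Aggregate the fact rows when the query runs."""
    row = conn.execute(
        """
        SELECT SUM(f.amount)
        FROM sales_fact f
        JOIN time_dim t ON t.time_id = f.time_id
        JOIN product_dim p ON p.product_id = f.product_id
        WHERE t.month = ? AND p.category = ?
        """,
        (month, category),
    ).fetchone()
    return row[0] if row[0] is not None else 0.0


def summary_total(conn, month, category):
    """Read the precalculated total from the summary table."""
    row = conn.execute(
        "SELECT total_amount FROM sales_by_month_category "
        "WHERE month = ? AND category = ?",
        (month, category),
    ).fetchone()
    return row[0] if row else 0.0


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_model(connection)
    print(query_time_total(connection, "1999-01", "Soft drinks"))
    print(summary_total(connection, "1999-01", "Soft drinks"))
```

Both functions return the same total. On a table this small the difference in speed is not visible; the point of the sketch is only that the summary lookup reads one stored row, whereas the query-time version must join and aggregate the fact rows on every call.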
Practice 8-1
1 Form into small groups, and consider each of the following hardware architectures. With your books closed, create a short definition for each architecture. Each answer should include the benefits and limitations of the architecture.
a Symmetric multiprocessing (SMP): Definition, benefits, and limitations. Please refer to pages 8-12 to 8-13.
b Non-uniform memory access (NUMA): Definition, benefits, and limitations. Please refer to pages 8-14 to 8-15.
c Clusters: Definition, benefits, and limitations. Please refer to pages 8-16 to 8-17.
d Massively parallel processing (MPP): Definition, benefits, and limitations. Please refer to pages 8-18 to 8-21.
2 Staying in your small group, discuss each of the following questions.
a What is parallelism?
It is the ability to perform functions in parallel.
b Why is it important to the data warehouse?
The appeal of parallel processing is especially strong for the data warehousing environment because of its emphasis on interactive processing of complex queries. Given this characteristic as well as the often extreme size of a warehouse, methods are clearly needed for more rapid query execution. By partitioning data among a set of processors, complex queries can be executed in parallel. This can potentially achieve linear speedup and thus significantly improve query response times.

Practice 9-1
1 For each of the following descriptions, state the type of partitioning method it best describes. The partitioning methods are range partitioning, hash partitioning, and composite partitioning.
Description
Places specific ranges of table entries on different disks.
For example, records having “name” as a key may have names beginning with A-B in one partition, C-D in the next, and so on. Likewise, a DSS managing monthly operations might partition each month onto a different set of disks.
Partitioning Method: Range

Distributes DBMS data evenly across the set of disk spindles. This partitioning method is applied to one or more database keys, and the records are distributed across disk subsystems accordingly.
Partitioning Method: Hash

The drawback of this partitioning method is that the quantity of data may vary significantly from one partition to another, and the frequency of data access may vary as well. For example, as the data accumulates, it may turn out that a larger number of customer names fall into the M-N range than the A-B range.
Partitioning Method: Range

This partitioning method is a combination of two partitioning methods. A table that is partitioned using this method is initially partitioned by range, and then subpartitioned using the hash method.
Partitioning Method: Composite

2 For each of the following descriptions, state the type of indexing method it best describes. The indexing methods are B-tree, bitmap, and index-organized tables.
Description
Contains a hierarchy of highest-level and succeeding lower-level index blocks. The upper-level blocks are called branch blocks and they point to the lower-level blocks. The leaf blocks are the lower-level blocks and they contain the unique ROWID that points at the location of the actual row.

This indexing method will benefit queries in which the WHERE clause contains multiple predicates on low-cardinality columns.
Indexing Method (for the two descriptions above, in order): B-tree, Bitmap

Table Row ID   Male   Female
0001           1      0
0002           0      1
0003           0      1
0004           1      0
Each row has a bit for each key. Each key value has a bit for each row.
Indexing Method: Bitmap

This method merges table data and index data into one structure. Thus, the data is the index and the index is the data.
Indexing Method: Index-organized table

3 Form into small groups, and consider each of the following questions. For each question, discuss in your groups and present your group’s answers to the class at the end of the discussion.
a How does RAID-5 differ from RAID-1?
RAID-1 (mirroring) is a strategy that aims to prevent downtime due to loss of a disk. Whereas RAID-5 in effect divides a file into chunks and places each on a separate disk, RAID-1 maintains a copy of the contents of a disk on another disk, referred to as a mirrored disk. Writes to a mirrored disk may be a little slower because more than one physical disk is involved, but reads should be faster because there is a choice of disks (and hence head positions) from which to seek to the required location.
b How do I decide between RAID-5 and RAID-1?
RAID-1 is indicated for systems where complete redundancy of data is considered essential and disk space is not an issue. RAID-1 may not be practicable if disk space is not plentiful. On a system where uptime must be maximized, Oracle recommends mirroring at least the control files, and preferably also the redo log files. RAID-5 is indicated in situations where avoiding downtime because of disk problems is important, or when better read performance is needed and mirroring is not in use.
c What variables can affect the performance of a RAID-5 device?
The major ones are the access speed of constituent disks; capacity of internal and external buses; number of buses; size of caches; number of caches; and nature of the algorithms used for determining how reads and writes are done.
d What types of files are suitable for placement on RAID-5 devices?
Placement of data files on RAID-5 devices is likely to give the best performance benefits, because these are usually accessed randomly. More benefit will be seen in situations where reads predominate over writes. Rollback segments and redo logs are accessed sequentially (usually for writes) and therefore are not suitable candidates for being placed on a RAID-5 device. Also, data files belonging to temporary tablespaces are not suitable for placement on a RAID-5 device.
4 For each of the descriptions below, assign the appropriate RAID level: RAID 0, RAID 1, or RAID 5.
Description: This RAID level has the lowest cost and highest performance.
RAID Level: 0
Description: This RAID level is low cost and has high availability.
RAID Level: 5
Description: This RAID level has high performance and high availability.
RAID Level: 1

Practice 10-1
Please answer the following questions.
1 The acronym ETT stands for _________.
The correct answer is extraction, transformation, and transportation.
2 Name at least four potential sources of production data for the warehouse.
Correct answers include production operational systems; archives; internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks; and external data from outside the company.
3 Name at least five potential sources of external data for the warehouse.
Correct answers include periodicals and reports; external syndicated data feeds; competitive analysis information; newspapers; purchased marketing, competitive, and customer-related data; and free data from the Web.
4 Identify whether the following statements are true or false.
Question
Archive data is never used in a data warehouse; it is too old.
   Archive data is particularly useful for the first-time load, to include historical data.
External data is one of the easiest types of data to incorporate into the warehouse.
   External data is difficult to incorporate, as it varies in frequency, grain, and predictability.
It is impractical to eliminate data anomalies after the pilot run.
   Never leave data cleanup this late.
Mapping data is a process whereby you eliminate data inconsistencies.
   Mapping identifies source data attributes, identifies where they are to reside in the warehouse, and identifies what transformations are needed.
Gateways are great mechanisms for transferring large volumes of data into the warehouse.
   Gateways are only useful for smaller amounts of data.
Extraction tools are expensive.
Transforming data occurs only in the staging area.
   It may take place at other points, though the staging area is most common.
True False X X X X X X X

Practice 11-1
1 Dirty data must be eliminated for the data warehouse. Name three alternative and common terms used to describe the process of eliminating anomalies in data.
The correct answer is cleaning, cleansing, and scrubbing.
2 Name at least five problems associated with source data that must be eliminated for the data warehouse.
The correct answer is multipart keys, multiple encoding, multiple local standards, multiple files, missing values, element names, element meaning, input formats, duplicate values, referential integrity, names and addresses.
3 Identify whether the following statements are true or false.
Question True False
It is considered impractical to eliminate data anomalies after the pilot run.
   Never leave data cleanup this late.
You need to consider adding time keys to warehouse data.
   All records must contain a time element contained in a key column.
Data transformation occurs only in the staging area.
   It may take place at other points, though the staging area is most common.
X X X

Practice 12-1
1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will determine the load window where you work. Consider user requirements, operational constraints, and staffing issues.
There is no single correct answer.
2 Identify whether the following statements are true or false.
Question True False
Transportation of data involves moving the data into the data warehouse database. X
   Strictly, transportation involves moving and loading the data.
The data refresh cycle is determined by information technology groups. X
   The cycle is determined by users.
The load window is the time that the IT group has dictated the data warehouse is available to the users for access. X
   The load window is the time available to perform all ETT tasks.
An example of high-level grain data is summarized data. X
Fact data frequently changes. X
   Fact data is frequently added to at every refresh.
Dimension data infrequently changes. X
   Dimension data changes but not as frequently as fact data is refreshed.
SQL*Loader is the fastest way to move data into the data warehouse database. X
Gateways are useful for moving large amounts of data into the warehouse. X
   Gateways are recommended only for small amounts of data.
Question True False
Data for the data warehouse is always indexed after it is loaded. X
   It is recommended, but data is not always indexed after loading.
The quickest way to create unique indexes on warehouse data is to leave database constraints enabled on load. X
   The fastest way is to disable constraints and then enable them after the data is loaded.
Summary tables are created on the warehouse server. X
Filtering removes unwanted records from staging files. X
   Filtering extracts data from the warehouse into data marts.
3 Name the two different types of data loading.
The correct answer is first-time load and refresh.
4 Name four methods of moving data to the warehouse server.
The correct answer is that there are five listed ways, and you may choose a hybrid of any of these.
– Wholesale data replacement
– Comparison of database instances
– Time and date stamping
– Database triggers
– Database log
5 What SQL command is used to create summary tables on the data warehouse server?
The correct answer is CREATE TABLE AS SELECT (CTAS), or CREATE TABLE AS SELECT... PARALLEL (pCTAS).

Practice 13-1
1 Identify whether the following statements are true or false.
Question
The data refresh cycle is determined by information technology groups.
   The cycle is determined by users.
Fact data frequently changes.
   Fact data is frequently added to at every refresh.
Dimension data infrequently changes.
   Dimension data changes but not as frequently as fact data is refreshed.
a b c True False X X X
2 Name four different techniques for capturing the changes to operational data that is to be loaded into the warehouse.
The correct answer is that there are five listed ways, and you may choose a hybrid of any of these.
– Wholesale data replacement
– Comparison of database instances
– Time and date stamping
– Database triggers
– Database log
3 Answer the following questions about updating dimension data.
a What method of updating dimension data would you employ if you wanted to keep old and new records?
The correct answer is keep history.
b What relationship would that map to in an entity relationship model?
The correct answer is one-to-many.
4 What server technique can be used to prevent and allow access to data in the warehouse after refresh?
The correct answer is the ROLES command.

Practice 14-1
1 Give one example of where metadata exists in an operational environment.
The correct answer is the database server data dictionary.
2 Why is metadata important to the following people?
a Users who are accessing the data warehouse
The correct answer is that it provides them with information about the data they are accessing, and shows them the meaning of data, context, summary levels, ownership, and many more attributes.
b IT staff developing ETT routines
The correct answer is that it contains all source data information, transformation routines, mapping, structure, and meaning of data.
3 Name two techniques you might employ to create metadata.
The correct answer is that you may have chosen two from this list: data modeling tools, data dictionary, ETT tools, end-user tools, COBOL copybooks, middleware tools.
4 Name two roles within the data warehouse development team who have responsibility for metadata.
The correct answer is metadata architect, metadata manager.
5 What is the issue with integration and metadata?
The correct answer is that many tools have their own metadata layers, which must be integrated for the environment.
6 What is important about the context of data?
The correct answer is that it allows the historical perspective of data to be constantly available.
7 Name the Oracle tool you can use to develop metadata.
The correct answer is Oracle Designer, Data Mart Suite, or OADW. Oracle Warehouse Builder will also support metadata management.

Practice 15-1
1 In the following scenarios, choose the type of analysis that most accurately defines the scenario. The types of analysis from which you may choose are:
– Query and reporting
– Multidimensional/OLAP
– Data mining
– Drill-down and pivot
– Calculations and derived data
– Spreadsheet
– Modeling, time-series and financial
– What-if

a. Show start date and salary grade for all employees reporting to Clare Maury.
Type of Analysis: Query and reporting
b. Highlight all orders above $30,000.00
• Drill from product totals to individual orders
• Look at a copy of the invoice
Type of Analysis: Drill-down and pivot
c. Show product sales in each region as a percentage of the total sales in that region.
Type of Analysis: Calculations and derived data
d. Did the $2 million promotion increase sales?
Type of Analysis: Modeling, time-series and financial
e. How many people to hire, when to hire them, and where to locate them.
Type of Analysis: Modeling, time-series and financial
f. If we lowered prices, would our overall revenue increase?
Type of Analysis: What-if
g. Find me the relationship between X and Y.
Type of Analysis: Data mining
h. Show me all the products that are currently back-ordered.
Type of Analysis: Query and reporting
i. What is the 13-week moving average of sales?
Type of Analysis: Calculations and derived data
j. Projecting costs and allocating overhead based on head count, sales forecasts, and consumer price index (CPI).
Type of Analysis: Modeling, time-series and financial

2 For the following phrases and sentences, determine which category each of them belongs to.
You may choose from the following list.
• Data
• Information
• Knowledge
• Decision

Description: Mary lives in Belmont Shores, California.
Category: Information
Description: Point of sale (POS)
Category: Data
Description: AppleTree juice is bought 45% of the time that Crystal Geyser juice is bought.
Category: Knowledge
Description: Let us promote Crystal Geyser juice on the East Coast of the United States in stores.
Category: Decision
Description: Demographic
Category: Data
Description: Customers of the upper middle class will use 10% of their annual income during the Christmas holiday season.
Category: Knowledge

3 The diagram below illustrates an example of data mining. The technique that it uses is called _________________.
The correct answer is artificial neural network.
[Diagram labels: Age, Region, Loyal, Call Rate, Lost, Service]
4 The description below describes a data mining technique. What is the technique used?
The correct answer is decision tree.
1. If the vehicle has a 2-door frame AND
2. If the vehicle has at least six cylinders AND
3. If the buyer is less than 40 years old AND
4. If the cost of the vehicle is > $35,000 AND
5. If the vehicle color is red, THEN
6. The buyer is likely to be male.

Practice 16-1
Web-Based Tools Requirement Checklist
There are no standard solutions to this question.
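
Looking back at Practice 15-1, item 4, the six-line rule can be read as a single path through a decision tree: each condition is one test, and the conclusion is reached only if every test passes. The Python sketch below is one possible way to write that path down. The class name, field names, and function name are hypothetical and are introduced only for this illustration; the thresholds are taken from the rule as printed.

```python
from dataclasses import dataclass


@dataclass
class VehicleSale:
    """One sale record; all field names are hypothetical."""
    doors: int
    cylinders: int
    buyer_age: int
    cost: float
    color: str


def rule_predicts_male(sale: VehicleSale) -> bool:
    """Return True when all five conditions of the rule hold.

    A False result means only that this rule does not apply;
    it says nothing about who the buyer is likely to be.
    """
    return (
        sale.doors == 2                    # 2-door frame
        and sale.cylinders >= 6            # at least six cylinders
        and sale.buyer_age < 40            # buyer is less than 40 years old
        and sale.cost > 35000              # cost is greater than $35,000
        and sale.color.lower() == "red"    # vehicle color is red
    )


if __name__ == "__main__":
    print(rule_predicts_male(VehicleSale(2, 6, 35, 40000.0, "red")))
    print(rule_predicts_male(VehicleSale(4, 6, 35, 40000.0, "red")))
```

A real decision tree produced by a data mining tool would contain many such paths, one for each leaf; this sketch covers only the single path quoted in the exercise.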
Glossary

A
Access: The process of accessing the data warehouse database objects containing data, using tools that perform analysis, run standard queries, provide statistical information, and mine data. See OLAP, Data mining, Data access.
Additive: Measurements in a fact table that can be added across all the dimensions. See Dimension.
Ad hoc: One time only, casual, nonplanned access to the database. See Access, Data access.
Aggregated data: Precalculated and prestored summary data that is held in tables in the data warehouse. Aggregated data provides direct access to calculated data that improves query performance. Functions used to calculate aggregated data include SUM, MAX, MIN, COUNT, and AVG. See Summary tables.
Aggregated facts: See Aggregated data, Summary tables.
Application Program Interface (API): A set of calling conventions that allow application programs to access computing services. APIs present application developers with a published interface to computing services that can be used with other facilities to provide a single-system image across a heterogeneous network of processors.
Atomic data: The data at its lowest level of detail that provides the base data for all data transformations.
Attribute: Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.

B
Backup and recovery strategy: A storage and recovery strategy that protects against business information loss resulting from hardware, software, or network faults.
BAP: See Business Alliance Program.
Batch: A computer environment that processes an action or user request without user interaction.
Some batch programs work in the background, allowing simultaneous user access.
Bitmap index: A specialized form of index indicating the existence or nonexistence of a record by a series of ones and zeros. Prevalent with the Oracle7 and Oracle8 database servers.
Bitmapped interface: See Graphical user interface.
Business: An enterprise, commercial entity, or firm in either the private or public sector, concerned with providing products or services to satisfy customer requirements.
Business area: The set of business processes within the scope of a data warehouse project.
Business Alliance Program (BAP): An Oracle initiative that invites vendors to offer products and services that are complementary to those offered by Oracle.
Atomic value: A data value that cannot be further decomposed.
Business metadata: The information provided to users that allows them to understand and access warehouse data. It focuses on what data is in the warehouse, how it was transformed, the source, and the timeliness of the data. See User metadata.
Business rule: A rule under which an organization operates. Business rules are applied to data using constraints.

C
C: A third generation programming language.
C++: A third generation programming language.
Cache: A temporary storage area in computer memory.
Cardinality: The number of rows in a table. See Table, Column, and Row.
CASE: See Computer-aided systems engineering.
Checkpoint: A database server event that, at a point in time, writes all modified database buffers in the system global area to the data files. The process controlling this action is called the Database Writer (DBWR).
Cleaning: See Cleansing.
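The Bitmap index entry above describes one series of ones and zeros per value. A minimal sketch of that idea follows, using plain Python lists as the bitmaps. It does not reflect how the Oracle server stores or compresses its indexes, and the table contents and function names are invented for illustration.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a list of 0/1 flags, one per row."""
    index = defaultdict(lambda: [0] * len(rows))
    for position, row in enumerate(rows):
        index[row[column]][position] = 1
    return dict(index)


def bitmap_and(left, right):
    """Combine two bitmaps: a row qualifies only if both flags are 1."""
    return [a & b for a, b in zip(left, right)]


# Hypothetical customer rows; low-cardinality columns suit bitmap indexes.
customers = [
    {"region": "East", "loyal": "Y"},
    {"region": "West", "loyal": "N"},
    {"region": "East", "loyal": "N"},
    {"region": "East", "loyal": "Y"},
]

region_idx = build_bitmap_index(customers, "region")
loyal_idx = build_bitmap_index(customers, "loyal")

# Rows where region = 'East' AND loyal = 'Y'
matches = bitmap_and(region_idx["East"], loyal_idx["Y"])
print(matches)
```

Combining conditions becomes a bitwise operation on the bitmaps, which is why this kind of index tends to suit columns with few distinct values.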
Cleansing: The process of transforming the operational and external source data into a defined and standardized format using packaged software applications or programs, prior to moving that data into the warehouse. Also referred to as data cleaning, data cleansing, or scrubbing. See Source data.
Client-server: A technical architecture that links many personal computers or workstations (clients) to one or more large processors (CPUs or servers). The architecture enables the separation of local client processing from the server that manages the databases, access, and data integrity. The architecture allows for optimal performance at both the client and the server sides.
Cluster: A means of sorting and storing related data from different tables in the database, on cluster keys. Advantageous in an environment where related data is commonly queried together.
COBOL: A third generation programming language.
Column: A means of implementing an item of data within a table. See Table, Row, Attribute.
Composite key: A key in a database table that is made up of a number of (column or field) values.
Compound key: See Composite key.
Computer-aided systems engineering (CASE): The combination of graphical, dictionary, generator, project management, and other software tools to help computer development staff engineer and maintain high-quality systems.
Concatenated key: See Composite key.
Concatenated index: An index that is created on a composite key. See Composite key.
Constellation model: A warehouse model that comprises a collection of star models. See Star model, Snowflake model.
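As an illustration of the Cleansing entry above, the sketch below standardizes one source record before it would be moved to the warehouse. The field names, the small state lookup table, and the chosen target formats are all assumptions made for the example; real cleansing rules come from the warehouse design.

```python
# Hypothetical lookup used to standardize state names to two-letter codes.
STATE_CODES = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}


def cleanse_customer(record):
    """Return a copy of `record` with name, state, and phone in a standard format."""
    raw_state = record["state"].strip()
    return {
        # Collapse repeated whitespace and apply consistent capitalization.
        "name": " ".join(record["name"].split()).title(),
        # Map known spellings to a code; otherwise fall back to upper case.
        "state": STATE_CODES.get(raw_state.lower(), raw_state.upper()),
        # Keep digits only so that all phone numbers share one format.
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }


dirty = {"name": "  mary   SMITH ", "state": "california", "phone": "(562) 555-0100"}
print(cleanse_customer(dirty))
```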
Constraint: 1. The part of the WHERE clause in an SQL SELECT statement that identifies the column or field value that qualifies the query. 2. Any external, management, or other factor that restricts a business or a systems development in terms of resources, availability, dependencies, timescales, or some other factor. See Business rule.
CORBA: Common Object Request Broker Architecture.
Corporate data model: A model of the business needs and data requirements for an online transaction processing system.
Cost-based optimizer: A statistical mechanism that analyzes where and how to retrieve data from the Oracle7, Oracle8, and Oracle8i servers to ensure fast access to data.
Cube: A commonly used name for a dimensional database where values can be analyzed across a minimum of three dimensions.

D
DASD: See Direct-access storage device.
Data access: See Access.
Data acquisition: The process of extracting, transforming, and transporting data from the source systems and external data sources to the data warehouse database objects. The term is synonymous with ETT, and is widely used within the Data Warehouse Method. See ETT.
Data aggregation: The process of redefining data into a summarization based on some rules or criteria. See Aggregated data, Aggregated facts, Summary tables.
Data Definition Language (DDL): SQL statements that create, modify, and remove database objects such as tables, indexes, and users. Common DDL statements are CREATE, ALTER, and DROP. See DDL.
Data extract: A subset of data extracted from one environment and transported to another environment. See Extract processing.
Data integrity: The quality of the data residing in the database objects. Constraints on the database tables enforce integrity rules.
Data Manipulation Language (DML): SQL statements that query and amend the database data. Common DML statements are SELECT, INSERT, UPDATE, and DELETE. See DML.
Data mart: A data warehouse data class organized for a business functional area or department.
The database contains data summarized at multiple levels of granularity and may be designed using relational or multidimensional database structures.
Data migration tools: Unspecified tools that allow data to be moved from the various sources into the data warehouse.
Data mining: A technique that discovers previously unknown patterns and relationships in data. Data mining queries may take a long time to execute.
Data warehouse: An enterprise-structured repository of subject-oriented, time-variant, integrated, historical data used for information retrieval. The very large data warehouse database stores atomic and summary level data. The data warehouse provides the source data for data marts within the enterprise.
Data Warehouse Method (DWM): A structured method for full life-cycle custom development data warehouse projects. It is based on the Custom Development Method. See Custom Development Method.
Database: A collection of data, usually in the form of tables or files, under the control of a database management system. See Database management system.
Database administrator: A person within the information technology (or information systems) organization who is responsible for administering, monitoring, and maintaining the database.
Database management system: The component of a database that controls all user and system activities related to the core functions of the database, such as security checking, tablespace allocation, and space management.
Data model: A representation of the specific information requirements of a business area. See Entity relationship diagram.
Data source: See Source.
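The Data Definition Language, Data Manipulation Language, and Aggregated data entries in this glossary can be seen together in a few statements. The sketch below runs them against an in-memory SQLite database through Python's standard `sqlite3` module purely for convenience. SQLite stands in for the Oracle server here, and the `sales` and `sales_summary` table names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create a database object.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# DML: insert and amend data.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("East", 250.0), ("West", 75.0)],
)
conn.execute("UPDATE sales SET amount = 80.0 WHERE region = 'West'")

# Aggregated data: precalculate and store a summary with SUM and COUNT.
conn.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT region, SUM(amount) AS total, COUNT(*) AS order_count "
    "FROM sales GROUP BY region"
)

# DML: query the stored summary instead of recomputing it from the detail rows.
summary = conn.execute(
    "SELECT region, total, order_count FROM sales_summary ORDER BY region"
).fetchall()
print(summary)
```

Queries that read `sales_summary` avoid rescanning the detail rows, which is the performance benefit the Aggregated data entry refers to.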
DBA: See Database administrator.
DBMS: See Database management system, Relational database management system.
DDL: See Data Definition Language.
Decision support: The act of using data and tools within an organization to support managerial decisions. Usually decision support involves the analysis of many units of data in a heuristic fashion. As a rule, decision support processing does not involve updating data. See Heuristic.
Decision support systems (DSS): An application used to provide summary or consolidated data to users for analysis, planning, and performing what-if analysis by using specialized tools that are usually driven by a GUI. See Graphical user interface.
Delta: A file created by an application that contains only changes made to the application.
Denormalization: A database design function that restructures a database by introducing derived data, replicated data, and repeating data. The technique is often employed to enhance performance within decision support and data warehouse environments. See Data warehouse, Decision support systems.
Denormalized data: The data within a denormalized database model. See Denormalization.
Dependent data mart: A data mart that is sourced directly from an existing data warehouse. See Data mart, Independent data mart.
Derived column: A value derived by some algorithm from the values of other columns. See Derived data.
Derived data: Data that exists only as a subset of other data. Also called Derived attribute.
Designer/2000: The Oracle computer-aided systems engineering (CASE) tool.
Detail data: See Fact data.
Developer/2000: The Oracle application building tool for query, reporting, database manipulation, and graphical display of database values.
Dimension: A construct within a multidimensional structure that represents a side of a multidimensional cube. Each dimension represents a different category that the business chooses to measure by, such as customer, region, product, and time.
Dimension data: The data by which the user queries the business measurables. Contained in dimension tables. See Fact data, Fact tables, Dimension table, Dimension model.
Dimension table: A table in a star model that is joined to the fact table by a key value.
Dimensional model: A model that supports a top-down design methodology. For each business process, it determines relevant facts and dimensions.
Direct-access storage device (DASD): A data storage unit where data can be accessed directly without having to progress through a serial file such as magnetic tape.
Dirty data: Data that is in an unfit state to be loaded into the data warehouse. It must be transformed first. See Transformation, Cleaning.
Discoverer: The Oracle end-user analysis, query, and reporting tool that is particularly good for use in the data warehousing environment.
DML: See Data Manipulation Language.
Drill-across: A technique that queries data from two or more fact tables in a single report.
Drill-down: An analytical technique that queries data from a summary row and navigates through a hierarchy of data to reach the detail-level rows.
Drill-up: An analytical technique that navigates from detail to header rows of data. Used to view summarized (or aggregated) data.
DWM: See Data Warehouse Method.

E
End User Layer (EUL): The user interface and layout of multidimensional structures designed for the data access tools. This includes customization of the tools for end users.
Enterprise: A group of departments, divisions, groups, or companies that make up a business. See Business.
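The Drill-down and Drill-up entries above describe moving between summary rows and detail rows. The sketch below shows both directions on a small in-memory list of orders. The data, the two-level hierarchy (product, then order), and the function names are assumptions made for illustration rather than the behavior of any particular Oracle tool.

```python
from collections import defaultdict

# Hypothetical detail-level rows.
orders = [
    {"product": "Juice", "order_id": 1, "amount": 120.0},
    {"product": "Juice", "order_id": 2, "amount": 80.0},
    {"product": "Water", "order_id": 3, "amount": 50.0},
]


def drill_up(rows, level):
    """Summarize detail rows into one total per value of `level`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)


def drill_down(rows, level, value):
    """Return the detail rows that make up one summary row."""
    return [row for row in rows if row[level] == value]


product_totals = drill_up(orders, "product")            # summary view
juice_orders = drill_down(orders, "product", "Juice")   # detail behind one total
print(product_totals)
print(juice_orders)
```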
Discrete: Usually used with reference to dimension attributes. Data, usually text, that takes on a fixed set of values that rarely change.
Enterprise Manager: An Oracle product that gives a GUI front end to systems and databases for enterprise-wide systems management.
Enterprise model: A neutral model of the business.
Entity relationship diagram (ERD): A diagram that pictorially represents entities, the relationships between them, and the attributes used to describe them.
Entity relationship model (ERM): A type of data model. Part of the business model that consists of many entity relationship diagrams. See Entity relationship diagram.
ETT: An acronym that stands for extraction, transformation, and transportation. It refers to the methods involved in cleaning operational data and moving it from source systems into the warehouse.
EUL: See End User Layer.
Express: The generic name of a suite of Oracle products that enable users to analyze multidimensional data and perform complex analysis for decision support.
External data: Data originating from a nonoperational source or outside the central processing complex, such as magazines, newspapers, and financial companies.
Extract processing: The process of selecting data from one environment and transporting it to another environment for use by individual users or departments.
Extraction: The process of selecting and pulling data from the operational and external data sources, in order to prepare it for the warehouse. Also called data extraction.
Extraction, transformation, and transportation: See ETT.

F
Fact data: The measurements, within the core of the data warehouse, on which all OLAP queries depend. See Online analytical processing, Fact table.
Fact table: The core (central) table in a star or snowflake model, characterized by a composite key. Values in the composite key join to keys in the dimension tables. See Composite key, Dimension table, Detail data.
Feedback: Response to requests, including corrections, additions, and approval elicited from users, sponsors, and any others with an interest in the data warehouse.
File Transfer Protocol (FTP): A method for transferring files from one location to another.
Foreign key: A key data value (which may comprise one or more columns) in a relational database table that joins to a primary key on another table. See Primary key.
Forms: See Oracle Forms.
FTP: See File Transfer Protocol.

G
Gap analysis: The process of determining and evaluating the variance between two items' properties.
Gateway: A technology that enables interserver communication using various communication protocols.
Generalized key: A dimension table primary key that is created by modifying an existing key. Generalized keys are also used with slowly changing dimensions and summary data.
Gigabyte: One thousand million bytes.
Grain: The level of detail of the data stored in the database or data warehouse or moved into the data warehouse from source systems.
Granularity: See Grain.
Graphical user interface (GUI): A user interface that is driven by point-and-click operations using a mouse rather than a keyboard. Also known as a bitmapped interface.

H
Heuristic: The process of learning by discovery.
Hierarchical database: An older style of database where records are strictly related and access is strictly defined.
Householding: In the financial services sector, assigning a customer account or individual to a collection of accounts, individuals, or locations for marketing purposes.
Hypercube: A multidimensional model supporting more than three dimensions. You can visualize this model by considering a number of three-dimensional cubes that are related to one another.
Hypertext Markup Language (HTML): The language used to create HTML pages for the Web using a word processor or text editor.
Hypertext Transfer Protocol (HTTP): The first component, the protocol, of a URL address, used widely in the Internet and intranet environment. HTTP defines how to interpret information. Other common protocols you may come across include FTP, news, and gopher. See Uniform Resource Locator.

I
Implementation: The installation of an increment of the data warehouse solution (hardware, software, documentation, training) that is complete, installed, tested, proved, operational, and ready to use.
Increment: The defined scope of the portion of the data warehouse selected for implementation. Each increment satisfies elements of the total data warehouse solution.
Incremental development: A technique for producing all or part of a production system based on an outline definition. The technique involves iterations of a cycle of build, refine, and review so that the correct solution emerges.
Independent data mart: A data mart that is sourced directly from operational systems. See Data mart, Dependent data mart.
Index: An area of the database storage dedicated to holding key data values to allow direct access to a database row.
Information requirement: The detail and summary data and access functionality required to satisfy the users' decision support and analysis functions for decision making and planning.
Initial load: The first population (insert) of the production data warehouse database with data from source systems. This load often contains large amounts of historical data. See Load, Refresh cycle.
Integrate: To take data from a variety of different sources, in different formats, and merge it into a single format.
Integrity rules: The laws that govern the operations allowed on the data and structures of a database.
Internal data: Data that resides within an organization's central processing complex.
Iterative development: The application of a cyclic, evolutionary approach to system development.

K
Knowledge worker: A person whose job relies on information as a primary resource.

L
Legacy system: An existing operational system that is used for entering data about the company's operations.
Level fields: Fields, often held in dimension tables, that relate to summary data stored in the central fact table. This is not a common approach to storing summary data.
Load: The process of moving extracted, transformed data into the data warehouse. See Initial load, Refresh cycle.
Load window: The time taken to load data from multiple source systems into the data warehouse. Can also be used to mean the time available for the data load.
Logical model: The phase of database design that is concerned with identifying the relationships among the tables.

M
Mapping: The process of matching data from source systems to the structures in the data warehouse.
Mapping tools: Tools used to perform mapping.
Massively Parallel Processor (MPP): A shared-nothing architecture that takes a number of nodes and enables them to communicate rapidly.
Metadata: Data that contains information about the data and structures in the data warehouse. Metadata is both for business users and technical users. See Business metadata and User metadata.
Metalayer: An architectural component of the warehouse that resides between the warehouse data and the user, and contains metadata. See Metadata.
Middleware: A layer that provides an easy-to-use, intuitive presentation of the underlying data or data structures.
MOLAP: See Multidimensional online analytical processing.
Multidimensional analysis: See Online analytical processing.
Multidimensional database: A database management system where data can be viewed and manipulated in multiple dimensions. It provides a structure that supports specialized query techniques such as drill-down, consolidation, and slicing and dicing. See Cube.
Multidimensional online analytical processing (MOLAP): Data is stored and presented to the user over three or more dimensions.

N
Nonadditive: A fact that cannot be logically added between records. May be numeric and must be combined in a computation with other facts before being added across records.
Nonuniform memory access (NUMA): A method of accessing shared memory on systems which have memory loosely coupled. Oracle Parallel Server can work with this access method.
Normalization: A technique that eliminates data redundancy. See Normalized data.
Normalized data: Data that has been separated into groups linked by defining normal relationships, where all redundancy in the data and repeating groups of data are removed. The usual normalization level is called third normal form, represented as 3NF. See Normalization.
NULL: The state of a data item that indicates no value.
NUMA: See Nonuniform memory access.

O
ODS: See Operational data store.
OLAP: See Online analytical processing.
OLAP Server: A multidimensional database that provides a data structure that enables flexible access to data and explores the relationship between summary and detail data.
OLTP: See Online transaction processing system, Operational system.
Online analytical processing (OLAP): A loosely defined set of principles that provide a dimensional framework for decision support. Online analytical processing allows for analysis of data to reveal business trends and statistics that are not immediately visible in operational data. Also known as multidimensional analysis.
Online transaction processing system (OLTP): The process whereby day-to-day transactional data is held in a repository that contains the operational data for the business.
Operational data: Data that is maintained and used for the day-to-day processing and functional requirements of the business.
Operational data store (ODS): A repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may act simply as a staging area for data to be moved into the warehouse.
Operational system: A system that handles the day-to-day transactional information that supports the client's business. See Online transaction processing system.
Oracle Expert: An expert systems advisor that generates performance tuning recommendations based upon a global system view. Suggestions regarding space allocation, schema design, and indexing strategies help DBAs tune VLDB environments.
Oracle Forms: An Oracle Developer/2000 tool for creating, maintaining, and running full-screen, interactive applications called forms. The forms enable users to see and change data in an Oracle database. They can be used in block mode, character mode, or bit-mapped environments.
Oracle Method: The methodology employed by Oracle for corporate system implementation. Incorporates the Data Warehouse Method and project management software.
Oracle Parallel Server: See Parallel Processor, Oracle Server.
Oracle Reports: The powerful, flexible Oracle Developer/2000 report-writing tool. Reports may be integrated with Oracle Forms or run stand-alone.
Oracle Server: The Oracle relational database management system (RDBMS). Components of the Oracle server include the kernel and various utilities for use by database administrators and users. See Relational database management system, Server.
Oracle Trace: A performance data management tool that collects, manages, and displays performance data from throughout the enterprise, including resource use (CPU, I/O, page faults) by user or component.

P
Parallel Processor: The Oracle server component that splits a single database action into many processes. See Parallel Query Option.
Parallel Query Option: The Oracle server option that splits a single database query request into a series of parallel query operations. See Parallel Processor.
Partitioned data: Data that is physically divided across many hard disks. Data may be partitioned horizontally or vertically. The technique improves application performance and security. Also called Data partitioning.
Partitioning: Splitting data across different units. Partitioning may be achieved at the system or application level.
Pilot: An initial project that serves as a model or template for future projects.
Pivoting: A query technique that enables the arrangement of rows and columns to be changed in a report.
PL/SQL: See Procedural SQL.
Primary key: A single or multiple column value that uniquely identifies a single row in a relational database table.
Procedural Gateway: Middleware that enables data on a non-Oracle database to be viewed from Oracle applications. See Middleware, Transparent Gateway.
Procedural SQL: An extension to Oracle SQL. It enables SQL to be embedded within third generation programming constructs such as GOTO and LOOP statements for finer programming control.
Process: 1. A key element of Oracle Method. A cohesive set or thread of related tasks that meets a specific project objective. A process results in one or more key deliverables. 2. A sequential execution of functions triggered by one or more events. See Oracle Method, Data Warehouse Method (DWM).
Proof-of-concept: An approach that contains a well-defined set of objectives and is scoped to demonstrate the immediate business benefit of an increment of the data warehouse. See Increment.

Q
Query Manager: Middleware that presents the user querying data with an easy-to-use and clear picture of the underlying business data.

R
RDBMS: See Relational database management system, Oracle Server.
Reach-through: Used by online analytical processing tools to access data directly on a relational database server. The tool presents the data in a multidimensional manner.
Reference data: Data held in reference tables. See Reference tables.
Reference tables: Hold textual data that contain expanded descriptions of data resident in dimension tables.
Referential integrity: A condition that guarantees that the values in one column also exist in another column. This guarantee is enforced through the use of integrity constraints.
Refresh: The process of updating the data warehouse database objects with new data. The refresh process occurs on a predefined and scheduled basis after initial load. See Initial load, Refresh cycle.
Refresh cycle: The frequency by which data in the data warehouse database objects is updated with new data. The cycle is determined by user business requirements. It is the regular process of updating the data warehouse with further fact (detail) data and creating appropriate summary tables and data indexes.
Relational database management system (RDBMS): Software that creates and maintains the database system, as well as the data stored in the database (in Oracle terms, Version 6 and earlier). See Server.
Relational online analytical processing (ROLAP): An implementation that presents the user with a multidimensional view of data that originates from a relational database structure.
Replication: Method whereby copies of databases are maintained at multiple sites in a distributed system, to improve availability and response times. Replication is frequently employed as part of a backup and recovery strategy.
Reports: See Oracle Reports.
ROLAP: See Relational online analytical processing.
Row: A series of attributes that identify the characteristics, to be stored on the database, of a significant object, such as a person. Also referred to as tuple. See Table.

S
Schema: A logical representation or model of a database structure.
Scrubbing: See Cleansing.
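The Referential integrity entry above says the guarantee is enforced through integrity constraints. A minimal sketch of such a constraint follows, again using SQLite through Python's `sqlite3` module as a stand-in for the Oracle server. The `product` and `sales_fact` tables are hypothetical, and SQLite only enforces foreign keys once the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Dimension table with a primary key.
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")

# Fact table whose product_id must already exist in the dimension table.
conn.execute(
    "CREATE TABLE sales_fact ("
    " product_id INTEGER NOT NULL REFERENCES product(product_id),"
    " amount REAL)"
)

conn.execute("INSERT INTO product VALUES (1, 'Juice')")
conn.execute("INSERT INTO sales_fact VALUES (1, 9.5)")  # accepted: product 1 exists

try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 1.0)")  # no product 99
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```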
Semiadditive: A numeric fact that can be added along some dimensions in a fact table but not others.
Server: Software that handles the functions required for concurrent, shared access to a database. The server receives and processes SQL and PL/SQL statements originating from client applications. The computer that runs the Server must be optimized for its duties. The Oracle server was previously called the Relational database management system. See Relational database management system.
Slice and dice: A mechanism whereby a query can analyze information along any dimension of the multidimensional model equally.
Slowly changing dimensions: The tendency of dimension records, particularly the product and customer dimensions, to change gradually or occasionally over time.
Snapshots: A copy (or dump) of the data in a database at any given point in time.
Snowflake model: A normalized version of the star model, employed in data warehouse implementations. See Star model, Constellation model.
Source data: The data that is used as the basis of warehouse data, which may be from a database, flat files, or magazine articles. Also called data source.
SQL*Loader: An Oracle tool that enables streams of data to be loaded into files or a database.
SQL (Structured Query Language): The internationally accepted standard language for relational systems. See Data Manipulation Language, Data Definition Language.
SQL statement: A complete command or statement written in the SQL language.
Staging area: A file, operational data store, or series of relational database server tables that contains the data to be moved to the warehouse.
Star query: Optimization technique that enables the dimensions and fact tables in the star model to be accessed efficiently, and data to be returned to the user efficiently. It ensures that the dimension data is visited first, and the fact data last and only once.
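The difference between the Semiadditive and Nonadditive entries in this glossary is easiest to see with numbers. In the sketch below, an account balance is treated as a semiadditive fact and a profit margin as a nonadditive fact. Both data sets, the field names, and the choice of these two facts are assumptions made for illustration.

```python
# Semiadditive: balances can be added across accounts, but not across months.
balances = [
    {"account": "A", "month": "1999-01", "balance": 100.0},
    {"account": "B", "month": "1999-01", "balance": 50.0},
    {"account": "A", "month": "1999-02", "balance": 120.0},
    {"account": "B", "month": "1999-02", "balance": 70.0},
]


def total_for_month(rows, month):
    """Adding along the account dimension gives a meaningful total."""
    return sum(r["balance"] for r in rows if r["month"] == month)


def closing_balance(rows, account):
    """Along the time dimension, take the latest balance instead of a sum."""
    latest = max((r for r in rows if r["account"] == account), key=lambda r: r["month"])
    return latest["balance"]


# Nonadditive: a margin ratio must be recomputed from its additive components.
sales = [
    {"revenue": 100.0, "profit": 20.0},   # margin 0.20
    {"revenue": 300.0, "profit": 30.0},   # margin 0.10
]


def overall_margin(rows):
    """Sum the additive facts first, then divide; do not add the ratios."""
    return sum(r["profit"] for r in rows) / sum(r["revenue"] for r in rows)


print(total_for_month(balances, "1999-01"))
print(closing_balance(balances, "A"))
print(overall_margin(sales))
```

Summing account A's balances over the two months would give 220.0, a figure that describes no real amount of money, which is why a fact of this kind is added along some dimensions only.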
Star model
A database organization in which a fact table with a composite key is joined to a number of single-level dimension tables. The model is used in data warehouse implementations. See Constellation model, Snowflake model.

Subject area
A vertical portion of the business, such as Sales and Marketing, that is developed as an iteration of the enterprise-wide data warehouse.

Summary data
Data that is aggregated and stored in a summary fact table and made available to the user for direct and easy access.

Summary table
A data structure in the warehouse that contains summarized (or aggregated) facts. See Summary data.

Symmetric Multiprocessor (SMP)
A shared-everything hardware and software architecture, where memory and disk controllers are accessible to all CPUs. See CPU.

System Global Area (SGA)
A large area of memory allocated to a database instance for caching. See Cache.

T

Table
A relational database structure that comprises vertical columns (attributes) and horizontal rows (tuples) of data. See Primary key, Row, and Column.

Terabyte
One trillion bytes.

Time stamp
A date and time value written to a record when it is created or changed in the database.

Transformation
The process of redefining data based on predefined rules, using specific formulas and techniques. Also called data transformation. See ETT.

Transparent Gateway
Middleware that enables viewing of data resident in a non-Oracle database from Oracle applications. See Middleware, Procedural Gateway.

Transportation
The movement of data to the warehouse server. Also called data transportation. See ETT.

U

Uniform Resource Locator (URL)
Text used to identify and address an item in a computer network.

Usage curve
A line chart showing the amount of CPU used at any time during normal system activity.

User
A person at any level of the organization who needs to access the data in the data warehouse for information in order to perform a business function.

User metadata
The information provided to users that allows them to understand and access warehouse data. It focuses on what data is in the warehouse, how it was transformed, the source, and the timeliness of the data. See Business metadata and Transformation.

V

Very large database (VLDB)
A database whose size is measured in gigabytes and terabytes.

Very large memory (VLM)
Computers with 64-bit memory structures.

VLDB
See Very large database.

VLM
See Very large memory.

W

Warehouse manager
The mechanism that maintains the data in the warehouse database.

Warehouse Technology Initiative (WTI)
An Oracle program that invites other vendors to offer products and services that are complementary to those offered by Oracle, particularly in the area of products and services related to data warehousing.

WTI
See Warehouse Technology Initiative.