Data Warehousing Fundamentals
noynot@163.com

Course Objectives
After completing this course, you should be able to do the following:
• Describe the role of business intelligence (BI) and data warehousing in today's marketplace
• Describe data warehousing terminology and the various technologies that are required to implement a data warehouse
• Explain the implementation and organizational issues surrounding a data warehouse project
• Identify data warehouse modeling concepts
• Explain the extraction, transformation, and loading processes for building a data warehouse
• Identify management and maintenance processes that are associated with a data warehouse project
• Describe methods for refreshing warehouse data
• Explain warehouse metadata concepts
• Identify tools that can be employed at each stage of the data warehouse project
• Describe user profiles and techniques for querying the warehouse
• Identify methods and tools for accessing and analyzing warehouse data

Lessons
1. Business Intelligence and Data Warehousing
2. Defining Data Warehouse Concepts and Terminology
3. Planning and Managing the Data Warehouse Project
4. Modeling the Data Warehouse
5. Building the Data Warehouse: Extracting Data
6. Building the Data Warehouse: Transforming Data
7. Building the Data Warehouse: Loading Warehouse Data
8. Refreshing Warehouse Data
9. Leaving a Metadata Trail
10. Managing and Maintaining the Data Warehouse

Lesson 1: Let's Get Started

Lesson 1 Objectives
After completing this lesson, you should be able to do the following:
• Describe the role of business intelligence in today's marketplace
• Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
• Describe how extract processing for decision support querying led to the data warehouse solutions that are employed today
• Explain why businesses are driven to employ data warehouse technology

What Is Business Intelligence?
"Business Intelligence is the process of transforming data into information and through discovery transforming that information into knowledge." — Gartner Group

Purpose of Business Intelligence
The purpose of business intelligence is to convert the volume of data into business value through analytical reporting.
(Diagram: Data → Information → Knowledge → Decision; Volume → Value)

Early Management Information Systems
• MIS systems provided business data.
• Reports were developed on request.
• Reports provided little analysis capability.
• Decision support tools gave personal ad hoc access to data.
(Diagram: Production platforms → Operational reports → Ad hoc access → Decision makers)

Analyzing Data from Operational Systems
• Data structures are complex.
• Systems are designed for high performance and throughput.
• Data is not meaningfully represented.
• Data is dispersed.
• OLTP systems may be unsuitable for intensive queries.
(Diagram: Production platforms → Operational reports)

Why OLTP Is Not Suitable for Analytical Reporting
OLTP | Analytical Reporting
Information to support day-to-day service | Historical information to analyze
Data stored at the transaction level | Data needs to be integrated
Database design: normalized | Database design: denormalized, star schema

Data Extract Processing
End-user computing is offloaded from the operational environment.
(Diagram: Operational systems → Extracts → User's own data → Decision makers)

Management Issues with Data Extract Programs
(Diagram: Operational systems → Extracts → Decision makers; the "extract explosion")

Productivity Issues with Extract Processing
• Duplicated effort
• Multiple technologies
• Obsolete reports
• No metadata

Data Quality Issues with Extract Processing
• No common time basis
• Different calculation algorithms
• Different levels of extraction
• Different levels of granularity
• Different data field names
• Different data field meanings
• Missing information
• No data correction rules
• No drill-down capability

Data Warehousing and Business Intelligence
(Diagram: Legacy data, operations data, and external data feed the enterprise data warehouse, which feeds the data marts and analytical reporting)

Advantages of Warehouse Processing Environments
• Controlled, reliable, quality information
• Single source of data
(Diagram: Internal and external systems → Data warehouse → Decision makers)
• No duplication of effort
• No need for tools to support many technologies
• No disparity in data, meaning, or representation
• No time period conflict
• No algorithm confusion
• No drill-down restrictions

Success Factors for a Dynamic Business Environment
• Know the business
• Reinvent to face new challenges
• Invest in products
• Invest in customers
• Retain customers
• Invest in technology
• Improve access to business information
• Provide superior services and products
• Be profitable

Business Drivers for Data Warehouses
• Provide supporting information systems
• Get quality information:
  • Reduce costs
  • Streamline the business
  • Improve margins

Technological Advances Enabling Data Warehousing
Hardware, operating system, database, query tools, and applications:
• Large databases
• 64-bit architectures
• Indexing techniques
• Affordable, cost-effective open systems
• Robust warehouse tools
• Sophisticated end-user tools

The Difference Between the Two Kinds of Data

Summary
In this lesson, you should have learned how to:
• Describe the role of business intelligence in today's marketplace
• Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
• Describe how extract processing for decision support querying led to the data warehouse solutions employed today
• Explain why businesses are driven to employ data warehouse technology

Practice 1-1 Overview
This practice covers the following topics:
• Answering questions about data warehousing
• Discussing how data warehousing meets business needs

Lesson 2: Defining Data Warehouse Concepts and Terminology

Objectives
After completing this lesson, you should be able to do the following:
• Identify a common, broadly accepted definition of a data warehouse
• Describe the differences between dependent and independent data marts
• Identify some of the main warehouse development approaches
• Recognize some of the operational properties and common terminology of a data warehouse

Definition of a Data Warehouse
"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions." — W. H. Inmon, Building the Data Warehouse (1991)

"…a data warehouse is nothing more than the union of all the data marts…" — Ralph Kimball

"A data warehouse is the implementation of an informational database used to store sharable data sourced from operational databases. It is typically a subject database that enables users to tap into a vast store of operational data to track and respond to business trends and to support business forecasting and planning." — DM Review

"An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data." — Oracle's data warehouse definition

"A data warehouse is a process, not a project." — another way of describing a data warehouse

Data Warehouse Properties
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile

Subject-Oriented
Data is categorized and stored by business subject rather than by application.
(Diagram: OLTP applications such as equity plans, shares, insurance, loans, and savings map to the warehouse subject "customer financial information")

Integrated
Data on a given subject is defined and stored once.
(Diagram: The savings, current accounts, and loans OLTP applications map to a single "customer" subject in the warehouse)

Time-Variant
Data is stored as a series of snapshots, each representing a period of time.

Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
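The time-variant and nonvolatile properties can be illustrated with a minimal sketch (SQLite via Python here; the table, column names, and data values are invented for illustration): each load appends a dated snapshot, and normal warehouse processing only inserts and reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Time-variant: each row carries the snapshot period it describes.
# Nonvolatile: the warehouse only INSERTs and SELECTs; operational-style
# UPDATE/DELETE is not part of normal warehouse processing.
cur.execute("""
    CREATE TABLE account_balance_snapshot (
        snapshot_month TEXT,    -- e.g. '2002-01'
        account_id     INTEGER,
        balance        REAL
    )
""")

# First-time load, then periodic refreshes append new snapshots.
cur.executemany(
    "INSERT INTO account_balance_snapshot VALUES (?, ?, ?)",
    [("2002-01", 1, 500.0),
     ("2002-02", 1, 650.0)],
)

# Because history is preserved, period-over-period analysis is possible.
cur.execute("""
    SELECT snapshot_month, balance
    FROM account_balance_snapshot
    WHERE account_id = 1
    ORDER BY snapshot_month
""")
print(cur.fetchall())   # [('2002-01', 500.0), ('2002-02', 650.0)]
```

An operational system would instead overwrite the single current balance; keeping every snapshot is what makes trend analysis possible.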
(Diagram: The operational side supports insert, update, delete, or read; the warehouse side supports load and read only)

Changing Warehouse Data
(Diagram: Operational databases → warehouse database: a first-time load, followed by periodic refreshes, and eventually purge or archive)

Data Warehouse Versus OLTP
Property | OLTP | Data Warehouse
Response time | Subseconds to seconds | Seconds to hours
Operations | DML | Primarily read-only
Nature of data | 30–60 days | Snapshots over time
Data organization | Application | Subject, time
Size | Small to large | Large to very large
Data sources | Operational, internal | Operational, internal, external
Activities | Processes | Analysis

Usage Curves
• Operational system usage is predictable.
• Data warehouse usage is variable and random.

Enterprisewide Warehouse
• Large-scale implementation
• Scopes the entire business
• Data from all subject areas
• Developed incrementally
• Single source of enterprisewide data
• Synchronized enterprisewide data
• Single distribution point to dependent data marts

Central Ideas of Data Warehouse Design
• Choose a grain, or level of detail, appropriate to satisfy all the data marts.
• The design must not prevent the data marts from using a variety of technologies; it must accommodate multidimensional marts, statistics, mining, and exploration warehouses.

Data Marts
A data mart is a subset of the data warehouse data. Most analytical activity in a BI environment takes place in the data marts. The data in each data mart is usually tailored to a particular function and need not be valid for other uses.

Data Warehouses Versus Data Marts
Property | Data Warehouse | Data Mart
Scope | Enterprise | Department
Subjects | Multiple | Single subject, line of business (LOB)
Data sources | Many | Few
Implementation time | Months to years | Months

Dependent Data Mart
(Diagram: Operational systems — legacy data, flat files, operations data, external data — feed the data warehouse, which in turn feeds the marketing, sales, finance, and HR data marts; external data may also feed the marts directly)

Independent Data Mart
(Diagram: Operational systems and external data feed a stand-alone sales or marketing data mart directly, without an intervening warehouse)

Features of a Data Mart
• Not real-time data
• Consolidation and cleansing

Warehouse Development Approaches
• "Big bang" approach
• Incremental approach:
  • Top-down incremental approach
  • Bottom-up incremental approach

"Big Bang" Approach
1. Analyze enterprise requirements.
2. Build the enterprise data warehouse.
3. Report in subsets or store in data marts.

Top-Down Approach
1. Analyze requirements at the enterprise level.
2. Develop a conceptual information model.
3. Identify and prioritize subject areas.
4. Complete a model of the selected subject area.
5. Map to available data.
6. Perform a source system analysis.
7. Implement the base technical architecture.
8. Establish metadata, extraction, and load processes for the initial subject area.
9. Create and populate the initial subject area data mart within the overall warehouse framework.

Bottom-Up Approach
1. Define the scope and coverage of the data warehouse and analyze the source systems within this scope.
2. Define the initial increment based on political pressure, assumed business benefit, and data volume.
3. Implement the base technical architecture, and establish metadata, extraction, and load processes as required by the increment.
4. Create and populate the initial subject areas within the overall warehouse framework.

Incremental Approach to Warehouse Development
• Multiple iterations
• Shorter implementations
• Validation of each phase
(Diagram: Increment 1 — Strategy, Definition, Analysis, Design, Iterative Build, Production)

Data Warehousing Process Components
• Methodology
• Architecture
• Extraction, transformation, and load (ETL)
• Implementation
• Operation and support

Methodology
• Ensures a successful data warehouse
• Encourages incremental development
• Provides a staged approach to an enterprisewide warehouse that is:
  • Safe
  • Manageable
  • Proven
  • Recommended

Architecture
"Provides the planning, structure, and standardization needed to ensure integration of multiple components, projects, and processes across time." "Establishes the framework, standards, and procedures for the data warehouse at an enterprise level." — The Data Warehousing Institute

Extraction, Transformation, and Load (ETL)
"Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project." — DM Review, March 2001
(Diagram: Source → Staging area → Target)

Implementation
(Diagram: The data warehouse architecture is implemented incrementally, for example increment 1, increment 2, …, increment n)

Operation and Support
• Data access and reporting
• Refreshing warehouse data
• Monitoring
• Responding to change

Phases of the Incremental Approach
Each increment passes through the phases Strategy, Definition, Analysis, Design, Build, and Production.

Strategy Phase Deliverables
• Business goals and objectives
• Data warehouse purpose, objectives, and scope
• Enterprise data warehouse logical model
• Incremental milestones
• Source systems data flows
• Subject area gap analysis
• Data acquisition strategy
• Data quality strategy
• Metadata strategy
• Data access environment
• Training strategy

Summary
In this lesson, you should have learned how to:
• Identify a common, broadly accepted definition of a data warehouse
• Describe the differences between dependent and independent data marts
• Identify some of the main warehouse development approaches
• Recognize some of the operational properties and common terminology of a data warehouse

Practice 2-1 Overview
This practice covers the following topics:
• Answering questions regarding data warehousing concepts and terminology
• Discussing some of the data warehouse concepts and terminology

Lesson 3: Modeling the Data Warehouse

Objectives
After completing this lesson, you should be able to do the following:
• Discuss data warehouse environment data structures
• Discuss the data warehouse database design phases:
  • Defining the business model
  • Defining the dimensional model
  • Defining the physical model

Data Warehouse Modeling Issues
Among the main issues that data warehouse data modelers face are:
• Different data types
• Many ways to use warehouse data
• Many ways to structure the data
• Multiple modeling techniques
• Planned replication
• Large volumes of data

Data Warehouse Environment Data Structures
The data modeling structures that are commonly found in a data warehouse environment are:
• Third normal form (3NF)
• Star schema
• Snowflake schema

Star Schema Model
A central fact table is surrounded by denormalized dimension tables.
(Diagram: The Sales fact table — Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units — joins to the Product, Store, Time, and Item dimension tables)

Snowflake Schema Model
(Diagram: The Sales fact table joins to the Product, Store, Time, and Item dimensions; the Store dimension is further normalized into a District table, and the Item dimension into Dept and Mgr tables)
• Can be used directly by some tools
• More flexible to change
• Provides for speedier data loading
• Can become large and unmanageable
• Degrades query performance
• Requires more complex metadata
(Example hierarchy: Country > State > County > City)

Data Warehouse Database Design Phases
• Phase 1: Defining the business model
• Phase 2: Defining the dimensional model
• Phase 3: Defining the physical model

Phase 1: Defining the Business Model
• Performing strategic analysis
• Creating the business model
• Documenting metadata

Performing Strategic Analysis
• Identify crucial business processes.
• Understand the business processes.
• Prioritize and select the business processes to implement.
(Diagram: Candidate processes are plotted by business benefit against feasibility)

Creating the Business Model
Defining business requirements:
• Identifying the business measures
• Identifying the dimensions
• Identifying the grain
• Identifying the business definitions and rules
• Verifying data sources

Business Requirements Drive the Design Process
• Primary input: business requirements
• Secondary input: existing metadata, the production ERD model, and research

Identifying Measures and Dimensions
• Measures are attributes that vary continuously: balance, units sold, cost, sales.
• Dimensions are attributes that are perceived as constant or discrete: product, location, time, size.

Using a Business Process Matrix
(Sample matrix: business processes — sales, returns, inventory — against business dimensions — customer, date, product, channel, promotion)

Determining Granularity
Year? Quarter? Month? Week? Day?

Identifying Business Rules
Examples:
• Location: geographic proximity (0–1 miles, 1–5 miles, > 5 miles)
• Time: Month > Quarter > Year
• Product: type (monitor: 15-inch, 17-inch, 19-inch; PC; server), status (none, new, rebuilt, custom)
• Store: Store > District > Region

Documenting Metadata
Documenting metadata should include:
• Documenting the design process
• Documenting the development process
• Providing a record of changes
• Recording enhancements over time

Metadata Documentation Approaches
• Automated:
  • Data modeling tools
  • ETL tools
  • End-user tools
• Manual

Phase 2: Defining the Dimensional Model
• Identify fact tables:
  • Translate business measures into fact tables.
  • Analyze source system information for additional measures.
• Identify dimension tables.
• Link the fact tables to the dimension tables.
• Model the time dimension.

Star Dimensional Modeling
(Diagram: The Sales fact table — Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units — joins to the Product, Time, Store, and Item dimension tables)
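The star dimensional model above can be sketched end to end in a few lines (SQLite via Python here; the table and column names follow the slides' sales example, and the data values are invented for illustration):

```python
import sqlite3

# A minimal star schema: one fact table joined to denormalized
# dimension tables through foreign keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
                              product_desc TEXT);
    CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY,
                              district_id INTEGER);
    CREATE TABLE sales_fact  (product_id INTEGER REFERENCES product_dim,
                              store_id   INTEGER REFERENCES store_dim,
                              day_id     TEXT,
                              sales_amount REAL,
                              sales_units  INTEGER);

    INSERT INTO product_dim VALUES (1, 'Ham Pizza'), (2, 'Cheese Pizza');
    INSERT INTO store_dim   VALUES (10, 100), (11, 100);
    INSERT INTO sales_fact  VALUES (1, 10, '2002-01-02', 10.0, 1),
                                   (2, 10, '2002-01-02', 15.0, 1),
                                   (2, 11, '2002-01-03', 30.0, 2);
""")

# A typical star query: join the fact table to a dimension through the
# foreign key and aggregate the additive measures.
cur.execute("""
    SELECT p.product_desc, SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    GROUP BY p.product_desc
    ORDER BY p.product_desc
""")
print(cur.fetchall())   # [('Cheese Pizza', 45.0, 3), ('Ham Pizza', 10.0, 1)]
```

Note how the query stays simple: every analysis is a join from the fact table out to one or more dimensions, which is why front-end tools support the star model so well.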
Fact Table Characteristics
Fact tables:
• Contain numerical metrics of the business
• Can hold large volumes of data
• Can grow quickly
• Can contain base, derived, and summarized data
• Are typically additive
• Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
(Example: the Sales fact table with Product_id, Store_id, Item_id, Day_id, Sales_amount, and Sales_units)

Dimension Table Characteristics
Dimension tables have the following characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference

Star Dimensional Model Characteristics
• The model is easy for users to understand.
• Primary keys represent a dimension.
• Non-foreign-key columns are values.
• Facts are usually highly normalized.
• Dimensions are completely denormalized.
• Fast response to queries is provided.
• Performance is improved by reducing table joins.
• End users can express complex queries.
• Support is provided by many front-end tools.

Using Time in the Data Warehouse
• Defining standards for time is critical.
• Aggregation based on time is complex.

The Time Dimension
• Time is critical to the data warehouse.
• A consistent representation of time is required for extensibility.
• Where should the element of time be stored: in the sales fact or in the time dimension?

Using Data Modeling Tools
• Tools with a GUI enable definition, modeling, and reporting.
• Avoid a mix of modeling techniques caused by:
  • Development pressures
  • Developers who lack knowledge
  • No strategy
• Determine a strategy.
• Write and publish it formally.
• Make it available electronically.

Phase 3: Defining the Physical Model
• Translate the dimensional design into a physical model for implementation.
• Define the storage strategy for tables and indexes.
• Perform database sizing.
• Define the initial indexing strategy.
• Define the partitioning strategy.
• Update the metadata document with physical information.

Physical Model Design Tasks
• Define naming and database standards.
• Perform database sizing.
• Develop an initial indexing strategy.
• Develop a data partition strategy.
• Define storage parameters.
• Set initialization parameters.
• Use parallel processing.
• Define summary data.
• Determine the hardware architecture.

Database Object Naming Conventions
• Develop a reasonable list of abbreviations.
• List all the objects' names, and work with the user community to define them.
• Resolve name disputes.
• Document your naming standards in the metadata document.
• Plan for the naming standards to be a living document.

Architectural Requirements
Scalability, manageability, availability, extensibility, integration, and flexibility, balanced against user, budget, business, and technology constraints.

Strategy for Architecture Definition
• Obtain existing architecture plans.
• Obtain existing capacity plans.
• Document existing interfaces.
• Prepare a capacity plan.
• Prepare the technical architecture.
• Document operating system requirements.
• Develop recovery plans.
• Develop security and control plans.
• Create the architecture.
• Create a technical risk assessment.

Hardware Requirements
• SMP
• Cluster
• MPP
• Hybrids (employing both SMP and MPP)

Making the Right Choice
• Requirements differ from operational systems.
• Benchmark:
  • Use benchmarks available from vendors.
  • Develop your own.
  • Use realistic queries.
• Scalability is important.

Storage and Performance Considerations
• Database sizing: test load sampling
• Data partitioning: horizontal and vertical partitioning
• Indexing: B-tree indexes, bitmap indexes, bitmap join indexes
• Star query optimization: star transformation

Database Sizing
• Sizing influences capacity planning and systems environment management.
• Sizing is required for the database and for other storage areas.
• Sizing is not an exact science; techniques vary.
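The horizontal-partitioning idea above (splitting table data by time so that each period can be managed independently) can be sketched as follows. This is a minimal illustration only: SQLite has no native partitioning, so each partition is modeled as its own table, whereas Oracle would use a single range-partitioned table; all names and routing rules are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One "partition" table per time range.
partitions = ("sales_2002_q1", "sales_2002_q2")
for name in partitions:
    cur.execute(f"CREATE TABLE {name} (day_id TEXT, sales_amount REAL)")

def target_partition(day_id: str) -> str:
    """Route a row to its partition by the candidate time column."""
    month = int(day_id[5:7])        # day_id is 'YYYY-MM-DD'
    return "sales_2002_q1" if month <= 3 else "sales_2002_q2"

rows = [("2002-02-10", 10.0), ("2002-05-01", 25.0)]
for day_id, amount in rows:
    cur.execute(f"INSERT INTO {target_partition(day_id)} VALUES (?, ?)",
                (day_id, amount))

# A query constrained to one period touches only one partition, which
# is what makes partitioned warehouse data easier to manage, archive,
# and remove.
cur.execute("SELECT COUNT(*) FROM sales_2002_q1")
print(cur.fetchone()[0])   # 1
```

The routing column is exactly the kind of column that "appears in a WHERE clause," which is why the slides list time as the prime partitioning candidate.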
Test Load Sampling
• Analyze a representative sample of the data, chosen using proven statistical methods.
• Ensure that the sample reflects:
  • Test loads for different periods
  • Day-to-day operations
  • Seasonal data and worst-case scenarios
  • Indexes and summaries

Data Partitioning
Partitioning breaks data up into separate physical units that can be handled independently. It provides ease of:
• Restructuring
• Reorganization
• Removal
• Recovery
• Monitoring
• Management
• Archiving
• Indexing

Horizontal Partitioning
• Table and index data are split by:
  • Time
  • Sales region or person
  • Geography
  • Organization
  • Line of business
• Candidate columns appear in a WHERE clause.
• Analysis determines the requirements.

Vertical Partitioning
You can use vertical partitioning when:
• It improves the speed of query and update actions
• Users require access to specific columns
• Some data is changed infrequently
• Descriptive dimension text may be better moved away from the dimension itself

Partitioning Methods
• Range partitioning
• List partitioning
• Hash partitioning
• Composite partitioning:
  • Composite range-hash partitioning
  • Composite range-list partitioning
• Index partitioning

Indexing
Indexing is used for the following reasons:
• It is a huge cost saving, greatly improving performance and scalability.
• It can replace a full table scan with a quick read of the index, followed by a read of only those disk blocks that contain the rows that are needed.

B-Tree Indexes
• The most common type of index
• Used for high-cardinality columns
• Designed for queries that return few rows

Bitmap Indexes
• Provide performance benefits and storage savings
• Store values as 1s and 0s
• Are used instead of B-tree indexes when:
  • Tables are large
  • Columns have relatively low cardinality

Bitmap Join Indexes
A bitmap join index is a bitmap index for the join of two or more tables:
• They are new in Oracle9i.
• They provide better performance and storage savings.

Parallelism
(Diagram: Parallel execution servers P1–P3 scan the Sales and Customers tables concurrently)

Using Summary Data
Designing summary tables offers the following benefits:
• Provides fast access to precomputed data
• Reduces the use of I/O, CPU, and memory

Summary
In this lesson, you should have learned how to:
• Describe data warehouse environment data structures
• Define the business model:
  • Performing strategic analysis
  • Creating the business model
  • Identifying business rules
• Define the dimensional model:
  • Star dimensional model characteristics
• Define the physical model:
  • Physical model design tasks
  • Architectural and hardware requirements
  • Storage and performance considerations

Practice 3-1 Overview
This practice covers the following topics:
• Specifying true or false for a series of statements
• Completing a series of sentences accurately
• Practicing identifying a simple business model
• Identifying indexing methods

Lesson 4: Building the Data Warehouse: Extracting Data

Objectives
After completing this lesson, you should be able to do the following:
• Outline the ETL (extraction, transformation, and loading) processes for building a data warehouse
• Identify ETL tasks, importance, and cost
• Explain how to examine data sources
• Identify extraction techniques and methods
• Identify analysis issues and design options for extraction processes
• List the selection criteria for ETL tools

Extraction, Transformation, and Loading (ETL) Processes
• Extract source data
• Transform and clean data
• Index and summarize
• Load data into the warehouse
• Detect changes
• Refresh data
(Diagram: Operational systems feed the data warehouse through programs, gateways, and tools)

ETL: Tasks, Importance, and Cost
• Extract, clean up, consolidate, restructure, load, maintain, and refresh
• The result should be relevant, useful, quality, accurate, and accessible data

Extracting Data
• Source systems: data comes from various data sources in various formats.
• Extraction routines:
  • Are developed to select data fields from the sources
  • Consist of business rules, audit trails, and error correction facilities
• Data mapping
(Diagram: Operational databases → data staging area (transform) → warehouse database)

Examining Data Sources
• Production data
• Archive data
• Internal data
• External data

Production Data
• Operating system platforms
• File systems (for example, VSAM)
• Database systems and vertical applications: IMS, DB2, Oracle, Sybase, Informix, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials

Archive Data
• Historical data
• Useful for analysis over long periods of time
• Useful for the first-time load
• May require unique transformations

Internal Data
• Planning, sales, and marketing organization data
• Maintained in the form of:
  • Spreadsheets (structured)
  • Documents (unstructured)
• Treated like any other source data

External Data
• Information from outside the organization
• Issues of frequency, format, and predictability
• Described and tracked using metadata
• Examples: purchased databases (A.C. Nielsen, IRI, IMS, Walsh America, Dun and Bradstreet), competitive information, economic forecasts (Barron's, The Wall Street Journal)

Mapping Data
Mapping data defines:
• Which operational attributes to use
• How to transform the attributes for the warehouse
• Where the attributes exist in the warehouse
(Example: source File A fields F1, F2, F3 — "123", "Bloggs", "10/12/56" — map to Staging File One fields Number "USA123", Name "Mr. Bloggs", DOB "10-Dec-56"; the field-to-field mapping itself is recorded as metadata)

Extraction Techniques
• Programs: C, C++, COBOL, PL/SQL, Java
• Gateways: transparent database access
• Tools:
  • In-house developed tools
  • Vendors' data extraction tools

Extraction Methods
• Logical extraction methods:
  • Full extraction
  • Incremental extraction
• Physical extraction methods:
  • Online extraction
  • Offline extraction

Designing Extraction Processes
• Analysis:
  • Sources and technologies
  • Data types, quality, and owners
• Design options:
  • Manual, custom, gateway, or third-party
  • Replication, full, or delta refresh
• Design issues:
  • Volume and consistency of data
  • Automation, skills needed, and resources

Maintaining Extraction Metadata
• Source location, type, and structure
• Access method
• Privilege information
• Temporary storage
• Failure procedures
• Validity checks
• Handlers for missing data

Extraction Tools: Selection Criteria
• Base functionality
• Interface features
• Metadata repository
• Open API
• Metadata access
• Repository utilities
• Input and output processing
• Cleansing, reformatting, and auditing
• References
• Training requirements

Possible ETL Failures
• A missing source file
• A system failure
• Inadequate metadata
• Poor mapping information
• Inadequate storage planning
• A source structural change
• No contingency plan
• Inadequate data validation

Maintaining ETL Quality
• ETL must be tested, documented, and monitored and reviewed.
• Disparate metadata must be coordinated.
noynot@163.com Summary In this lesson, you should have learned how to: Outline the ETL (Extraction, Transformation, and Loading) processes for building a data warehouse Identify ETL tasks, importance, and cost Explain how to examine data sources Identify extraction techniques and methods Identify analysis issues and design options for extraction processes List the selection criteria for the ETL tools Identify Oracle’s solution for ETL process noynot@163.com Practice 4-1 Overview This practice covers the following topics: Answering a series of short questions Answering questions based on the business scenario for Frontier Airways noynot@163.com Lesson 5 Building the Data Warehouse: Transforming Data Objectives After completing this lesson, you should be able to do the following: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data Describe techniques for transforming data Design transformation process noynot@163.com Transformation Transformation eliminates anomalies from operational data: Cleans and standardizes Presents subject-oriented data Extract Transform: Clean up Consolidate Restructure Warehouse Operational systems Load Data Staging Area noynot@163.com Possible Staging Models Remote staging model Onsite staging model noynot@163.com Remote Staging Model Data staging area within the warehouse environment Transform Extract Operational system Load Staging area Warehouse Data staging area in its own environment Transform Operational system Extract Load Staging area Warehouse noynot@163.com On-site Staging Model Data staging area within the operational environment, possibly affecting the operational system Transform Extract Operational system Load Staging area Warehouse noynot@163.com Data Anomalies No unique key Data naming and coding anomalies Data meaning anomalies between groups Spelling and text inconsistencies CUSNUM NAME ADDRESS 90233479 Oracle Limited 100 N.E. 1st St. 
90233489 Oracle Computing 15 Main Road, Ft. Lauderdale 90234889 Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA 90345672 Oracle Corp UK Ltd 181 North Street, Key West, FLA noynot@163.com Transformation Routines Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating data Transforming data before load noynot@163.com Transforming Data: Problems and Solutions Multipart keys Multiple local standards Multiple files Missing values Duplicate values Element names Element meanings Input formats Referential Integrity constraints Name and address noynot@163.com Multipart Keys Problem Multipart keys Product code = 12 M 654313 45 Salesperson code Country Sales code territory Product number noynot@163.com Multiple Local Standards Problem Multiple local standards Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 noynot@163.com Multiple Files Problem Added complexity of multiple source files Start simple Multiple source files Logic to detect correct source Transformed data noynot@163.com Missing Values Problem Solution: Ignore Wait Mark rows Extract when time-stamped If NULL then field = ‘A’ A noynot@163.com Duplicate Values Problem Solution: SQL self-join techniques RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> 2 3 4 5 6 7 SELECT ... FROM table_a, table_b WHERE table_a.key (+)= table_b.key UNION SELECT ... FROM table_a, table_b WHERE table_a.key = table_b.key (+); noynot@163.com Element Names Problem Solution: Common naming conventions Customer Client Customer Contact Name noynot@163.com Element Meaning Problem Avoid misinterpretation Complex solution Document meaning in metadata Customer’s name All customer details All details except name Customer_detail noynot@163.com Input Format Problem EBCDIC “123-73” ASCII 12373 ACME Co. 
vs. ברוכים הבאים ("Welcome" in Hebrew — character-set conversion problems); Beer (Pack of 8)

Referential Integrity
Problem: child rows reference parents that do not exist.
Solutions: SQL anti-join; server constraints; dedicated tools.

Department        Emp   Name    Department
10                1099  Smith   10
20                1289  Jones   20
30                1234  Doe     50
40                6786  Harris  60

Name and Address
Problem: inconsistent name and address formats.
• Single-field format: Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
• Multiple-field format: Name = Mr. J. Smith; Street = 100 Main St.; Town = Bigtown; Country = County Luth; Code = 23565
• Database 1 (NAME, LOCATION): DIANNE ZIEFELD, N100; HARRY H. ENFIELD, M300
• Database 2 (NAME, LOCATION): ZIEFELD, DIANNE, 100; ENFIELD, HARRY H, 300

Quality Data: Importance and Benefits
Quality data is key to a successful warehouse implementation. Quality data helps you in:
• Targeting the right customers
• Determining buying patterns
• Identifying householders: private and commercial
• Matching customers
• Identifying historical data

Data Quality Guidelines
Operational data:
• Should not be used directly in the warehouse
• Must be cleaned for each increment
• Is not simply fixed by modifying applications

Transformation Techniques
• Merging data
• Adding a date stamp
• Adding keys to data

Merging Data
Operational transactions do not usually map one-to-one with warehouse data. Data for the warehouse is merged to provide information for analysis.
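This merge step can be sketched in Python using the pizza sales/returns example; the function name and row layout here are ours for illustration, not part of the course's Oracle tooling:

```python
from collections import Counter

def merge_transactions(rows):
    """Net out sale/return pairs so only completed sales reach the warehouse.

    Each row is (kind, date, time, product, amount); a return carries a
    negative amount that cancels one matching sale of the same product.
    """
    returns = Counter()
    for kind, _, _, product, amount in rows:
        if kind == "Return":
            returns[(product, -amount)] += 1
    merged = []
    for row in rows:
        kind, _, _, product, amount = row
        if kind == "Return":
            continue
        if returns[(product, amount)] > 0:
            returns[(product, amount)] -= 1  # cancelled by a return
            continue
        merged.append(row)
    return merged

rows = [
    ("Sale",   "1/2/02", "12:00:01", "Ham Pizza",     10.00),
    ("Sale",   "1/2/02", "12:00:02", "Cheese Pizza",  15.00),
    ("Sale",   "1/2/02", "12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "1/2/02", "12:00:03", "Anchovy Pizza", -12.00),
    ("Sale",   "1/2/02", "12:00:04", "Sausage Pizza", 11.00),
]
print([r[3] for r in merge_transactions(rows)])
# ['Ham Pizza', 'Cheese Pizza', 'Sausage Pizza']
```

In practice this logic would usually run in the staging area as SQL or PL/SQL; the sketch only shows the matching idea.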
Pizza sales/returns by day, hour, and second:

Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00

After merging, the anchovy sale and its return cancel out, leaving the pizza sales:

Sale  1/2/02  12:00:01  Ham Pizza      $10.00
Sale  1/2/02  12:00:02  Cheese Pizza   $15.00
Sale  1/2/02  12:00:04  Sausage Pizza  $11.00

Adding a Date Stamp
The time element can be represented as a:
• Single point in time
• Time span
Add the time element to:
• Fact tables
• Dimension data

Adding a Date Stamp: Fact Tables and Dimensions
• Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
• Product Table: Product_id, Time_key, Product_desc
• Time Table: Week_id, Period_id, Year_id, Time_key
• Store Table: Store_id, District_id, Time_key
• Item Table: Item_id, Dept_id, Time_key

Adding Keys to Data
Source rows with operational keys:

#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Warehouse rows keyed by data values or artificial keys:

#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Summarizing Data
Summaries can be built:
1. During extraction, in the staging area
2.
After loading, on the warehouse server
Flow: Operational databases → Staging area → Warehouse database

Maintaining Transformation Metadata
Transformation metadata contains:
• Transformation rules
• Algorithms and routines
• Sources
Flow: Sources → Stage → Rules → Extract → Transform → Load → Publish → Query

Maintaining Transformation Metadata
• Restructure keys
• Identify and resolve coding differences
• Validate data from multiple sources
• Handle exception rules
• Identify and resolve format differences
• Fix referential integrity inconsistencies
• Identify summary data

Data Ownership and Responsibilities
Data ownership and responsibilities should be shared by the:
• Operational team
• Data warehouse team
Business benefit is gained with a "work together" approach.

Transformation Timing and Location
Transformation is performed:
• Before load
• In parallel
It can be initiated at different points:
• On the operational platform
• In a separate staging area

Choosing a Transformation Point
• Workload
• Impact on environment
• CPU usage
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs

Monitoring and Tracking
Transformations should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions

Designing Transformation Processes
Analysis:
• Sources and target mappings, business rules
• Key users, metadata, grain
Design options:
• Third-party tools
• Custom 3GL programs (FORTRAN, COBOL, C, C++, Java)
• 4GLs such as SQL or PL/SQL
• Replication
Design issues:
• Performance
• Size of the staging area
• Exception handling, integrity maintenance

Transformation Tools
• Third-party tools
• SQL*Loader
• In-house developed programs

Summary
In this lesson, you should have learned how to:
• Define transformation
• Identify possible staging models
• Identify data anomalies and eliminate them
• Explain the importance of quality data
• Describe techniques for transforming data
• Design the transformation process

Practice 5-1 Overview
This practice covers the following topics:
• Answering a series of questions based on the business scenario for Frontier Airways
• Answering a series of short questions

Lesson 6: Building the Data Warehouse: Loading Warehouse Data

Objectives
After completing this lesson, you should be able to do the following:
• Explain key concepts in loading warehouse data
• Outline how to build the loading process for the initial load
• Identify loading techniques
• Describe the loading techniques provided by Oracle
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes

Loading Data into the Warehouse
• Loading moves the data into the warehouse.
• Loading can be time-consuming:
  - Consider the load window
  - Schedule and automate the loading
• The initial load moves large volumes of data.
• Subsequent refreshes move smaller volumes of data.
Flow: Operational databases → Extract → Staging area → Transform → Transport, Load → Warehouse database

Initial Load and Refresh
Initial load:
• A single event that populates the database with historical data
• Involves large volumes of data
• Employs distinct ETL tasks
• Involves large amounts of processing after load
Refresh:
• Performed according to a business cycle
• Less data to load than the first-time load
• Less-complex ETL tasks
• Smaller amounts of post-load processing

Data Refresh Models: Extract Processing Environment
• After each time interval, build a new snapshot of the database.
• Purge old snapshots.
(Operational databases feed snapshots T1, T2, T3.)

Data Refresh Models: Warehouse Processing Environment
• Build a new database.
• After each time interval, add changes to the database.
• Archive or purge the oldest data.
(Operational databases feed intervals T1, T2, T3.)

Building the Loading Process
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth

Building the Loading Process
• Test the proposed technique.
• Document the proposed load.
• Monitor, review, and revise.

Data Granularity
An important design and operational issue:
• Low-level grain: expensive, high level of processing, more disk space, more detail
• High-level grain: cheaper, less processing, less disk space, little detail

Loading Techniques
• Tools
• Utilities and 3GL
• Gateways
• Customized copy programs
• Replication
• FTP
• Manual

Loading Technique Considerations
• Tools are comprehensive, but costly.
• Data-movement utilities are fast and powerful.
• Gateways are suitable for specific instances:
  - Access other databases
  - Supply dependent data marts
  - Support a distributed environment
  - Provide real-time access if needed
• Use customized programs as a last resort.
• Replication is limited by data-transfer rates.

Post-Processing of Loaded Data
Flow: Extract → Staging area (Transform) → Load → Warehouse
Post-processing of loaded data: create indexes, generate keys, summarize, filter.

Indexing Data
• Before load: enable indexes at the server.
• During load: adds time to the load window; a row-by-row approach.
• After load: adds time to the load window, but faster than the row-by-row approach.

Unique Indexes
• Disable constraints before load.
• Enable constraints after load.
• Re-create the index if necessary.
Sequence: disable constraints → load data → enable constraints → create index → catch errors → reprocess.

Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
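Such a derived key can be built, for example, by concatenating the operational key with a version number, or by drawing from a sequence — a minimal sketch (the function name and formatting are ours, not Oracle's):

```python
import itertools

def derived_key(operational_key, version):
    """Concatenate the operational key with a version number, so key
    109908 yields 109908_01, 109908_02, and so on."""
    return f"{operational_key}_{version:02d}"

# Alternatively, assign numbers sequentially from a list or sequence:
sequence = itertools.count(100)

print(derived_key(109908, 1))  # 109908_01
print(next(sequence))          # 100
```

In an Oracle warehouse the sequential variant would typically be a database SEQUENCE rather than application code.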
Methods:
• Concatenate the operational key with a number (109908 → 109908 01)
• Assign a number sequentially from a list (109908 → 100)

Summary Management
• Summary tables
• Materialized views
• Summary data

Filtering Data
Summary data is filtered from the warehouse to the data marts.

Verifying Data Integrity
• Load data into an intermediate file.
• Compare target flash totals with the totals before load (counts and amounts = flash totals).
• Preserve, inspect, and fix the data, then load.

Steps for Verifying Data Integrity
(Numbered flow: source files are extracted, loaded with SQL*Loader under a control file into the target, with rejected and logged rows written to .bad and .log files.)

Standard Quality Assurance Checks
• Load status
• Completion of the process
• Completeness of the data
• Data reconciliation
• Referential integrity violations
• Reprocessing
• Comparison of counts and amounts

Summary
In this lesson, you should have learned how to:
• Explain key concepts in loading data into the warehouse
• Outline how to build the loading process for the initial load
• Identify loading techniques
• Describe the loading techniques provided by Oracle
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes

Practice 6-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Answering questions based on the business scenario for Frontier Airways

Lesson 7: Refreshing Warehouse Data

Objectives
After completing this lesson, you should be able to do the following:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Describe the Change Data Capture mechanism and refresh mechanisms supported in Oracle9i
• Describe techniques for purging and archiving data, and outline the techniques supported by Oracle
• Outline final tasks, such as publishing the data, controlling access,
and automating processes
• List the selection criteria for choosing ETL tools

Developing a Refresh Strategy for Capturing Changed Data
• Consider the load window.
• Identify data volumes.
• Identify the cycle.
• Know the technical infrastructure.
• Plan a staging area.
• Determine how to detect changes.

User Requirements and Assistance
• Users define the refresh cycle.
• IT balances requirements against technical issues.
• Document all tasks and processes.
• Employ user skills.

Load Window Requirements
The time available for the entire ETL process: plan, test, prove, monitor.
(Timeline: load windows overnight, with the user access period between them.)

Planning the Load Window
• Plan and build processes according to a strategy.
• Consider volumes of data.
• Identify the technical infrastructure.
• Ensure currency of data.
• Consider user access requirements first.
• High-availability requirements may mean a small load window.

Scheduling the Load Window
1. Receive data by FTP (by 3 a.m.).
2. Check the requirements for the load cycle.
3. Read the control file: file names, file types, number of files, number of loads, first-time load or refresh, date of file, date range, record counts, and amount totals.
4. Run the control process: open and read the files to verify and analyze them.

Scheduling the Load Window (3 a.m. – 9 a.m.)
5. Load into the warehouse (parallel load).
6. Verify, analyze, and reapply.
7. Index the data.
8. Create summaries.
9. Update metadata.

Scheduling the Load Window (by 9 a.m.)
10. Back up the warehouse (by 6 a.m.).
11. Create views for specialized tools.
12. Users access summary data.
13. Publish.
(User access begins.)

Capturing Changed Data for Refresh
• Capture new fact data.
• Capture changed dimension data.
• Determine the method of capture in each case.
Methods:
• Wholesale data replacement
• Comparison of database instances
• Time stamping
• Database triggers
• Database log

Wholesale Data Replacement
• Expensive
• Useful for data marts with less data
• Only limited historical data analysis is possible
• The time period often exceeds the load window
• Mirroring techniques can be used to provide access to the users

Comparison of Database Instances
• Delta file:
  - Holds the changes to operational data since the last refresh
  - Is used to update the warehouse
• Simple to perform, but expensive in terms of time and processing
• Efficient for smaller volumes of data
(Yesterday's operational database is compared with today's; the delta file holds the changed data.)

Time and Date Stamping
• Fast scanning for records changed since the last refresh cycle
• Useful for data with an updated-date field
• No detection of deleted data
(The delta file holds changed data selected by time stamp.)

Database Triggers
• Changed data is intercepted at the server level.
• Extra I/O is required.
• Maintenance overhead.
(Triggers on the operational server (RDBMS) populate the delta file.)

Using a Database Log
• Contains before and after images
• Requires a system checkpoint
• A common technique
(The log is analyzed and the changed data is extracted into the delta file.)

Choosing a Method for Change Data Capture
• Consider each method on its merits.
• Consider a hybrid approach if one approach is not suitable.
• Consider current technical, operational, and application issues.
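The instance-comparison method can be sketched as a diff of two snapshots into a delta file; the function, keys, and sample rows below are hypothetical illustrations, not Oracle's Change Data Capture API:

```python
def build_delta(yesterday, today):
    """Compare two snapshots keyed by primary key and emit a delta of
    inserts, updates, and deletes since the last refresh."""
    delta = {"insert": [], "update": [], "delete": []}
    for key, row in today.items():
        if key not in yesterday:
            delta["insert"].append((key, row))
        elif yesterday[key] != row:
            delta["update"].append((key, row))
    for key in yesterday:
        if key not in today:
            delta["delete"].append(key)
    return delta

yesterday = {42135: ("John Doe", "Single"), 42136: ("Ann Lee", "Married")}
today     = {42135: ("John Doe", "Married"), 42137: ("Bo Chan", "Single")}
delta = build_delta(yesterday, today)
print(delta["update"])  # [(42135, ('John Doe', 'Married'))]
print(delta["insert"])  # [(42137, ('Bo Chan', 'Single'))]
print(delta["delete"])  # [42136]
```

Note the trade-off the slides describe: the comparison is simple but touches every row of both snapshots, which is why it suits smaller data volumes.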
Applying the Changes to Data
You have a choice of techniques:
• Overwrite a record
• Add a record
• Add a field
• Maintain history
• Add version numbers

Overwriting a Record
• Easy to implement
• Loses all history
• Not recommended
(42135 John Doe Married overwrites 42135 John Doe Single.)

Adding a New Record
• History is preserved; dimensions grow.
• Time constraints are not required.
• A generalized key is created.
• Metadata tracks the usage of keys.
(42135 John Doe Single; 42135_01 John Doe Married)

Adding a Current Field
• Maintains some history
• Loses intermediate values
• Is enhanced by adding an effective-date field
(42135 John Doe Single becomes 42135 John Doe Single Married 1-Jan-01.)

Limitations of Methods for Applying Changes
• History is difficult to maintain.
• Dimensions may grow large.
• Maintenance overhead.

Maintaining History: Techniques
• History tables
• One-to-many relationships
• Versioning
• Preserve complete history

Maintaining History: Techniques
History tables:
• Normalize dimensions
• Hold current and historical data
One-to-many relationships:
• One current record and many history records
(HIST_CUST linked to CUSTOMER in a star of Time, Customer, Sales, and Product.)

Versioning
• Avoid double counting.
• Facts hold a version number.

Customer.CustId  Version  Customer Name
1234             1        Comer
1234             2        Comer

Sales.CustId     Version  Sales Facts
1234             1        $11,000
1234             2        $12,000

Preserve Complete History
Complete history:
• Enables realistic historical analysis
• Retains the context of data
The model must be able to:
• Reflect business changes
• Maintain context between fact and dimension data
• Retain sufficient data to relate old to new

Purging and Archiving Data
As data ages, its value depreciates.
Remove old data from the warehouse:
• Archive it for later use (if needed)
• Purge it without a copy

Final Tasks
• Update metadata.
• Publish the data.
• Use database roles to control access to the warehouse.
Flow: Sources → Stage → Rules → Extract → Transform → Load → Publish → Query

Publishing Data
• Control access using database roles.
• Compromise between load action and user access.
• Consider:
  - Staggering updates
  - Using temporary tables
  - Using separate tables

ETL Tools: Selection Criteria
• Overlap with existing tools
• Availability of a meta model
• Supported data sources
• Ease of modification and maintenance
• Required fine-tuning of code
• Ease of change control
• Power of transformation logic
• Level of modularization
• Power of error, exception, and resubmission features
• Intuitive documentation
• Performance of code

ETL Tool Selection Criteria
• Activity scheduling and sophistication
• Metadata generation
• Learning curve
• Flexibility
• Supported operating systems
• Cost

Summary
In this lesson, you should have learned how to:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Describe the Change Data Capture mechanism and refresh mechanisms supported in Oracle9i
• Describe techniques for purging and archiving data, and outline the techniques supported by Oracle
• Outline final tasks, such as publishing the data, controlling access, and automating processes
• List the selection criteria for choosing ETL tools

Practice 7-1 Overview
This practice covers the following topics:
• Answering a series of questions based on the business scenario for Frontier Airways
• Answering a series of short questions

Lesson 8: Leaving a Metadata Trail

Objectives
After completing this lesson, you should be able to do the following:
• Define warehouse metadata, its types, and its role in a warehouse environment
• Examine each type of warehouse metadata
• Develop a metadata strategy
• Outline the Common Warehouse Metamodel (CWM)
Defining Warehouse Metadata
• Data about warehouse data and processing
• Vital to the warehouse
• Used by everyone
• The key to understanding warehouse information

Metadata Users
End users, developers, and IT professionals all share the metadata repository.

Types of Metadata
End-user metadata:
• Key to a good warehouse
• Navigation aid
• Information provider
ETL metadata:
• Maps structure
• Source and target information
• Transformations
• Context
Operational metadata:
• Load, management, and scheduling processes
• Performance

Examining Types of Metadata
(ETL metadata and end-user metadata share a repository linking external sources, operational data sources, the ETL process, the warehouse, and end users.)

Examining Metadata: ETL Metadata
• Business rules
• Source tables, fields, and key values
• Ownership
• Field conversions
• Encoding and reference tables
• Name changes
• Key value changes
• Default values
• Logic to handle multiple sources
• Algorithms
• Time stamp

Extraction Metadata
• Space and storage requirements
• Source location information
• Diverse source data
• Access information
• Security
• Contacts
• Program names
• Frequency details
• Failure procedures
• Validity-checking information

Transformation Metadata
• Duplication routines
• Exception handling
• Key restructuring
• Grain conversions
• Program names
• Frequency
• Summarization

Loading Metadata
• Method of transfer
• Frequency
• Validation procedures
• Failure procedures
• Deployment rules
• Contact information

Examining Metadata: End-User Metadata
• Location of facts and dimensions
• Availability
• Description of contents and the algorithms used for derived and summary data
• Data ownership details

End-User Metadata: Context
• Users need to know the context of the table they query.
• Associate the metadata with its description.

Example of End-User Metadata

Table    Column      Data    Meaning
Product  Prod_ID     739516  Unique identifier for the product
Product  Valid_date  01/97   Last refresh date
Product  Ware_loc    1816    Warehouse location number
Product  Ware_bin    666     Warehouse bin number
Product  Code        15      The color of the product; refer to table COL_REF for details
Product  Weight      17.62   Packed shipping weight in kilograms

Historic Context of Data
• Supports change history
• Maintains the context of information
(Timeline: 1994–1998)

Types of Context
Simple:
• Data structures
• Naming conventions
• Metrics
Complex:
• Product definitions
• Markets
• Pricing
External:
• Economic
• Political

Developing a Metadata Strategy
Define a strategy to ensure high-quality metadata that is useful to users and developers. Primary strategy considerations:
• Define goals and intended use
• Identify target users
• Choose tools and techniques
• Choose the metadata location
• Manage the metadata
• Manage access to the metadata
• Integrate metadata from multiple tools
• Manage change

Defining Metadata Goals and Intended Usage
• Define clear goals.
• Identify requirements.
• Identify intended usage.

Identifying Target Metadata Users
• Who are the metadata users?
  - Developers
  - End users
• What information do they need?
• How will they access the metadata?
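One way to picture what end users get from the repository is a lookup keyed by table and column, as in the Product example above; this dict-based structure is a toy illustration, not an Oracle repository API:

```python
# A toy end-user metadata repository keyed by (table, column).
repository = {
    ("Product", "Prod_ID"): "Unique identifier for the product",
    ("Product", "Code"):    "The color of the product; refer to table COL_REF",
    ("Product", "Weight"):  "Packed shipping weight in kilograms",
}

def describe(table, column):
    """End users look up the meaning of a warehouse column before querying."""
    return repository.get((table, column), "No metadata recorded")

print(describe("Product", "Weight"))  # Packed shipping weight in kilograms
```

The point is the access pattern: users consult the description of a column (its meaning, units, and refresh date) before trusting query results built on it.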
Choosing Metadata Tools and Techniques
• Tools:
  - Data modeling
  - ETL
  - End user (query and analysis)
• Database schema definitions
• COBOL copybooks
• Middleware tools

Choosing the Metadata Location
• Usually the warehouse server
• Possibly on operational platforms
• Desktop tool with a metalayer

Managing the Metadata
• Managed by the metadata manager
• Maintained by the metadata architect
• Standards should be followed

Integrating Multiple Sets of Metadata
• Multiple tools may generate their own metadata.
• These metalayers should be properly integrated.
• Metadata exchangeability is desirable.

Managing Changes to Metadata
• Different types of metadata have different rates of change.
• Consider metadata changes resulting from refresh cycles.

Additional Metadata Content and Considerations
• Summarization algorithms
• Relationships
• Stewardship
• Permissions
• Pattern analysis
• Reference tables

Common Warehouse Metamodel
(Any source — ERP, operational, external — flows through data integration into the warehouse and marts, then through information delivery to any access method: reporting, ad hoc query and analysis, and data mining. Design, administration, and analytic applications all share the CWM metadata repository.)

Summary
In this lesson, you should have learned how to:
• Define warehouse metadata, its types, and its role in a warehouse environment
• Examine each type of warehouse metadata
• Develop a metadata strategy
• Outline the Common Warehouse Metamodel (CWM)

Practice 8-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Answering questions based on the business scenario for Frontier Airways

Lesson 9: Managing and Maintaining the Data Warehouse

Objectives
After completing this lesson, you should be able to do the following:
• Develop a plan for managing the transition from development to implementation
• Identify challenges pertaining to the growth of the data warehouse
• Describe backup and archive
mechanisms
• Identify data warehouse performance issues

Managing the Transition to Production
• Promoting support for change
• Pilot versus large-scale implementation
• Documentation
• Testing
• Training
• Post-implementation support
• Maintaining the warehouse

Promoting Support for the Data Warehouse
• Awareness
• Feedback
• Information
• Skills
• Education
• Direction
• Control

Choosing Between Pilot and Large-Scale Implementation

The Warehouse Pilot
Demonstrates benefits to:
• Management
• Users
• IT staff
A good pilot is:
• Relevant to the business
• Low in technical risk
• Small and feasible
• Prepared for increased use
• Focused on an initial business issue
• Kept in context

Piloting the Warehouse
Designers:
• Prove the model, data, and access tools
Users:
• Prove the ease of use of the tool
• Check data and query performance
• Identify training requirements
Developers:
• Resolve ETL and metadata issues
• Determine users' data and training requirements
• Test security and access levels; monitor performance

Documentation
Produces textual deliverables:
• Glossary
• User and technical documentation
• Online help
• Metadata reference guide
• Warehouse management reference
• New features guide

Testing the Warehouse
• Test every stage.
• Use a realistic test database and environment.

Training
Users:
• Metadata
• DSS tools
• Ad hoc queries
• Getting help
• Registering enhancement requests
Information systems developers:
• Analysis techniques
• Hardware technicalities
• Networking
• Implementing, building, and supporting DSS

Post-Implementation Support
Evaluate and review the implementation.
Monitor the warehouse:
• Respond to problems
• Conduct performance tuning
• Roll out metadata, queries, reports, filters, and conditions
• Implement security
• Incorporate new users
• Distribute data marts and catalogs
• Transfer ownership from IT

Monitoring the Success of the Data Warehouse
(Chart: number of users over the period after implementation — initial, 3 months, 6 months, 12 months, 24 months.)

Measuring the Success of the Data Warehouse
Metrics may include:
• Availability
• Response time
• Response to problems

Managing Growth
• Increasing number of users
• Broader usage
• Growth of data volumes

Expansion and Adjustment
Evaluate continually:
• Changes
• New increments
• Unnecessary components
• Strategies
Ensure an open environment. Document development processes for the future:
• Planning
• Cost analysis
• Problem assessment and correction
• Performance assessment

Controlling Expansion
• Ensure the continuity of staff.
• Document processes, solutions, and metrics.
• Establish a working test and production architecture for further increments.
• Create a strategy for maintaining changes to data.

Sizing Storage
• Consider different methods and determine the best one for your needs.
• Know the business requirements.
• Do not underestimate requirements.
• Plan for growth.
• Consider space for unwanted data.
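A back-of-envelope sizing calculation can make these guidelines concrete. All numbers here are hypothetical, and the 3–4x overhead multiplier is a commonly quoted rule of thumb for indexes, summaries, and sort space, not an Oracle figure:

```python
def estimate_storage_gb(rows_per_day, bytes_per_row, years, overhead_factor=3.5):
    """Back-of-envelope warehouse sizing: base fact-table volume times an
    assumed overhead factor covering indexes, summaries, and temp space."""
    base_bytes = rows_per_day * bytes_per_row * 365 * years
    return base_bytes * overhead_factor / 1024**3

# Hypothetical: 1M fact rows/day at 100 bytes each, kept for 3 years.
print(round(estimate_storage_gb(1_000_000, 100, 3)))  # 357 (GB)
```

Estimates like this are a starting point for the "plan for growth" guideline: the base fact volume is easy to compute, but the multiplier dominates the final figure and should be revisited as the design firms up.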
Estimating Storage
• Fact volumes
• Fact lifetime
• Technology availability
• Technology purchase
• Storing presummarized data
• Mirroring or other techniques requiring disk storage

Objects That Need Space
• ODS
• Indexes and metadata
• Summary data
• Redo logs
• Rollback information
• Sort areas
• Temporary space
• Workspace for backup and recovery

Other Considerations and Techniques
• Queuing models
• Rule of thumb: total database size is three to four times the size of the base fact tables
• Consider:
  - Sparseness
  - Dimensions
  - Indexes
  - Summaries
  - Sort and operational space

Space Management
• Monitor.
• Avoid fragmentation.
• Test load data.
• Plan for growth.
• Know business patterns.
• Never let space become an issue.

Archiving Data
• Determine data life expectancy.
• Identify archive frequency.
• Use read-only tablespaces.
• Include archiving in early specifications.

Purging Data
Reduce data volumes:
• Create summaries
• Remove unwanted base data
Choose the most effective method.

Identifying Data Warehouse Performance Issues
Improving query efficiency:
• Use indexes.
• Use query governors.
• Run large jobs out of hours.
• Consider a data mart approach.
Improving network performance:
• Provide sufficient bandwidth and optimize the configuration for access.
• Analyze traffic.
• Deploy data marts at remote locations.

Review and Revise
Monitor the warehouse:
• Usage
• Access
• Accurate grain
• Detail data
• Periodicity

Secret of Success
Think big; start small!

Course Summary
In this course, you should have learned that the successful warehouse:
• Is driven by the business
• Focuses on objectives
• Adds value to the business
• Can be understood and used
• Delivers good data
• Performs well
• Belongs to the users

Sample Study
Note: we discuss the creation of data marts, rather than the perhaps more familiar term, data warehouse.
Data warehouses tend to be large, one-stop-shopping repositories where all the historical data for the organization is stored. Nothing is wrong with this as a concept; however, attempting to create a data warehouse has often led to huge, multiyear technology projects that were never quite finished, or were outdated by the time they were done.

Data Mart Structure

Measures
A measure is a numeric quantity expressing some aspect of the organization's performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. A measure can also be called a fact; the tables that hold measure information are known as fact tables.

Dimensions
A dimension is a categorization used to spread out an aggregate measure to reveal its constituent parts. Dimensions are used to facilitate this slicing and dicing.

The Star Schema
A star schema is a relational database schema used to hold measures and dimensions in a data mart. The measures are stored in a fact table, and the dimensions are stored in dimension tables.

Attributes
An attribute is an additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member.

Hierarchies
A hierarchy is a structure made up of two or more levels of related dimensions. A dimension at an upper level of the hierarchy completely contains one or more dimensions from the next lower level of the hierarchy.

The Snowflake Schema
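To make the star-schema idea concrete — measures in a fact table, sliced by joining to a dimension — here is a minimal sketch; the table contents and names are ours for illustration:

```python
# Minimal star-schema sketch: a fact table of measures keyed by dimension
# ids, and a product dimension used to slice the aggregate.
product_dim = {1: "Ham Pizza", 2: "Cheese Pizza", 3: "Sausage Pizza"}
sales_fact = [  # (product_id, store_id, sales_dollars)
    (1, 10, 10.00), (2, 10, 15.00), (3, 20, 11.00), (2, 20, 15.00),
]

def sales_by_product():
    """Spread the total-sales measure out over the product dimension."""
    totals = {}
    for product_id, _, dollars in sales_fact:
        name = product_dim[product_id]
        totals[name] = totals.get(name, 0.0) + dollars
    return totals

print(sales_by_product())
# {'Ham Pizza': 10.0, 'Cheese Pizza': 30.0, 'Sausage Pizza': 11.0}
```

In SQL terms this is a join from the fact table to a dimension table followed by GROUP BY; slicing on store instead of product would use the other dimension key in the same way.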