Data Warehousing Fundamentals
noynot@163.com

Course Objectives
After completing this course, you should be able to do the following:
• Describe the role of business intelligence (BI) and data warehousing in today's marketplace
• Describe data warehousing terminology and the various technologies that are required to implement a data warehouse
• Explain the implementation and organizational issues surrounding a data warehouse project
• Identify data warehouse modeling concepts
• Explain the extraction, transformation, and loading processes for building a data warehouse
• Identify management and maintenance processes that are associated with a data warehouse project
• Describe methods for refreshing warehouse data
• Explain warehouse metadata concepts
• Identify tools that can be employed at each stage of the data warehouse project
• Describe user profiles and techniques for querying the warehouse
• Identify methods and tools for accessing and analyzing warehouse data

Lessons
1. Business Intelligence and Data Warehousing
2. Defining Data Warehouse Concepts and Terminology
3. Planning and Managing the Data Warehouse Project
4. Modeling the Data Warehouse
5. Building the Data Warehouse: Extracting Data
6. Building the Data Warehouse: Transforming Data
7. Building the Data Warehouse: Loading Warehouse Data
8. Refreshing Warehouse Data
9. Leaving a Metadata Trail
10. Managing and Maintaining the Data Warehouse

Lesson 1: Let's Get Started

Lesson 1 Objectives
After completing this lesson, you should be able to do the following:
• Describe the role of business intelligence in today's marketplace
• Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
• Describe how extract processing for decision support querying led to the data warehouse solutions that are employed today
• Explain why businesses are driven to employ data warehouse technology

What Is Business Intelligence?
"Business Intelligence is the process of transforming data into information and through discovery transforming that information into knowledge." — Gartner Group

Purpose of Business Intelligence
The purpose of business intelligence is to convert the volume of data into business value through analytical reporting.
(Diagram: Data → Information → Knowledge → Decision; Volume → Value)

Early Management Information Systems
• MIS systems provided business data.
• Reports were developed on request.
• Reports provided little analysis capability.
• Decision support tools gave personal ad hoc access to data.
(Diagram: Production platforms → Operational reports → Ad hoc access → Decision makers)

Analyzing Data from Operational Systems
• Data structures are complex.
• Systems are designed for high performance and throughput.
• Data is not meaningfully represented.
• Data is dispersed.
• OLTP systems may be unsuitable for intensive queries.
(Diagram: Production platforms → Operational reports)

Why OLTP Is Not Suitable for Analytical Reporting
OLTP | Analytical Reporting
Information to support day-to-day service | Historical information to analyze
Data stored at the transaction level | Data needs to be integrated
Database design: normalized | Database design: denormalized, star schema

Data Extract Processing
End-user computing is offloaded from the operational environment.
(Diagram: Operational systems → Extracts → User's own data → Decision makers)

Management Issues with Data Extract Programs
(Diagram: Operational systems → Extracts → Decision makers; the "extract explosion")

Productivity Issues with Extract Processing
• Duplicated effort
• Multiple technologies
• Obsolete reports
• No metadata

Data Quality Issues with Extract Processing
• No common time basis
• Different calculation algorithms
• Different levels of extraction
• Different levels of granularity
• Different data field names
• Different data field meanings
• Missing information
• No data correction rules
• No drill-down capability

Data Warehousing and Business Intelligence
(Diagram: Legacy data, operations data, and external data feed the enterprise data warehouse, which feeds the data marts and analytical reporting)

Advantages of Warehouse Processing Environments
• Controlled, reliable, quality information
• Single source of data
(Diagram: Internal and external systems → Data warehouse → Decision makers)
• No duplication of effort
• No need for tools to support many technologies
• No disparity in data, meaning, or representation
• No time period conflict
• No algorithm confusion
• No drill-down restrictions

Success Factors for a Dynamic Business Environment
• Know the business
• Reinvent to face new challenges
• Invest in products
• Invest in customers
• Retain customers
• Invest in technology
• Improve access to business information
• Provide superior services and products
• Be profitable

Business Drivers for Data Warehouses
• Provide supporting information systems
• Get quality information:
  • Reduce costs
  • Streamline the business
  • Improve margins

Technological Advances Enabling Data Warehousing
Hardware, operating system, database, query tools, and applications:
• Large databases
• 64-bit architectures
• Indexing techniques
• Affordable, cost-effective open systems
• Robust warehouse tools
• Sophisticated end-user tools

The Difference Between the Two Kinds of Data

Summary
In this lesson, you should have learned how to:
• Describe the role of business intelligence in today's marketplace
• Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
• Describe how extract processing for decision support querying led to the data warehouse solutions employed today
• Explain why businesses are driven to employ data warehouse technology

Practice 1-1 Overview
This practice covers the following topics:
• Answering questions about data warehousing
• Discussing how data warehousing meets business needs

Lesson 2: Defining Data Warehouse Concepts and Terminology

Objectives
After completing this lesson, you should be able to do the following:
• Identify a common, broadly accepted definition of a data warehouse
• Describe the differences between dependent and independent data marts
• Identify some of the main warehouse development approaches
• Recognize some of the operational properties and common terminology of a data warehouse

Definition of a Data Warehouse
"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions." — W. H. Inmon, Building the Data Warehouse (1991)

"…a data warehouse is nothing more than the union of all the data marts…" — Ralph Kimball

"A data warehouse is the implementation of an informational database used to store sharable data sourced from operational databases. It is typically a subject database that enables users to tap into a vast store of operational data to track and respond to business trends and to support business forecasting and planning." — DM Review

"An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data." — Oracle's data warehouse definition

"A data warehouse is a process, not a project." — another way of describing a data warehouse

Data Warehouse Properties
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile

Subject-Oriented
Data is categorized and stored by business subject rather than by application.
(Diagram: OLTP applications such as equity plans, shares, insurance, loans, and savings map to the warehouse subject "customer financial information")

Integrated
Data on a given subject is defined and stored once.
(Diagram: The savings, current accounts, and loans OLTP applications map to a single "customer" subject in the warehouse)

Time-Variant
Data is stored as a series of snapshots, each representing a period of time.

Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
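The time-variant and nonvolatile properties can be illustrated with a minimal sketch (SQLite via Python here; the table, column names, and data values are invented for illustration): each load appends a dated snapshot, and normal warehouse processing only inserts and reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Time-variant: each row carries the snapshot period it describes.
# Nonvolatile: the warehouse only INSERTs and SELECTs; operational-style
# UPDATE/DELETE is not part of normal warehouse processing.
cur.execute("""
    CREATE TABLE account_balance_snapshot (
        snapshot_month TEXT,    -- e.g. '2002-01'
        account_id     INTEGER,
        balance        REAL
    )
""")

# First-time load, then periodic refreshes append new snapshots.
cur.executemany(
    "INSERT INTO account_balance_snapshot VALUES (?, ?, ?)",
    [("2002-01", 1, 500.0),
     ("2002-02", 1, 650.0)],
)

# Because history is preserved, period-over-period analysis is possible.
cur.execute("""
    SELECT snapshot_month, balance
    FROM account_balance_snapshot
    WHERE account_id = 1
    ORDER BY snapshot_month
""")
print(cur.fetchall())   # [('2002-01', 500.0), ('2002-02', 650.0)]
```

An operational system would instead overwrite the single current balance; keeping every snapshot is what makes trend analysis possible.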
(Diagram: The operational side supports insert, update, delete, or read; the warehouse side supports load and read only)

Changing Warehouse Data
(Diagram: Operational databases → warehouse database: a first-time load, followed by periodic refreshes, and eventually purge or archive)

Data Warehouse Versus OLTP
Property | OLTP | Data Warehouse
Response time | Subseconds to seconds | Seconds to hours
Operations | DML | Primarily read-only
Nature of data | 30–60 days | Snapshots over time
Data organization | Application | Subject, time
Size | Small to large | Large to very large
Data sources | Operational, internal | Operational, internal, external
Activities | Processes | Analysis

Usage Curves
• Operational system usage is predictable.
• Data warehouse usage is variable and random.

Enterprisewide Warehouse
• Large-scale implementation
• Scopes the entire business
• Data from all subject areas
• Developed incrementally
• Single source of enterprisewide data
• Synchronized enterprisewide data
• Single distribution point to dependent data marts

Central Ideas of Data Warehouse Design
• Choose a grain, or level of detail, appropriate to satisfy all the data marts.
• The design must not prevent the data marts from using a variety of technologies; it must accommodate multidimensional marts, statistics, mining, and exploration warehouses.

Data Marts
A data mart is a subset of the data warehouse data. Most analytical activity in a BI environment takes place in the data marts. The data in each data mart is usually tailored to a particular function and need not be valid for other uses.

Data Warehouses Versus Data Marts
Property | Data Warehouse | Data Mart
Scope | Enterprise | Department
Subjects | Multiple | Single subject, line of business (LOB)
Data sources | Many | Few
Implementation time | Months to years | Months

Dependent Data Mart
(Diagram: Operational systems — legacy data, flat files, operations data, external data — feed the data warehouse, which in turn feeds the marketing, sales, finance, and HR data marts; external data may also feed the marts directly)

Independent Data Mart
(Diagram: Operational systems and external data feed a stand-alone sales or marketing data mart directly, without an intervening warehouse)

Features of a Data Mart
• Not real-time data
• Consolidation and cleansing

Warehouse Development Approaches
• "Big bang" approach
• Incremental approach:
  • Top-down incremental approach
  • Bottom-up incremental approach

"Big Bang" Approach
1. Analyze enterprise requirements.
2. Build the enterprise data warehouse.
3. Report in subsets or store in data marts.

Top-Down Approach
1. Analyze requirements at the enterprise level.
2. Develop a conceptual information model.
3. Identify and prioritize subject areas.
4. Complete a model of the selected subject area.
5. Map to available data.
6. Perform a source system analysis.
7. Implement the base technical architecture.
8. Establish metadata, extraction, and load processes for the initial subject area.
9. Create and populate the initial subject area data mart within the overall warehouse framework.

Bottom-Up Approach
1. Define the scope and coverage of the data warehouse and analyze the source systems within this scope.
2. Define the initial increment based on political pressure, assumed business benefit, and data volume.
3. Implement the base technical architecture, and establish metadata, extraction, and load processes as required by the increment.
4. Create and populate the initial subject areas within the overall warehouse framework.

Incremental Approach to Warehouse Development
• Multiple iterations
• Shorter implementations
• Validation of each phase
(Diagram: Increment 1 — Strategy, Definition, Analysis, Design, Iterative Build, Production)

Data Warehousing Process Components
• Methodology
• Architecture
• Extraction, transformation, and load (ETL)
• Implementation
• Operation and support

Methodology
• Ensures a successful data warehouse
• Encourages incremental development
• Provides a staged approach to an enterprisewide warehouse that is:
  • Safe
  • Manageable
  • Proven
  • Recommended

Architecture
"Provides the planning, structure, and standardization needed to ensure integration of multiple components, projects, and processes across time." "Establishes the framework, standards, and procedures for the data warehouse at an enterprise level." — The Data Warehousing Institute

Extraction, Transformation, and Load (ETL)
"Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project." — DM Review, March 2001
(Diagram: Source → Staging area → Target)

Implementation
(Diagram: The data warehouse architecture is implemented incrementally, for example increment 1, increment 2, …, increment n)

Operation and Support
• Data access and reporting
• Refreshing warehouse data
• Monitoring
• Responding to change

Phases of the Incremental Approach
Each increment passes through the phases Strategy, Definition, Analysis, Design, Build, and Production.

Strategy Phase Deliverables
• Business goals and objectives
• Data warehouse purpose, objectives, and scope
• Enterprise data warehouse logical model
• Incremental milestones
• Source systems data flows
• Subject area gap analysis
• Data acquisition strategy
• Data quality strategy
• Metadata strategy
• Data access environment
• Training strategy

Summary
In this lesson, you should have learned how to:
• Identify a common, broadly accepted definition of a data warehouse
• Describe the differences between dependent and independent data marts
• Identify some of the main warehouse development approaches
• Recognize some of the operational properties and common terminology of a data warehouse

Practice 2-1 Overview
This practice covers the following topics:
• Answering questions regarding data warehousing concepts and terminology
• Discussing some of the data warehouse concepts and terminology

Lesson 3: Modeling the Data Warehouse

Objectives
After completing this lesson, you should be able to do the following:
• Discuss data warehouse environment data structures
• Discuss the data warehouse database design phases:
  • Defining the business model
  • Defining the dimensional model
  • Defining the physical model

Data Warehouse Modeling Issues
Among the main issues that data warehouse data modelers face are:
• Different data types
• Many ways to use warehouse data
• Many ways to structure the data
• Multiple modeling techniques
• Planned replication
• Large volumes of data

Data Warehouse Environment Data Structures
The data modeling structures that are commonly found in a data warehouse environment are:
• Third normal form (3NF)
• Star schema
• Snowflake schema

Star Schema Model
A central fact table is surrounded by denormalized dimension tables.
(Diagram: The Sales fact table — Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units — joins to the Product, Store, Time, and Item dimension tables)

Snowflake Schema Model
(Diagram: The Sales fact table joins to the Product, Store, Time, and Item dimensions; the Store dimension is further normalized into a District table, and the Item dimension into Dept and Mgr tables)
• Can be used directly by some tools
• More flexible to change
• Provides for speedier data loading
• Can become large and unmanageable
• Degrades query performance
• Requires more complex metadata
(Example hierarchy: Country > State > County > City)

Data Warehouse Database Design Phases
• Phase 1: Defining the business model
• Phase 2: Defining the dimensional model
• Phase 3: Defining the physical model

Phase 1: Defining the Business Model
• Performing strategic analysis
• Creating the business model
• Documenting metadata

Performing Strategic Analysis
• Identify crucial business processes.
• Understand the business processes.
• Prioritize and select the business processes to implement.
(Diagram: Candidate processes are plotted by business benefit against feasibility)

Creating the Business Model
Defining business requirements:
• Identifying the business measures
• Identifying the dimensions
• Identifying the grain
• Identifying the business definitions and rules
• Verifying data sources

Business Requirements Drive the Design Process
• Primary input: business requirements
• Secondary input: existing metadata, the production ERD model, and research

Identifying Measures and Dimensions
• Measures are attributes that vary continuously: balance, units sold, cost, sales.
• Dimensions are attributes that are perceived as constant or discrete: product, location, time, size.

Using a Business Process Matrix
(Sample matrix: business processes — sales, returns, inventory — against business dimensions — customer, date, product, channel, promotion)

Determining Granularity
Year? Quarter? Month? Week? Day?

Identifying Business Rules
Examples:
• Location: geographic proximity (0–1 miles, 1–5 miles, > 5 miles)
• Time: Month > Quarter > Year
• Product: type (monitor: 15-inch, 17-inch, 19-inch; PC; server), status (none, new, rebuilt, custom)
• Store: Store > District > Region

Documenting Metadata
Documenting metadata should include:
• Documenting the design process
• Documenting the development process
• Providing a record of changes
• Recording enhancements over time

Metadata Documentation Approaches
• Automated:
  • Data modeling tools
  • ETL tools
  • End-user tools
• Manual

Phase 2: Defining the Dimensional Model
• Identify fact tables:
  • Translate business measures into fact tables.
  • Analyze source system information for additional measures.
• Identify dimension tables.
• Link the fact tables to the dimension tables.
• Model the time dimension.

Star Dimensional Modeling
(Diagram: The Sales fact table — Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units — joins to the Product, Time, Store, and Item dimension tables)
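The star dimensional model above can be sketched end to end in a few lines (SQLite via Python here; the table and column names follow the slides' sales example, and the data values are invented for illustration):

```python
import sqlite3

# A minimal star schema: one fact table joined to denormalized
# dimension tables through foreign keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
                              product_desc TEXT);
    CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY,
                              district_id INTEGER);
    CREATE TABLE sales_fact  (product_id INTEGER REFERENCES product_dim,
                              store_id   INTEGER REFERENCES store_dim,
                              day_id     TEXT,
                              sales_amount REAL,
                              sales_units  INTEGER);

    INSERT INTO product_dim VALUES (1, 'Ham Pizza'), (2, 'Cheese Pizza');
    INSERT INTO store_dim   VALUES (10, 100), (11, 100);
    INSERT INTO sales_fact  VALUES (1, 10, '2002-01-02', 10.0, 1),
                                   (2, 10, '2002-01-02', 15.0, 1),
                                   (2, 11, '2002-01-03', 30.0, 2);
""")

# A typical star query: join the fact table to a dimension through the
# foreign key and aggregate the additive measures.
cur.execute("""
    SELECT p.product_desc, SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    GROUP BY p.product_desc
    ORDER BY p.product_desc
""")
print(cur.fetchall())   # [('Cheese Pizza', 45.0, 3), ('Ham Pizza', 10.0, 1)]
```

Note how the query stays simple: every analysis is a join from the fact table out to one or more dimensions, which is why front-end tools support the star model so well.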
Fact Table Characteristics
Fact tables:
• Contain numerical metrics of the business
• Can hold large volumes of data
• Can grow quickly
• Can contain base, derived, and summarized data
• Are typically additive
• Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
(Example: the Sales fact table with Product_id, Store_id, Item_id, Day_id, Sales_amount, and Sales_units)

Dimension Table Characteristics
Dimension tables have the following characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference

Star Dimensional Model Characteristics
• The model is easy for users to understand.
• Primary keys represent a dimension.
• Non-foreign-key columns are values.
• Facts are usually highly normalized.
• Dimensions are completely denormalized.
• Fast response to queries is provided.
• Performance is improved by reducing table joins.
• End users can express complex queries.
• Support is provided by many front-end tools.

Using Time in the Data Warehouse
• Defining standards for time is critical.
• Aggregation based on time is complex.

The Time Dimension
• Time is critical to the data warehouse.
• A consistent representation of time is required for extensibility.
• Where should the element of time be stored: in the sales fact or in the time dimension?

Using Data Modeling Tools
• Tools with a GUI enable definition, modeling, and reporting.
• Avoid a mix of modeling techniques caused by:
  • Development pressures
  • Developers who lack knowledge
  • No strategy
• Determine a strategy.
• Write and publish it formally.
• Make it available electronically.

Phase 3: Defining the Physical Model
• Translate the dimensional design into a physical model for implementation.
• Define the storage strategy for tables and indexes.
• Perform database sizing.
• Define the initial indexing strategy.
• Define the partitioning strategy.
• Update the metadata document with physical information.

Physical Model Design Tasks
• Define naming and database standards.
• Perform database sizing.
• Develop an initial indexing strategy.
• Develop a data partition strategy.
• Define storage parameters.
• Set initialization parameters.
• Use parallel processing.
• Define summary data.
• Determine the hardware architecture.

Database Object Naming Conventions
• Develop a reasonable list of abbreviations.
• List all the objects' names, and work with the user community to define them.
• Resolve name disputes.
• Document your naming standards in the metadata document.
• Plan for the naming standards to be a living document.

Architectural Requirements
Scalability, manageability, availability, extensibility, integration, and flexibility, balanced against user, budget, business, and technology constraints.

Strategy for Architecture Definition
• Obtain existing architecture plans.
• Obtain existing capacity plans.
• Document existing interfaces.
• Prepare a capacity plan.
• Prepare the technical architecture.
• Document operating system requirements.
• Develop recovery plans.
• Develop security and control plans.
• Create the architecture.
• Create a technical risk assessment.

Hardware Requirements
• SMP
• Cluster
• MPP
• Hybrids (employing both SMP and MPP)

Making the Right Choice
• Requirements differ from operational systems.
• Benchmark:
  • Use benchmarks available from vendors.
  • Develop your own.
  • Use realistic queries.
• Scalability is important.

Storage and Performance Considerations
• Database sizing: test load sampling
• Data partitioning: horizontal and vertical partitioning
• Indexing: B-tree indexes, bitmap indexes, bitmap join indexes
• Star query optimization: star transformation

Database Sizing
• Sizing influences capacity planning and systems environment management.
• Sizing is required for the database and for other storage areas.
• Sizing is not an exact science; techniques vary.
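The horizontal-partitioning idea above (splitting table data by time so that each period can be managed independently) can be sketched as follows. This is a minimal illustration only: SQLite has no native partitioning, so each partition is modeled as its own table, whereas Oracle would use a single range-partitioned table; all names and routing rules are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One "partition" table per time range.
partitions = ("sales_2002_q1", "sales_2002_q2")
for name in partitions:
    cur.execute(f"CREATE TABLE {name} (day_id TEXT, sales_amount REAL)")

def target_partition(day_id: str) -> str:
    """Route a row to its partition by the candidate time column."""
    month = int(day_id[5:7])        # day_id is 'YYYY-MM-DD'
    return "sales_2002_q1" if month <= 3 else "sales_2002_q2"

rows = [("2002-02-10", 10.0), ("2002-05-01", 25.0)]
for day_id, amount in rows:
    cur.execute(f"INSERT INTO {target_partition(day_id)} VALUES (?, ?)",
                (day_id, amount))

# A query constrained to one period touches only one partition, which
# is what makes partitioned warehouse data easier to manage, archive,
# and remove.
cur.execute("SELECT COUNT(*) FROM sales_2002_q1")
print(cur.fetchone()[0])   # 1
```

The routing column is exactly the kind of column that "appears in a WHERE clause," which is why the slides list time as the prime partitioning candidate.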
Test Load Sampling
• Analyze a representative sample of the data, chosen using proven statistical methods.
• Ensure that the sample reflects:
  • Test loads for different periods
  • Day-to-day operations
  • Seasonal data and worst-case scenarios
  • Indexes and summaries

Data Partitioning
Partitioning breaks data up into separate physical units that can be handled independently. It provides ease of:
• Restructuring
• Reorganization
• Removal
• Recovery
• Monitoring
• Management
• Archiving
• Indexing

Horizontal Partitioning
• Table and index data are split by:
  • Time
  • Sales region or person
  • Geography
  • Organization
  • Line of business
• Candidate columns appear in a WHERE clause.
• Analysis determines the requirements.

Vertical Partitioning
You can use vertical partitioning when:
• It improves the speed of query and update actions
• Users require access to specific columns
• Some data is changed infrequently
• Descriptive dimension text may be better moved away from the dimension itself

Partitioning Methods
• Range partitioning
• List partitioning
• Hash partitioning
• Composite partitioning:
  • Composite range-hash partitioning
  • Composite range-list partitioning
• Index partitioning

Indexing
Indexing is used for the following reasons:
• It is a huge cost saving, greatly improving performance and scalability.
• It can replace a full table scan with a quick read of the index, followed by a read of only those disk blocks that contain the rows that are needed.

B-Tree Indexes
• The most common type of index
• Used for high-cardinality columns
• Designed for queries that return few rows

Bitmap Indexes
• Provide performance benefits and storage savings
• Store values as 1s and 0s
• Are used instead of B-tree indexes when:
  • Tables are large
  • Columns have relatively low cardinality

Bitmap Join Indexes
A bitmap join index is a bitmap index for the join of two or more tables:
• They are new in Oracle9i.
• They provide better performance and storage savings.

Parallelism
(Diagram: Parallel execution servers P1–P3 scan the Sales and Customers tables concurrently)

Using Summary Data
Designing summary tables offers the following benefits:
• Provides fast access to precomputed data
• Reduces the use of I/O, CPU, and memory

Summary
In this lesson, you should have learned how to:
• Describe data warehouse environment data structures
• Define the business model:
  • Performing strategic analysis
  • Creating the business model
  • Identifying business rules
• Define the dimensional model:
  • Star dimensional model characteristics
• Define the physical model:
  • Physical model design tasks
  • Architectural and hardware requirements
  • Storage and performance considerations

Practice 3-1 Overview
This practice covers the following topics:
• Specifying true or false for a series of statements
• Completing a series of sentences accurately
• Practicing identifying a simple business model
• Identifying indexing methods

Lesson 4: Building the Data Warehouse: Extracting Data

Objectives
After completing this lesson, you should be able to do the following:
• Outline the ETL (extraction, transformation, and loading) processes for building a data warehouse
• Identify ETL tasks, importance, and cost
• Explain how to examine data sources
• Identify extraction techniques and methods
• Identify analysis issues and design options for extraction processes
• List the selection criteria for ETL tools

Extraction, Transformation, and Loading (ETL) Processes
• Extract source data
• Transform and clean data
• Index and summarize
• Load data into the warehouse
• Detect changes
• Refresh data
(Diagram: Operational systems feed the data warehouse through programs, gateways, and tools)

ETL: Tasks, Importance, and Cost
• Extract, clean up, consolidate, restructure, load, maintain, and refresh
• The result should be relevant, useful, quality, accurate, and accessible data

Extracting Data
• Source systems: data comes from various data sources in various formats.
• Extraction routines:
  • Are developed to select data fields from the sources
  • Consist of business rules, audit trails, and error correction facilities
• Data mapping
(Diagram: Operational databases → data staging area (transform) → warehouse database)

Examining Data Sources
• Production data
• Archive data
• Internal data
• External data

Production Data
• Operating system platforms
• File systems (for example, VSAM)
• Database systems and vertical applications: IMS, DB2, Oracle, Sybase, Informix, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials

Archive Data
• Historical data
• Useful for analysis over long periods of time
• Useful for the first-time load
• May require unique transformations

Internal Data
• Planning, sales, and marketing organization data
• Maintained in the form of:
  • Spreadsheets (structured)
  • Documents (unstructured)
• Treated like any other source data

External Data
• Information from outside the organization
• Issues of frequency, format, and predictability
• Described and tracked using metadata
• Examples: purchased databases (A.C. Nielsen, IRI, IMS, Walsh America, Dun and Bradstreet), competitive information, economic forecasts (Barron's, The Wall Street Journal)

Mapping Data
Mapping data defines:
• Which operational attributes to use
• How to transform the attributes for the warehouse
• Where the attributes exist in the warehouse
(Example: source File A fields F1, F2, F3 — "123", "Bloggs", "10/12/56" — map to Staging File One fields Number "USA123", Name "Mr. Bloggs", DOB "10-Dec-56"; the field-to-field mapping itself is recorded as metadata)

Extraction Techniques
• Programs: C, C++, COBOL, PL/SQL, Java
• Gateways: transparent database access
• Tools:
  • In-house developed tools
  • Vendors' data extraction tools

Extraction Methods
• Logical extraction methods:
  • Full extraction
  • Incremental extraction
• Physical extraction methods:
  • Online extraction
  • Offline extraction

Designing Extraction Processes
• Analysis:
  • Sources and technologies
  • Data types, quality, and owners
• Design options:
  • Manual, custom, gateway, or third-party
  • Replication, full, or delta refresh
• Design issues:
  • Volume and consistency of data
  • Automation, skills needed, and resources

Maintaining Extraction Metadata
• Source location, type, and structure
• Access method
• Privilege information
• Temporary storage
• Failure procedures
• Validity checks
• Handlers for missing data

Extraction Tools: Selection Criteria
• Base functionality
• Interface features
• Metadata repository
• Open API
• Metadata access
• Repository utilities
• Input and output processing
• Cleansing, reformatting, and auditing
• References
• Training requirements

Possible ETL Failures
• A missing source file
• A system failure
• Inadequate metadata
• Poor mapping information
• Inadequate storage planning
• A source structural change
• No contingency plan
• Inadequate data validation

Maintaining ETL Quality
• ETL must be tested, documented, and monitored and reviewed.
• Disparate metadata must be coordinated.
noynot@163.com Summary In this lesson, you should have learned how to: Outline the ETL (Extraction, Transformation, and Loading) processes for building a data warehouse Identify ETL tasks, importance, and cost Explain how to examine data sources Identify extraction techniques and methods Identify analysis issues and design options for extraction processes List the selection criteria for the ETL tools Identify Oracle’s solution for ETL process noynot@163.com Practice 4-1 Overview This practice covers the following topics: Answering a series of short questions Answering questions based on the business scenario for Frontier Airways noynot@163.com Lesson 5 Building the Data Warehouse: Transforming Data Objectives After completing this lesson, you should be able to do the following: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data Describe techniques for transforming data Design transformation process noynot@163.com Transformation Transformation eliminates anomalies from operational data: Cleans and standardizes Presents subject-oriented data Extract Transform: Clean up Consolidate Restructure Warehouse Operational systems Load Data Staging Area noynot@163.com Possible Staging Models Remote staging model Onsite staging model noynot@163.com Remote Staging Model Data staging area within the warehouse environment Transform Extract Operational system Load Staging area Warehouse Data staging area in its own environment Transform Operational system Extract Load Staging area Warehouse noynot@163.com On-site Staging Model Data staging area within the operational environment, possibly affecting the operational system Transform Extract Operational system Load Staging area Warehouse noynot@163.com Data Anomalies No unique key Data naming and coding anomalies Data meaning anomalies between groups Spelling and text inconsistencies CUSNUM NAME ADDRESS 90233479 Oracle Limited 100 N.E. 1st St. 
90233489 Oracle Computing 15 Main Road, Ft. Lauderdale 90234889 Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA 90345672 Oracle Corp UK Ltd 181 North Street, Key West, FLA noynot@163.com Transformation Routines Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating data Transforming data before load noynot@163.com Transforming Data: Problems and Solutions Multipart keys Multiple local standards Multiple files Missing values Duplicate values Element names Element meanings Input formats Referential Integrity constraints Name and address noynot@163.com Multipart Keys Problem Multipart keys Product code = 12 M 654313 45 Salesperson code Country Sales code territory Product number noynot@163.com Multiple Local Standards Problem Multiple local standards Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 noynot@163.com Multiple Files Problem Added complexity of multiple source files Start simple Multiple source files Logic to detect correct source Transformed data noynot@163.com Missing Values Problem Solution: Ignore Wait Mark rows Extract when time-stamped If NULL then field = ‘A’ A noynot@163.com Duplicate Values Problem Solution: SQL self-join techniques RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> 2 3 4 5 6 7 SELECT ... FROM table_a, table_b WHERE table_a.key (+)= table_b.key UNION SELECT ... FROM table_a, table_b WHERE table_a.key = table_b.key (+); noynot@163.com Element Names Problem Solution: Common naming conventions Customer Client Customer Contact Name noynot@163.com Element Meaning Problem Avoid misinterpretation Complex solution Document meaning in metadata Customer’s name All customer details All details except name Customer_detail noynot@163.com Input Format Problem EBCDIC “123-73” ASCII 12373 ACME Co. 
vs. ברוכים הבאים ("Welcome" in Hebrew — character-set conversion problems); Beer (Pack of 8)

Referential Integrity
Problem: child rows reference parents that do not exist.
Solutions: SQL anti-join; server constraints; dedicated tools.

Department        Emp   Name    Department
10                1099  Smith   10
20                1289  Jones   20
30                1234  Doe     50
40                6786  Harris  60

Name and Address
Problem: inconsistent name and address formats.
• Single-field format: Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
• Multiple-field format: Name = Mr. J. Smith; Street = 100 Main St.; Town = Bigtown; Country = County Luth; Code = 23565
• Database 1 (NAME, LOCATION): DIANNE ZIEFELD, N100; HARRY H. ENFIELD, M300
• Database 2 (NAME, LOCATION): ZIEFELD, DIANNE, 100; ENFIELD, HARRY H, 300

Quality Data: Importance and Benefits
Quality data is key to a successful warehouse implementation. Quality data helps you in:
• Targeting the right customers
• Determining buying patterns
• Identifying householders: private and commercial
• Matching customers
• Identifying historical data

Data Quality Guidelines
Operational data:
• Should not be used directly in the warehouse
• Must be cleaned for each increment
• Is not simply fixed by modifying applications

Transformation Techniques
• Merging data
• Adding a date stamp
• Adding keys to data

Merging Data
Operational transactions do not usually map one-to-one with warehouse data. Data for the warehouse is merged to provide information for analysis.
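This merge step can be sketched in Python using the pizza sales/returns example; the function name and row layout here are ours for illustration, not part of the course's Oracle tooling:

```python
from collections import Counter

def merge_transactions(rows):
    """Net out sale/return pairs so only completed sales reach the warehouse.

    Each row is (kind, date, time, product, amount); a return carries a
    negative amount that cancels one matching sale of the same product.
    """
    returns = Counter()
    for kind, _, _, product, amount in rows:
        if kind == "Return":
            returns[(product, -amount)] += 1
    merged = []
    for row in rows:
        kind, _, _, product, amount = row
        if kind == "Return":
            continue
        if returns[(product, amount)] > 0:
            returns[(product, amount)] -= 1  # cancelled by a return
            continue
        merged.append(row)
    return merged

rows = [
    ("Sale",   "1/2/02", "12:00:01", "Ham Pizza",     10.00),
    ("Sale",   "1/2/02", "12:00:02", "Cheese Pizza",  15.00),
    ("Sale",   "1/2/02", "12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "1/2/02", "12:00:03", "Anchovy Pizza", -12.00),
    ("Sale",   "1/2/02", "12:00:04", "Sausage Pizza", 11.00),
]
print([r[3] for r in merge_transactions(rows)])
# ['Ham Pizza', 'Cheese Pizza', 'Sausage Pizza']
```

In practice this logic would usually run in the staging area as SQL or PL/SQL; the sketch only shows the matching idea.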
Pizza sales/returns by day, hour, and second:

Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00

After merging, the anchovy sale and its return cancel out, leaving the pizza sales:

Sale  1/2/02  12:00:01  Ham Pizza      $10.00
Sale  1/2/02  12:00:02  Cheese Pizza   $15.00
Sale  1/2/02  12:00:04  Sausage Pizza  $11.00

Adding a Date Stamp
The time element can be represented as a:
• Single point in time
• Time span
Add the time element to:
• Fact tables
• Dimension data

Adding a Date Stamp: Fact Tables and Dimensions
• Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
• Product Table: Product_id, Time_key, Product_desc
• Time Table: Week_id, Period_id, Year_id, Time_key
• Store Table: Store_id, District_id, Time_key
• Item Table: Item_id, Dept_id, Time_key

Adding Keys to Data
Source rows with operational keys:

#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Warehouse rows keyed by data values or artificial keys:

#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Summarizing Data
Summaries can be built:
1. During extraction, in the staging area
2.
After loading, on the warehouse server
Flow: Operational databases → Staging area → Warehouse database

Maintaining Transformation Metadata
Transformation metadata contains:
• Transformation rules
• Algorithms and routines
• Sources
Flow: Sources → Stage → Rules → Extract → Transform → Load → Publish → Query

Maintaining Transformation Metadata
• Restructure keys
• Identify and resolve coding differences
• Validate data from multiple sources
• Handle exception rules
• Identify and resolve format differences
• Fix referential integrity inconsistencies
• Identify summary data

Data Ownership and Responsibilities
Data ownership and responsibilities should be shared by the:
• Operational team
• Data warehouse team
Business benefit is gained with a "work together" approach.

Transformation Timing and Location
Transformation is performed:
• Before load
• In parallel
It can be initiated at different points:
• On the operational platform
• In a separate staging area

Choosing a Transformation Point
• Workload
• Impact on environment
• CPU usage
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs

Monitoring and Tracking
Transformations should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions

Designing Transformation Processes
Analysis:
• Sources and target mappings, business rules
• Key users, metadata, grain
Design options:
• Third-party tools
• Custom 3GL programs (FORTRAN, COBOL, C, C++, Java)
• 4GLs such as SQL or PL/SQL
• Replication
Design issues:
• Performance
• Size of the staging area
• Exception handling, integrity maintenance

Transformation Tools
• Third-party tools
• SQL*Loader
• In-house developed programs

Summary
In this lesson, you should have learned how to:
• Define transformation
• Identify possible staging models
• Identify data anomalies and eliminate them
• Explain the importance of quality data
• Describe techniques for transforming data
• Design the transformation process

Practice 5-1 Overview
This practice covers the following topics:
• Answering a series of questions based on the business scenario for Frontier Airways
• Answering a series of short questions

Lesson 6: Building the Data Warehouse: Loading Warehouse Data

Objectives
After completing this lesson, you should be able to do the following:
• Explain key concepts in loading warehouse data
• Outline how to build the loading process for the initial load
• Identify loading techniques
• Describe the loading techniques provided by Oracle
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes

Loading Data into the Warehouse
• Loading moves the data into the warehouse.
• Loading can be time-consuming:
  - Consider the load window
  - Schedule and automate the loading
• The initial load moves large volumes of data.
• Subsequent refreshes move smaller volumes of data.
Flow: Operational databases → Extract → Staging area → Transform → Transport, Load → Warehouse database

Initial Load and Refresh
Initial load:
• A single event that populates the database with historical data
• Involves large volumes of data
• Employs distinct ETL tasks
• Involves large amounts of processing after load
Refresh:
• Performed according to a business cycle
• Less data to load than the first-time load
• Less-complex ETL tasks
• Smaller amounts of post-load processing

Data Refresh Models: Extract Processing Environment
• After each time interval, build a new snapshot of the database.
• Purge old snapshots.
(Operational databases feed snapshots T1, T2, T3.)

Data Refresh Models: Warehouse Processing Environment
• Build a new database.
• After each time interval, add changes to the database.
• Archive or purge the oldest data.
(Operational databases feed intervals T1, T2, T3.)

Building the Loading Process
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth

Building the Loading Process
• Test the proposed technique.
• Document the proposed load.
• Monitor, review, and revise.

Data Granularity
An important design and operational issue:
• Low-level grain: expensive, high level of processing, more disk space, more detail
• High-level grain: cheaper, less processing, less disk space, little detail

Loading Techniques
• Tools
• Utilities and 3GL
• Gateways
• Customized copy programs
• Replication
• FTP
• Manual

Loading Technique Considerations
• Tools are comprehensive, but costly.
• Data-movement utilities are fast and powerful.
• Gateways are suitable for specific instances:
  - Access other databases
  - Supply dependent data marts
  - Support a distributed environment
  - Provide real-time access if needed
• Use customized programs as a last resort.
• Replication is limited by data-transfer rates.

Post-Processing of Loaded Data
Flow: Extract → Staging area (Transform) → Load → Warehouse
Post-processing of loaded data: create indexes, generate keys, summarize, filter.

Indexing Data
• Before load: enable indexes at the server.
• During load: adds time to the load window; a row-by-row approach.
• After load: adds time to the load window, but faster than the row-by-row approach.

Unique Indexes
• Disable constraints before load.
• Enable constraints after load.
• Re-create the index if necessary.
Sequence: disable constraints → load data → enable constraints → create index → catch errors → reprocess.

Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
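Such a derived key can be built, for example, by concatenating the operational key with a version number, or by drawing from a sequence — a minimal sketch (the function name and formatting are ours, not Oracle's):

```python
import itertools

def derived_key(operational_key, version):
    """Concatenate the operational key with a version number, so key
    109908 yields 109908_01, 109908_02, and so on."""
    return f"{operational_key}_{version:02d}"

# Alternatively, assign numbers sequentially from a list or sequence:
sequence = itertools.count(100)

print(derived_key(109908, 1))  # 109908_01
print(next(sequence))          # 100
```

In an Oracle warehouse the sequential variant would typically be a database SEQUENCE rather than application code.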
Methods:
• Concatenate the operational key with a number (109908 → 109908 01)
• Assign a number sequentially from a list (109908 → 100)

Summary Management
• Summary tables
• Materialized views
• Summary data

Filtering Data
Summary data is filtered from the warehouse to the data marts.

Verifying Data Integrity
• Load data into an intermediate file.
• Compare target flash totals with the totals before load (counts and amounts = flash totals).
• Preserve, inspect, and fix the data, then load.

Steps for Verifying Data Integrity
(Numbered flow: source files are extracted, loaded with SQL*Loader under a control file into the target, with rejected and logged rows written to .bad and .log files.)

Standard Quality Assurance Checks
• Load status
• Completion of the process
• Completeness of the data
• Data reconciliation
• Referential integrity violations
• Reprocessing
• Comparison of counts and amounts

Summary
In this lesson, you should have learned how to:
• Explain key concepts in loading data into the warehouse
• Outline how to build the loading process for the initial load
• Identify loading techniques
• Describe the loading techniques provided by Oracle
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and scheduling processes

Practice 6-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Answering questions based on the business scenario for Frontier Airways

Lesson 7: Refreshing Warehouse Data

Objectives
After completing this lesson, you should be able to do the following:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Describe the Change Data Capture mechanism and refresh mechanisms supported in Oracle9i
• Describe techniques for purging and archiving data, and outline the techniques supported by Oracle
• Outline final tasks, such as publishing the data, controlling access,
and automating processes
• List the selection criteria for choosing ETL tools

Developing a Refresh Strategy for Capturing Changed Data
• Consider the load window.
• Identify data volumes.
• Identify the cycle.
• Know the technical infrastructure.
• Plan a staging area.
• Determine how to detect changes.

User Requirements and Assistance
• Users define the refresh cycle.
• IT balances requirements against technical issues.
• Document all tasks and processes.
• Employ user skills.

Load Window Requirements
The time available for the entire ETL process: plan, test, prove, monitor.
(Timeline: load windows overnight, with the user access period between them.)

Planning the Load Window
• Plan and build processes according to a strategy.
• Consider volumes of data.
• Identify the technical infrastructure.
• Ensure currency of data.
• Consider user access requirements first.
• High-availability requirements may mean a small load window.

Scheduling the Load Window
1. Receive data by FTP (by 3 a.m.).
2. Check the requirements for the load cycle.
3. Read the control file: file names, file types, number of files, number of loads, first-time load or refresh, date of file, date range, record counts, and amount totals.
4. Run the control process: open and read the files to verify and analyze them.

Scheduling the Load Window (3 a.m. – 9 a.m.)
5. Load into the warehouse (parallel load).
6. Verify, analyze, and reapply.
7. Index the data.
8. Create summaries.
9. Update metadata.

Scheduling the Load Window (by 9 a.m.)
10. Back up the warehouse (by 6 a.m.).
11. Create views for specialized tools.
12. Users access summary data.
13. Publish.
(User access begins.)

Capturing Changed Data for Refresh
• Capture new fact data.
• Capture changed dimension data.
• Determine the method of capture in each case.
Methods:
• Wholesale data replacement
• Comparison of database instances
• Time stamping
• Database triggers
• Database log

Wholesale Data Replacement
• Expensive
• Useful for data marts with less data
• Only limited historical data analysis is possible
• The time period often exceeds the load window
• Mirroring techniques can be used to provide access to the users

Comparison of Database Instances
• Delta file:
  - Holds the changes to operational data since the last refresh
  - Is used to update the warehouse
• Simple to perform, but expensive in terms of time and processing
• Efficient for smaller volumes of data
(Yesterday's operational database is compared with today's; the delta file holds the changed data.)

Time and Date Stamping
• Fast scanning for records changed since the last refresh cycle
• Useful for data with an updated-date field
• No detection of deleted data
(The delta file holds changed data selected by time stamp.)

Database Triggers
• Changed data is intercepted at the server level.
• Extra I/O is required.
• Maintenance overhead.
(Triggers on the operational server (RDBMS) populate the delta file.)

Using a Database Log
• Contains before and after images
• Requires a system checkpoint
• A common technique
(The log is analyzed and the changed data is extracted into the delta file.)

Choosing a Method for Change Data Capture
• Consider each method on its merits.
• Consider a hybrid approach if one approach is not suitable.
• Consider current technical, operational, and application issues.
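The instance-comparison method can be sketched as a diff of two snapshots into a delta file; the function, keys, and sample rows below are hypothetical illustrations, not Oracle's Change Data Capture API:

```python
def build_delta(yesterday, today):
    """Compare two snapshots keyed by primary key and emit a delta of
    inserts, updates, and deletes since the last refresh."""
    delta = {"insert": [], "update": [], "delete": []}
    for key, row in today.items():
        if key not in yesterday:
            delta["insert"].append((key, row))
        elif yesterday[key] != row:
            delta["update"].append((key, row))
    for key in yesterday:
        if key not in today:
            delta["delete"].append(key)
    return delta

yesterday = {42135: ("John Doe", "Single"), 42136: ("Ann Lee", "Married")}
today     = {42135: ("John Doe", "Married"), 42137: ("Bo Chan", "Single")}
delta = build_delta(yesterday, today)
print(delta["update"])  # [(42135, ('John Doe', 'Married'))]
print(delta["insert"])  # [(42137, ('Bo Chan', 'Single'))]
print(delta["delete"])  # [42136]
```

Note the trade-off the slides describe: the comparison is simple but touches every row of both snapshots, which is why it suits smaller data volumes.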
Applying the Changes to Data
You have a choice of techniques:
• Overwrite a record
• Add a record
• Add a field
• Maintain history
• Add version numbers

Overwriting a Record
• Easy to implement
• Loses all history
• Not recommended
(42135 John Doe Married overwrites 42135 John Doe Single.)

Adding a New Record
• History is preserved; dimensions grow.
• Time constraints are not required.
• A generalized key is created.
• Metadata tracks the usage of keys.
(42135 John Doe Single; 42135_01 John Doe Married)

Adding a Current Field
• Maintains some history
• Loses intermediate values
• Is enhanced by adding an effective-date field
(42135 John Doe Single becomes 42135 John Doe Single Married 1-Jan-01.)

Limitations of Methods for Applying Changes
• History is difficult to maintain.
• Dimensions may grow large.
• Maintenance overhead.

Maintaining History: Techniques
• History tables
• One-to-many relationships
• Versioning
• Preserve complete history

Maintaining History: Techniques
History tables:
• Normalize dimensions
• Hold current and historical data
One-to-many relationships:
• One current record and many history records
(HIST_CUST linked to CUSTOMER in a star of Time, Customer, Sales, and Product.)

Versioning
• Avoid double counting.
• Facts hold a version number.

Customer.CustId  Version  Customer Name
1234             1        Comer
1234             2        Comer

Sales.CustId     Version  Sales Facts
1234             1        $11,000
1234             2        $12,000

Preserve Complete History
Complete history:
• Enables realistic historical analysis
• Retains the context of data
The model must be able to:
• Reflect business changes
• Maintain context between fact and dimension data
• Retain sufficient data to relate old to new

Purging and Archiving Data
As data ages, its value depreciates.
Remove old data from the warehouse:
• Archive it for later use (if needed)
• Purge it without a copy

Final Tasks
• Update metadata.
• Publish the data.
• Use database roles to control access to the warehouse.
Flow: Sources → Stage → Rules → Extract → Transform → Load → Publish → Query

Publishing Data
• Control access using database roles.
• Compromise between load action and user access.
• Consider:
  - Staggering updates
  - Using temporary tables
  - Using separate tables

ETL Tools: Selection Criteria
• Overlap with existing tools
• Availability of a meta model
• Supported data sources
• Ease of modification and maintenance
• Required fine-tuning of code
• Ease of change control
• Power of transformation logic
• Level of modularization
• Power of error, exception, and resubmission features
• Intuitive documentation
• Performance of code

ETL Tool Selection Criteria
• Activity scheduling and sophistication
• Metadata generation
• Learning curve
• Flexibility
• Supported operating systems
• Cost

Summary
In this lesson, you should have learned how to:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Describe the Change Data Capture mechanism and refresh mechanisms supported in Oracle9i
• Describe techniques for purging and archiving data, and outline the techniques supported by Oracle
• Outline final tasks, such as publishing the data, controlling access, and automating processes
• List the selection criteria for choosing ETL tools

Practice 7-1 Overview
This practice covers the following topics:
• Answering a series of questions based on the business scenario for Frontier Airways
• Answering a series of short questions

Lesson 8: Leaving a Metadata Trail

Objectives
After completing this lesson, you should be able to do the following:
• Define warehouse metadata, its types, and its role in a warehouse environment
• Examine each type of warehouse metadata
• Develop a metadata strategy
• Outline the Common Warehouse Metamodel (CWM)
Defining Warehouse Metadata
• Data about warehouse data and processing
• Vital to the warehouse
• Used by everyone
• The key to understanding warehouse information

Metadata Users
End users, developers, and IT professionals all share the metadata repository.

Types of Metadata
End-user metadata:
• Key to a good warehouse
• Navigation aid
• Information provider
ETL metadata:
• Maps structure
• Source and target information
• Transformations
• Context
Operational metadata:
• Load, management, and scheduling processes
• Performance

Examining Types of Metadata
(ETL metadata and end-user metadata share a repository linking external sources, operational data sources, the ETL process, the warehouse, and end users.)

Examining Metadata: ETL Metadata
• Business rules
• Source tables, fields, and key values
• Ownership
• Field conversions
• Encoding and reference tables
• Name changes
• Key value changes
• Default values
• Logic to handle multiple sources
• Algorithms
• Time stamp

Extraction Metadata
• Space and storage requirements
• Source location information
• Diverse source data
• Access information
• Security
• Contacts
• Program names
• Frequency details
• Failure procedures
• Validity-checking information

Transformation Metadata
• Duplication routines
• Exception handling
• Key restructuring
• Grain conversions
• Program names
• Frequency
• Summarization

Loading Metadata
• Method of transfer
• Frequency
• Validation procedures
• Failure procedures
• Deployment rules
• Contact information

Examining Metadata: End-User Metadata
• Location of facts and dimensions
• Availability
• Description of contents and the algorithms used for derived and summary data
• Data ownership details

End-User Metadata: Context
• Users need to know the context of the table they query.
• Associate the metadata with its description.

Example of End-User Metadata

Table    Column      Data    Meaning
Product  Prod_ID     739516  Unique identifier for the product
Product  Valid_date  01/97   Last refresh date
Product  Ware_loc    1816    Warehouse location number
Product  Ware_bin    666     Warehouse bin number
Product  Code        15      The color of the product; refer to table COL_REF for details
Product  Weight      17.62   Packed shipping weight in kilograms

Historic Context of Data
• Supports change history
• Maintains the context of information
(Timeline: 1994–1998)

Types of Context
Simple:
• Data structures
• Naming conventions
• Metrics
Complex:
• Product definitions
• Markets
• Pricing
External:
• Economic
• Political

Developing a Metadata Strategy
Define a strategy to ensure high-quality metadata that is useful to users and developers. Primary strategy considerations:
• Define goals and intended use
• Identify target users
• Choose tools and techniques
• Choose the metadata location
• Manage the metadata
• Manage access to the metadata
• Integrate metadata from multiple tools
• Manage change

Defining Metadata Goals and Intended Usage
• Define clear goals.
• Identify requirements.
• Identify intended usage.

Identifying Target Metadata Users
• Who are the metadata users?
  - Developers
  - End users
• What information do they need?
• How will they access the metadata?
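One way to picture what end users get from the repository is a lookup keyed by table and column, as in the Product example above; this dict-based structure is a toy illustration, not an Oracle repository API:

```python
# A toy end-user metadata repository keyed by (table, column).
repository = {
    ("Product", "Prod_ID"): "Unique identifier for the product",
    ("Product", "Code"):    "The color of the product; refer to table COL_REF",
    ("Product", "Weight"):  "Packed shipping weight in kilograms",
}

def describe(table, column):
    """End users look up the meaning of a warehouse column before querying."""
    return repository.get((table, column), "No metadata recorded")

print(describe("Product", "Weight"))  # Packed shipping weight in kilograms
```

The point is the access pattern: users consult the description of a column (its meaning, units, and refresh date) before trusting query results built on it.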
Choosing Metadata Tools and Techniques
• Tools:
  - Data modeling
  - ETL
  - End user (query and analysis)
• Database schema definitions
• COBOL copybooks
• Middleware tools

Choosing the Metadata Location
• Usually the warehouse server
• Possibly on operational platforms
• Desktop tool with a metalayer

Managing the Metadata
• Managed by the metadata manager
• Maintained by the metadata architect
• Standards should be followed

Integrating Multiple Sets of Metadata
• Multiple tools may generate their own metadata.
• These metalayers should be properly integrated.
• Metadata exchangeability is desirable.

Managing Changes to Metadata
• Different types of metadata have different rates of change.
• Consider metadata changes resulting from refresh cycles.

Additional Metadata Content and Considerations
• Summarization algorithms
• Relationships
• Stewardship
• Permissions
• Pattern analysis
• Reference tables

Common Warehouse Metamodel
(Any source — ERP, operational, external — flows through data integration into the warehouse and marts, then through information delivery to any access method: reporting, ad hoc query and analysis, and data mining. Design, administration, and analytic applications all share the CWM metadata repository.)

Summary
In this lesson, you should have learned how to:
• Define warehouse metadata, its types, and its role in a warehouse environment
• Examine each type of warehouse metadata
• Develop a metadata strategy
• Outline the Common Warehouse Metamodel (CWM)

Practice 8-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Answering questions based on the business scenario for Frontier Airways

Lesson 9: Managing and Maintaining the Data Warehouse

Objectives
After completing this lesson, you should be able to do the following:
• Develop a plan for managing the transition from development to implementation
• Identify challenges pertaining to the growth of the data warehouse
• Describe backup and archive
mechanisms
• Identify data warehouse performance issues

Managing the Transition to Production
• Promoting support for change
• Pilot versus large-scale implementation
• Documentation
• Testing
• Training
• Post-implementation support
• Maintaining the warehouse

Promoting Support for the Data Warehouse
• Awareness
• Feedback
• Information
• Skills
• Education
• Direction
• Control

Choosing Between Pilot and Large-Scale Implementation

The Warehouse Pilot
Demonstrates benefits to:
• Management
• Users
• IT staff
A good pilot is:
• Relevant to the business
• Low in technical risk
• Small and feasible
• Prepared for increased use
• Focused on an initial business issue
• Kept in context

Piloting the Warehouse
Designers:
• Prove the model, data, and access tools
Users:
• Prove the ease of use of the tool
• Check data and query performance
• Identify training requirements
Developers:
• Resolve ETL and metadata issues
• Determine users' data and training requirements
• Test security and access levels; monitor performance

Documentation
Produces textual deliverables:
• Glossary
• User and technical documentation
• Online help
• Metadata reference guide
• Warehouse management reference
• New features guide

Testing the Warehouse
• Test every stage.
• Use a realistic test database and environment.

Training
Users:
• Metadata
• DSS tools
• Ad hoc queries
• Getting help
• Registering enhancement requests
Information systems developers:
• Analysis techniques
• Hardware technicalities
• Networking
• Implementing, building, and supporting DSS

Post-Implementation Support
Evaluate and review the implementation.
Monitor the warehouse:
• Respond to problems
• Conduct performance tuning
• Roll out metadata, queries, reports, filters, and conditions
• Implement security
• Incorporate new users
• Distribute data marts and catalogs
• Transfer ownership from IT

Monitoring the Success of the Data Warehouse
(Chart: number of users over the period after implementation — initial, 3 months, 6 months, 12 months, 24 months.)

Measuring the Success of the Data Warehouse
Metrics may include:
• Availability
• Response time
• Response to problems

Managing Growth
• Increasing number of users
• Broader usage
• Growth of data volumes

Expansion and Adjustment
Evaluate continually:
• Changes
• New increments
• Unnecessary components
• Strategies
Ensure an open environment. Document development processes for the future:
• Planning
• Cost analysis
• Problem assessment and correction
• Performance assessment

Controlling Expansion
• Ensure the continuity of staff.
• Document processes, solutions, and metrics.
• Establish a working test and production architecture for further increments.
• Create a strategy for maintaining changes to data.

Sizing Storage
• Consider different methods and determine the best one for your needs.
• Know the business requirements.
• Do not underestimate requirements.
• Plan for growth.
• Consider space for unwanted data.
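A back-of-envelope sizing calculation can make these guidelines concrete. All numbers here are hypothetical, and the 3–4x overhead multiplier is a commonly quoted rule of thumb for indexes, summaries, and sort space, not an Oracle figure:

```python
def estimate_storage_gb(rows_per_day, bytes_per_row, years, overhead_factor=3.5):
    """Back-of-envelope warehouse sizing: base fact-table volume times an
    assumed overhead factor covering indexes, summaries, and temp space."""
    base_bytes = rows_per_day * bytes_per_row * 365 * years
    return base_bytes * overhead_factor / 1024**3

# Hypothetical: 1M fact rows/day at 100 bytes each, kept for 3 years.
print(round(estimate_storage_gb(1_000_000, 100, 3)))  # 357 (GB)
```

Estimates like this are a starting point for the "plan for growth" guideline: the base fact volume is easy to compute, but the multiplier dominates the final figure and should be revisited as the design firms up.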
Estimating Storage
• Fact volumes
• Fact lifetime
• Technology availability
• Technology purchase
• Storing presummarized data
• Mirroring or other techniques requiring disk storage

Objects That Need Space
• ODS
• Indexes and metadata
• Summary data
• Redo logs
• Rollback information
• Sort areas
• Temporary space
• Workspace for backup and recovery

Other Considerations and Techniques
• Queuing models
• Rule of thumb: total database size is three to four times the size of the base fact tables
• Consider:
  - Sparseness
  - Dimensions
  - Indexes
  - Summaries
  - Sort and operational space

Space Management
• Monitor.
• Avoid fragmentation.
• Test load data.
• Plan for growth.
• Know business patterns.
• Never let space become an issue.

Archiving Data
• Determine data life expectancy.
• Identify archive frequency.
• Use read-only tablespaces.
• Include archiving in early specifications.

Purging Data
Reduce data volumes:
• Create summaries
• Remove unwanted base data
Choose the most effective method.

Identifying Data Warehouse Performance Issues
Improving query efficiency:
• Use indexes.
• Use query governors.
• Run large jobs out of hours.
• Consider a data mart approach.
Improving network performance:
• Provide sufficient bandwidth and optimize the configuration for access.
• Analyze traffic.
• Deploy data marts at remote locations.

Review and Revise
Monitor the warehouse:
• Usage
• Access
• Accurate grain
• Detail data
• Periodicity

Secret of Success
Think big; start small!

Course Summary
In this course, you should have learned that the successful warehouse:
• Is driven by the business
• Focuses on objectives
• Adds value to the business
• Can be understood and used
• Delivers good data
• Performs well
• Belongs to the users

Sample Study
Note: we discuss the creation of data marts, rather than the perhaps more familiar term, data warehouse.
Data warehouses tend to be large, one-stop-shopping repositories where all the historical data for the organization is stored. Nothing is wrong with this as a concept; however, attempting to create a data warehouse has often led to huge, multiyear technology projects that were never quite finished, or were outdated by the time they were done.

Data Mart Structure

Measures
A measure is a numeric quantity expressing some aspect of the organization's performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. A measure can also be called a fact; the tables that hold measure information are known as fact tables.

Dimensions
A dimension is a categorization used to spread out an aggregate measure to reveal its constituent parts. Dimensions are used to facilitate this slicing and dicing.

The Star Schema
A star schema is a relational database schema used to hold measures and dimensions in a data mart. The measures are stored in a fact table, and the dimensions are stored in dimension tables.

Attributes
An attribute is an additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member.

Hierarchies
A hierarchy is a structure made up of two or more levels of related dimensions. A dimension at an upper level of the hierarchy completely contains one or more dimensions from the next lower level of the hierarchy.

The Snowflake Schema
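To make the star-schema idea concrete — measures in a fact table, sliced by joining to a dimension — here is a minimal sketch; the table contents and names are ours for illustration:

```python
# Minimal star-schema sketch: a fact table of measures keyed by dimension
# ids, and a product dimension used to slice the aggregate.
product_dim = {1: "Ham Pizza", 2: "Cheese Pizza", 3: "Sausage Pizza"}
sales_fact = [  # (product_id, store_id, sales_dollars)
    (1, 10, 10.00), (2, 10, 15.00), (3, 20, 11.00), (2, 20, 15.00),
]

def sales_by_product():
    """Spread the total-sales measure out over the product dimension."""
    totals = {}
    for product_id, _, dollars in sales_fact:
        name = product_dim[product_id]
        totals[name] = totals.get(name, 0.0) + dollars
    return totals

print(sales_by_product())
# {'Ham Pizza': 10.0, 'Cheese Pizza': 30.0, 'Sausage Pizza': 11.0}
```

In SQL terms this is a join from the fact table to a dimension table followed by GROUP BY; slicing on store instead of product would use the other dimension key in the same way.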