Data Warehousing
Minsoo Lee
Dept. Computer Science and Engineering
Ewha Womans University

Heterogeneous Database Integration
Figure labels: Integration System; World Wide Web; Digital Libraries; Scientific Databases; Personal Databases
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing

Data Warehouse Basics
• Data warehouse is not an end in itself but part of the BI infrastructure
• Business Intelligence (BI)
  – Data Warehouse or Data Mart
  – On-Line Analytical Processing (OLAP)
  – Data Mining

Business Intelligence
• Why?
  – Organizations with purely operational systems
  – Lack of BI is an enormous competitive disadvantage
  – Unable to make meaningful information out of volumes of data
• The business of chess
  – What shall I do next? = Strategic thinking = chess
  – Business environment is full of unknowns
  – Business strategist: predict behavior of business nouns

Business Intelligence
• BI
  – Helps develop strategy
  – Must be able to anticipate future conditions
  – Need to understand the past
• Business Intelligence loop
  – Operational environment
  – Data Warehouse/Data Mart
  – Decision Support Systems (DSS)

Business Intelligence Loop
Figure labels: Business Strategist; Decision Support (OLAP, Data Mining, Reports); Data Storage (Data Warehouse); Extraction, Transformation, & Cleansing; CRM, Accounting, Finance, HR

Data Warehouse Architecture
Figure labels: DATA SOURCES (Operational Dbs, External Sources); Extract, Transform, Load, Refresh; Reconciled data; Metadata Repository; Monitoring & Administration; OLAP Servers; Serve; DATA MARTS; TOOLS (Analysis, Query/Reporting, Data Mining)

Features of the Data Warehouse
"A Data Warehouse is a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decision" – W.H. Inmon

Subject Orientation
• Transaction-oriented systems structure data in a way that optimizes processing of transactions (normalization)
  – DW is concerned with the business nouns (customers, products, sales, etc.)
• Operational data is distributed across multiple applications
  – DW gathers all data in one place

Subject Orientation
Figure labels: Operational Systems (Transaction Oriented): Accounts Payable, Order Processing, Accounts Receivable, File 1 through File 6; Data Warehouse (Subject Oriented): Customer Data, Product Data, Sales Data

Integration
• Forms a single cohesive environment
• Data cleansing and Data transformation
• Data cleansing
  – Removing errors from the input stream
  – A good cleansing process can improve quality of operational environment
  – Debate on appropriate action when detecting errors: correct in operational environment as well?
  – Cannot detect all errors

Integration
• Data Transformation
  – Receives input streams and transforms them into one consistent format
  – Issue of defining inconsistencies: Description, Encoding, Units of Measure, Format

Integration
               Sales Voucher                Purchase Order               Inventory
Description    Customer Name: I.B.M         Customer Name: IBM           Customer Name: International Business Machines
Encoding       Sex: 1 = Male, 2 = Female    Sex: M = Male, F = Female    Sex: X = Male, Y = Female
Units          Cable Length: Centimeters    Cable Length: Yards          Cable Length: Inches
Formats        Key: Character(10)           Key: Integer                 Key: pic ‘99999999’

Nonvolatile
• Once data is written, it remains unchanged in the DW
• Virtual read-only database system
• DB can eliminate background processes used for recovery (e.g. redo log)

Nonvolatile
Figure 1-5

Time-Variant Collection of Data
• Adds time dimension to the data warehouse
• Creates snapshot of the organization
• Can view patterns and trends over time

Supporting Management’s Decision
• DW user is the Business strategist
• Static reports generated by IT dept.
  can no longer satisfy business strategist
• Requires appropriate timely performance
• Design user interface for business strategist

Decision Support Systems
• DSS extends from the extraction of the data through the DW to the presentation to the business strategist
• Reporting, OLAP, Data Mining

Reporting
• The higher the level of the business strategist, the higher level of summarization required.
• Enterprise-class reporting
  – Rapid development
  – Easy maintenance
  – Easy distribution
  – Internet Enabled

On-Line Analytical Processing
• Leverages the time-variant characteristics for strategist to look both back and ahead in time
• MOLAP (Multi-dimensional OLAP)
• ROLAP (Relational OLAP)
• HOLAP (Hybrid OLAP)
• Typical OLAP interface: spreadsheet style
• Rotation, roll-up, drill-down
• Support “what-if” analysis - manipulate variables

Data Mining
• Data mining allows us to see the hidden picture
• Find Association, Classification
  – Association: relationship among data
  – Classification: segment into different classes
• Use subset of data: size depends on deviation of data characteristics
• Methods
  – Decision Trees, Neural Networks, Genetic Modeling

Data Warehouse Schema
• Star Schema
• Fact Constellation Schema
• Snowflake Schema

Star Schema
• A single, large and central fact table and one table for each dimension.
• Every fact points to one tuple in each of the dimensions and has additional attributes.
• Does not capture hierarchies directly

Star Schema
Fact Table:         Store Key, Product Key, Period Key, Units, Price
Store Dimension:    Store Key, Store Name, City, State, Region
Time Dimension:     Period Key, Year, Quarter, Month
Product Dimension:  Product Key, Product Desc
• Benefits: Easy to understand, easy to define hierarchies, reduces no. of physical joins.

SnowFlake Schema
• Variant of star schema model.
• A single, large and central fact table and one or more tables for each dimension.
• Dimension tables are normalized i.e.
  split dimension table data into additional tables

SnowFlake Schema
Fact Table:         Store Key, Product Key, Period Key, Units, Price
Store Dimension:    Store Key, Store Name, City Key
City Dimension:     City Key, City, State, Region
Time Dimension:     Period Key, Year, Quarter, Month
Product Dimension:  Product Key, Product Desc
• Drawbacks: Time consuming joins, report generation slow

Fact Constellation
• Multiple fact tables share dimension tables.
• This schema is viewed as collection of stars hence called galaxy schema or fact constellation.
• Sophisticated application requires such schema.

Fact Constellation
Sales Fact Table:     Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table:  Shipper Key, Store Key, Product Key, Period Key, Units, Price
Product Dimension:    Product Key, Product Desc
Store Dimension:      Store Key, Store Name, City, State, Region

The Future of Data Warehousing
• Multibillion dollar business!
• New SQL commands
  – Performance
  – Efficient query processing via materialized views
• Better tools
  – Support Complex Analysis
  – Easy extraction, easy query formulation, better visualization of data
• Semistructured data
  – XML based data warehouses

Summary
• Why build a DW?
  – Business strategist can make a plan for organization to thrive
• DW is a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions.
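
Example: data transformation
The Integration slides describe turning input streams with different descriptions, encodings, units and key formats into one consistent format. A minimal Python sketch of that step follows. The source names, the function name `transform`, and the canonical forms ("IBM", "M"/"F", centimeters, integer keys) are assumptions made for illustration, and the pairing of each encoding and unit with a particular source is inferred from the order of the slide's table rather than stated by it.

```python
# Canonical forms below are assumptions for this sketch, not part of the slides.
CUSTOMER_ALIASES = {
    "I.B.M": "IBM",
    "IBM": "IBM",
    "International Business Machines": "IBM",
}

# Per-source sex encodings, mapped to one shared encoding ("M" / "F").
SEX_CODES = {
    "sales_voucher": {"1": "M", "2": "F"},
    "purchase_order": {"M": "M", "F": "F"},
    "inventory": {"X": "M", "Y": "F"},
}

# Centimeters per unit of cable length used by each source.
CM_PER_UNIT = {
    "sales_voucher": 1.0,     # centimeters
    "purchase_order": 91.44,  # yards
    "inventory": 2.54,        # inches
}


def transform(source, record):
    """Convert one source record into a single consistent format."""
    if source not in SEX_CODES:
        raise ValueError(f"unknown source: {source!r}")
    sex_code = str(record["sex"])
    if sex_code not in SEX_CODES[source]:
        # Cleansing: reject codes this source is not supposed to produce.
        raise ValueError(f"unknown sex code {sex_code!r} for {source}")
    name = record["customer_name"]
    return {
        # Unrecognized names pass through unchanged.
        "customer_name": CUSTOMER_ALIASES.get(name, name),
        "sex": SEX_CODES[source][sex_code],
        "cable_length_cm": float(record["cable_length"]) * CM_PER_UNIT[source],
        # Character(10), Integer and pic '99999999' keys all become int.
        "key": int(record["key"]),
    }


if __name__ == "__main__":
    # Made-up sample record; only the customer name comes from the slide.
    print(transform("sales_voucher", {
        "customer_name": "I.B.M", "sex": 1,
        "cable_length": 100, "key": "0000000042",
    }))
```

Raising an error on an unknown code is one possible answer to the slide's question about what to do when an error is detected. A real pipeline might log the record or route it back to the operational system instead.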
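
Example: star schema with roll-up and drill-down
The Star Schema slide lists one fact table and three dimension tables, and the OLAP slide mentions roll-up and drill-down. The sketch below builds that schema in an in-memory SQLite database using Python's standard `sqlite3` module and answers both kinds of query. The column names follow the slide. The table names, the sample rows, the helper names `build_warehouse` and `revenue_by`, and the reading of revenue as units × price are assumptions made for illustration.

```python
import sqlite3

# Columns follow the Star Schema slide; table names are hypothetical.
SCHEMA = """
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          city TEXT, state TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER,
                          quarter INTEGER, month INTEGER);
CREATE TABLE sales_fact  (store_key   INTEGER REFERENCES store_dim(store_key),
                          product_key INTEGER REFERENCES product_dim(product_key),
                          period_key  INTEGER REFERENCES time_dim(period_key),
                          units INTEGER, price REAL);
"""

# Whitelist of dimension attributes a report may group by.
LEVELS = {
    "region": "s.region", "state": "s.state", "city": "s.city",
    "year": "t.year", "quarter": "t.quarter", "month": "t.month",
}


def build_warehouse():
    """Create the star schema in memory and load made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?, ?, ?)", [
        (1, "Store A", "City A", "State A", "North"),
        (2, "Store B", "City B", "State B", "South"),
    ])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)", [
        (1, "Product X"), (2, "Product Y"),
    ])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
        (1, 2023, 1, 1), (2, 2023, 2, 4), (3, 2024, 1, 1),
    ])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
        (1, 1, 1, 10, 2.0), (1, 2, 1, 5, 4.0),
        (2, 1, 2, 3, 2.0), (2, 1, 3, 7, 2.0),
    ])
    return conn


def revenue_by(conn, *levels):
    """Sum units * price grouped by the given dimension attributes."""
    cols = ", ".join(LEVELS[name] for name in levels)  # KeyError if not whitelisted
    sql = f"""
        SELECT {cols}, SUM(f.units * f.price)
        FROM sales_fact f
        JOIN store_dim s ON f.store_key = s.store_key
        JOIN time_dim  t ON f.period_key = t.period_key
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = build_warehouse()
    print(revenue_by(conn, "region", "year"))             # roll-up
    print(revenue_by(conn, "region", "year", "quarter"))  # drill-down
```

Each report joins the fact table directly to the dimensions it needs, one join per dimension. In the snowflake variant, grouping by region would need an extra join through the City Dimension, which is the join cost the Drawbacks bullet refers to.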