Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Intro. to Data Warehouse รศ.ดร. วรพจน์ กรี สุระเดช Worapoj Kreesuradej, Ph.D. Associate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Web: www.it.kmitl.ac.th/dme Email: [email protected] Book Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001. Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002. Definition of DW “A collection of integrated, subjectoriented databases designed to supply the information required for decisionmaking.” - W. Inmon A decision support database that is maintained separately from the organization’s operational databases. A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc. R. Kimball’s definition of a DW A data warehouse is a copy of transactional data specifically structured for querying and analysis. Problem: Data Management in Large Enterprises Vertical fragmentation of informational systems Result of application (user)-driven development of operational systems Sales Planning Suppliers Num. Control Stock Mngmt Debt Mngmt Inventory ... ... ... Sales Administration Finance Manufacturing ... Problem: Data Management in Large Enterprises Two Approaches for accessing data: Query-Driven (Lazy) ? Warehouse (Eager) Source Source The Need for DW Query-driven (lazy, on-demand) Clients Integration System Metadata ... Wrapper Source Wrapper Source Wrapper ... Source Disadvantages of QueryDriven Approach Delay in query processing Inefficient and potentially expensive for frequent queries Competes with local processing at sources The Warehousing Approach Information integrated in advance Stored in wh for direct querying and analysis Extractor/ Monitor Source Clients Data Warehouse Integration System Metadata ... Extractor/ Monitor Source Extractor/ Monitor ... Source Advantages of Warehousing Approach High query performance Doesn’t interfere with local processing at sources Information copied at warehouse Can modify, annotate, summarize, restructure, etc. Can store historical information Security, no auditing Characteristics of DW Subject oriented Data are organized by how users refer to it Integrated Non-volatile Time variant Inconsistencies are removed in both nomenclature and conflicting information; (i.e. data are ‘clean’) Read-only data. Data do not change over time. Data are time series, not current status Subject Oriented Data Warehouse is designed around “subjects” rather than processes A company may have Retail Sales System Outlet Sales System Catalog Sales System DW will have a Sales Subject Area Subject Oriented OLTP Systems Retail Sales System Outlet Sales System Catalog Sales System Data Warehouse Sales Subject Area Subject-Oriented Sales Information Integrated Heterogeneous Source Systems Need to Integrate source data For Example: Product codes could be different in different systems Arrive at common code in DW Integrated Information integrated in advance Stored in DW for direct querying and analysis Clients Data Warehouse Integration System Metadata ... Extractor/ Monitor Source Extractor/ Monitor Source Extractor/ Monitor ... Source Non-Volatile Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data. Non-Volatile(Read-Mostly) Write USER OLTP Read USER DW Read Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 510 years) Time Variant Most business analysis has a time component Sales 2001 Trend Analysis (historical data is required) 2002 2003 2004 Data Warehousing Process Overview Data Warehousing Process Overview The major components of a data warehousing process Data sources Data extraction Data loading Comprehensive Database /Data Store Data Mart Metadata Middleware tools /information delivery tools ETL • Data Extraction • Data Cleaning and Transformation • Convert from legacy/host format to warehouse format Load Sort, summarize, consolidate, compute views, check integrity, build indexes, partition The ETL Process Source Systems Staging Area Extract Transform Load DW Database Data Staging Area • A storage area where extracted data is • • • • • cleaned, transformed and deduplicated. Initial storage for data Need not be based on Relational model Mainly sorting and Sequential processing Does not provide data access to users Analogy – kitchen of a restaurant ETL Process Issues & Challenges • Consumes 70-80% of project time • Heterogeneous Source Systems • Little or no control over source systems • Source systems scattered • Different currencies, measurement units • Ensuring data quality Comprehensive Database /Data Store Mostly a relational DB Oracle, DB2, Sybase, SQL Server New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing) Data Warehouse Design OLTP Systems are Data Capture Systems “DATA IN” systems DW are “DATA OUT” systems OLTP DW Dimensional Modeling Facts are stored in FACT Tables Dimensions are stored in DIMENSION tables Dimension tables contains textual descriptors of business Fact and dimension tables form a Star Schema “BIG” fact table in center surrounded by “SMALL” dimension tables Star Schema CUSTOMER TIME # * * * * * * * * * * TIME_KEY ORD ERD ATE D AY_ OF_ WEEK D AY_ NU MBER_IN_MONTH D AY_ NU MBER_IN_YEAR WEEK_N UMBER MON TH QUARTER H OL IDAY_FL AG FISC AL_ YEAR FISC AL_ QUARTER referenced by referenced by SALES reference # # # * * * reference TIME_KEY PRODUC T_KEY CUSTOMER _KEY PRIC E QUANTITY SALES reference referenced by PRODUCT # * * * # * * * * PRODUC T_KEY PID PNAME PCN AME CUSTOMER _KEY C ID C NAME STATE C ITY Star Schema Data mart Data mart = subset of DW for community users, e.g. accounting department Sometimes exist as Multidimensional Database Info mart = summarized data + report for community users Meta Data Data about data Needed by both information technology personnel and users IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc. Information Delivery Tools Tools Query & reporting OLAP Data mining, visualization, segmentation, clustering New developments: text mining, web mining & personalization Mining multimedia data Information Delivery Tools Commercial tools Crystal Report, Impromptu, WebFocus Increasingly common mode of delivery: Web-enabled Data Warehouse Architecture Data Flow Architecture System Architecture Data Flow Architecture Data Flow Architecture Data Flow Architecture Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files MDB=Multidimensional databases System Architectures Three parts of the data warehouse The data warehouse that contains the data and associated software Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse Client (front-end) software that allows users to access and analyze data from the warehouse System Architectures System Architectures System Architecture System Architecture Data Warehouse Development Data warehouse development approaches Inmon Model: EDW approach, Enterprisewide warehouse, top down Kimball Model: Data mart approach, Data mart, bottom up Which model is best? There is no one-size-fits-all strategy to data warehousing When properly executed, both result in an enterprise-wide data warehouse, but with different architectures The Data Mart Strategy The most common approach Begins with a single mart and architected marts are added over time for more subject areas Relatively inexpensive and easy to implement Can be used as a proof of concept for data warehousing Can perpetuate the “silos of information” problem Can postpone difficult decisions and activities Requires an overall integration plan The Enterprise-wide Strategy A comprehensive warehouse is built initially An initial dependent data mart is built using a subset of the data in the warehouse Additional data marts are built using subsets of the data in the warehouse Like all complex projects, it is expensive, time consuming, and prone to failure When successful, it results in an integrated, scalable warehouse DW Lifecycle (Ralph Kimball ) Data Warehouse Development Some best practices for implementing a data warehouse (Weir, 2002): Project must fit with corporate strategy and business objectives There must be complete buy-in to the project by executives, managers, and users It is important to manage user expectations about the completed project The data warehouse must be built incrementally Build in adaptability Data Warehouse Development Some best practices for implementing a data warehouse (Weir, 2002): The project must be managed by both IT and business professionals Develop a business/supplier relationship Only load data that have been cleansed and are of a quality understood by the organization Do not overlook training requirements Be politically aware