DATA WAREHOUSE
Elsayed Hemayed
Data Mining Course

Outline
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure

Data, Data Everywhere
- I can’t find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can’t get the data I need
  - need an expert to get the data
- I can’t understand the data I found
  - available data poorly documented
- I can’t use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. [Barry Devlin]

What are the users saying...
- Data should be integrated across the enterprise
- Summary data has real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
(Figure labels: Data, Information.)

Warehouses are Very Large Databases
- Terabytes (10^12 bytes): Walmart, 24 Terabytes
- Petabytes (10^15 bytes): Geographic Information Systems
- Exabytes (10^18 bytes): National Medical Records
- Zettabytes (10^21 bytes): Weather images
- Yottabytes (10^24 bytes): Intelligence Agency Videos

Data Warehousing: It is a process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
- A decision support database maintained separately from the organization’s operational database

Why Separate Data Warehouse?
Performance
- Operational databases are designed and tuned for known transactions and workloads.
- Complex OLAP queries would degrade performance for operational transactions.
- Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Function
- Missing data: decision support requires historical data, which operational databases do not typically maintain.
- Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases and external sources.
- Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Key Definitions
- OLTP: On-Line Transaction Processing. Describes processing at operational sites.
- OLAP: On-Line Analytical Processing. Describes processing at the warehouse.
- “Business Intelligence” refers to reporting and analysis of data stored in the warehouse.
- The data warehouse is the foundation for business intelligence.
- “Data warehouse/business intelligence” (DW/BI) refers to the complete end-to-end system.

Explorers, Farmers and Tourists
- Tourists: browse information harvested by farmers
- Farmers: harvest information from known access paths
- Explorers: seek out the unknown and previously unsuspected rewards hiding in the detailed data

Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence

To summarize ...
- Operational (OLTP) systems are used to “run” a business
- The Data Warehouse (OLAP) helps to “optimize” the business

Data Warehouse vs. Data Marts
What comes first?

Data Mart vs. Data Warehouse
- A data mart is a specific, subject-oriented repository of data designed to answer specific questions.
- Usually, multiple data marts exist to serve the needs of multiple business units (sales, marketing, operations, collections, accounting, etc.).
- A data warehouse is a single organizational repository of enterprise-wide data across many or all subject areas.
- A data warehouse is an enterprise-wide collection of data marts.

From the Data Warehouse to Data Marts
(Figure: a progression from Data to Information through three levels: organizationally structured (the data warehouse), departmentally structured, and individually structured. Axis labels: Less to More; History, Normalized, Detailed.)

Data Warehouse and Data Marts
- Data marts (Sales, Finance, Mktg.): OLAP, lightly summarized, departmentally structured
- Data warehouse: organizationally structured, atomic, detailed data

Characteristics of the Departmental Data Mart
(Figure: Sales, Finance, and Mktg. data marts drawn from the data warehouse.)
- OLAP
- Small
- Flexible
- Customized by department
- Source is the departmentally structured data warehouse

Data Mart Centric
(Figure: Data Sources -> Data Marts -> Data Warehouse)

Problems with Data Mart Centric Solution
- If you end up creating multiple warehouses, integrating them is a problem

True Warehouse
(Figure: Data Sources -> Data Warehouse -> Data Marts)

Data Warehouse Architecture

Data Warehouse Architecture
(Figure components)
- Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data
- Extraction, Cleansing
- Optimized Loader
- Data Warehouse Engine
- Metadata Repository
- Analyze, Query

Implementing a Warehouse
- Monitoring: getting the data from the sources
- Data integration, cleansing, and loading
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...

Monitoring
- Source types: relational, flat file, IMS, WWW, news-wire, …
- Incremental vs. refresh

customer
id    name    address      city
53    joe     10 main      sfo
81    fred    12 main      sfo
111   sally   80 willow    la     (new)

Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Application-level monitoring

Monitoring Issues
- Frequency
  - periodic: daily, weekly, …
  - triggered: on “big” change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove and add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways

Refresh
- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh (refresh techniques)

When to Refresh?
- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless the warehouse requires current data (e.g., up-to-the-minute stock quotes)
- refresh policy set by the administrator based on user needs and traffic
- possibly different policies for different sources

How To Detect Changes
- Create a snapshot log table to record the ids of updated rows of source data, with a timestamp
- Detect changes by:
  - defining after-row triggers that update the snapshot log when the source table changes
  - using the regular transaction log to detect changes to source data

Data Integration Across Sources
(Figure: four source systems, Savings, Loans, Trust, and Credit card, feeding the warehouse.)
- Same data, different name
- Different data, same name
- Data found here, nowhere else
- Different keys, same data

Data Transformation Example
Application   Encoding        Pipeline unit   Balance field
appl A        m, f            cm              balance
appl B        1, 0            in              bal
appl C        x, y            feet            currbal
appl D        male, female    yds             balcurr

Data Integrity Problems
- Same person, different spellings
- Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names: Ahmed, Ahmad, Ahmaad, etc.
- Oct 6, 6 Oct (the same date written two ways)
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at point of sale
  - manual entry leads to mistakes
  - “in case of a problem use 9999999”

Data Extraction and Cleansing
- Extract data from existing operational and legacy data
- Issues:
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded

Scrubbing Data
- Sophisticated transformation tools used to improve the quality of data
- Clean data is vital for the success of the warehouse
- Example: Ahmed Aly, Ahmad Ali, Ahmaad Aly, Ahmad Aly, etc. are the same person
- Scrubbing tools:
  - Apertus: Enterprise/Integrator
  - Vality: IPE
  - Postal Soft

Data Loading
- After extracting, cleaning, validating, etc., the data needs to be loaded into the warehouse
- Issues:
  - huge volumes of data to be loaded
  - small time window available when the warehouse can be taken offline (usually nights)
  - when to build indexes and summary tables
  - allow system administrators to monitor, cancel, resume, and change load rates
  - recover gracefully: restart after failure from where you were, without loss of data integrity

Load Techniques
- Use SQL to append or insert new data
  - a record-at-a-time interface will lead to random disk I/Os
- Use a batch load utility
- Incremental versus full loads
- Online versus offline loads

Data Warehouse Structure

Data Warehouse Structure
- Subject orientation: customer, product, policy, account, etc.
- A subject may be implemented as a set of related tables.
  E.g., customer may be five tables.

Data Warehouse Structure
Time is part of the key of each table.
- base customer (1985-87): custid, from date, to date, name, phone, dob
- base customer (1988-90): custid, from date, to date, name, credit rating, employer
- customer activity (1986-89): monthly summary
- customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91): custid, activity date, amount, line item no, order no

Data Granularity in Warehouse
- Storing summarized data
  - reduces storage costs
  - reduces CPU usage
  - increases performance, since a smaller number of records has to be processed
  - is designed around traditional high-level reporting needs
- Tradeoff between the volume of data to be stored and detailed usage of the data

Granularity in Warehouse
- Some questions cannot be answered with summarized data
  - Did Ahmed call Aly last month? This cannot be answered if only the total duration of Ahmed's calls over a month is maintained and individual call details are not.
- Detailed data is too voluminous

Granularity in Warehouse
- The tradeoff is to have a dual level of granularity
  - Store summary data on disks: 95% of DSS processing is done against this data
  - Store detail on tapes: 5% of DSS processing is done against this data

Vertical Partitioning
- Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
- Frequently accessed: Acct. No, Balance (smaller table and so less I/O)
- Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address

Schema Design
- Database organization
  - must look like the business
  - must be recognizable by the business user
  - must be approachable by the business user
  - must be simple
- Schema types
  - Star schema
  - Fact constellation schema
  - Snowflake schema

Dimensional Modeling
- Fact table: sale (orderId, date, custId, prodId, storeId, qty, amt)
- Dimension table: product (prodId, name, price)
- Dimension table: store (storeId, city)
- Dimension table: customer (custId, name, address, city)

Fact Tables
- Contain the metrics resulting from a business process or measurement event, such as the sales ordering process or a service call event.
- Dimensional models should be structured around business processes and their associated data sources. This makes it possible to design identical, consistent views of data for all observers, regardless of which business unit they belong to, which goes a long way toward eliminating misunderstandings at business meetings.
- A fact table’s granularity should be set at the lowest, most atomic level captured by the business process. This allows for maximum flexibility and extensibility: business users will be able to ask constantly changing, free-ranging, and very precise questions.

Fact Table
- Central table
  - mostly raw numeric items
  - narrow rows, a few columns at most
  - large number of rows (millions to a billion)
  - accessed via dimensions

Dimension Tables
- Contain the descriptive attributes and characteristics associated with specific, tangible measurement events, such as the customer, product, or sales representative associated with an order being placed.
- Dimension attributes are used for constraining, grouping, or labeling in a query.
- Hierarchical many-to-one relationships are denormalized into single dimension tables.
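
As a minimal sketch of how a fact table is accessed via its dimensions, the following Python example builds the sale/product/store/customer schema from the Dimensional Modeling slide in an in-memory SQLite database. The column types, the function names `build_star_schema` and `sales_by_store_city`, and the choice of SQLite are assumptions made for illustration; the sample rows follow the deck's star schema example. Dimension attributes are used to label and group the result (store city, product name) and optionally to constrain it (customer city).

```python
import sqlite3


def build_star_schema(conn):
    """Create the sale fact table, three dimension tables, and sample rows."""
    conn.executescript(
        """
        CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
        CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                               address TEXT, city TEXT);
        -- Fact table: one narrow row per sale, with a key into each dimension.
        CREATE TABLE sale (
            orderId TEXT PRIMARY KEY,
            date    TEXT,
            custId  INTEGER REFERENCES customer(custId),
            prodId  TEXT    REFERENCES product(prodId),
            storeId TEXT    REFERENCES store(storeId),
            qty     INTEGER,
            amt     INTEGER
        );
        """
    )
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                     [("p1", "bolt", 10), ("p2", "nut", 5)])
    conn.executemany("INSERT INTO store VALUES (?, ?)",
                     [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                     [(53, "joe", "10 main", "sfo"),
                      (81, "fred", "12 main", "sfo"),
                      (111, "sally", "80 willow", "la")])
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                     [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                      ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                      ("105", "3/8/97", 111, "p1", "c3", 5, 50)])


def sales_by_store_city(conn, customer_city=None):
    """Total sale amount per (store city, product name).

    If customer_city is given, only sales to customers living in that
    city are counted (a constraint on a dimension attribute).
    """
    sql = """
        SELECT store.city, product.name, SUM(sale.amt)
        FROM sale
        JOIN store    ON store.storeId   = sale.storeId
        JOIN product  ON product.prodId  = sale.prodId
        JOIN customer ON customer.custId = sale.custId
        WHERE (? IS NULL OR customer.city = ?)
        GROUP BY store.city, product.name
        ORDER BY store.city, product.name
    """
    return conn.execute(sql, (customer_city, customer_city)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    for row in sales_by_store_city(conn):
        print(row)
```

The query never filters on the fact table's own columns: every constraint and grouping key comes from a dimension, which is the access pattern the Fact Table slide describes.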

Dimension Table
- Define the business in terms already familiar to users
  - wide rows with lots of descriptive text
  - small tables (about a million rows)
  - joined to the fact table by a foreign key
  - heavily indexed
  - typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Star Schema
- A single fact table and multiple dimension tables
(Figure: a central fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables time, cust, prod, and city.)

Star Schema Example

product
prodId   name   price
p1       bolt   10
p2       nut    5

sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

customer
custId   name    address      city
53       joe     10 main      sfo
81       fred    12 main      sfo
111      sally   80 willow    la

store
storeId   city
c1        nyc
c2        sfo
c3        la

Star Schema Example
- sale (orderId, date, custId, prodId, storeId, qty, amt)
- product (prodId, name, price)
- store (storeId, city)
- customer (custId, name, address, city)

Snowflake Schema
- The tables which describe the dimensions are normalized.
- Easy to maintain and saves storage
(Figure: a central fact table (date, custno, prodno, cityname, ...) with the dimension tables time, cust, prod, and city, and a further reg table.)

Snowflake Schema Example

store
storeId   cityId   tId   mgr
s5        sfo      t1    joe
s7        sfo      t2    fred
s9        la       t1    nancy

sType
tId   size    location
t1    small   downtown
t2    large   suburbs

city
cityId   pop   regId
sfo      1M    north
la       5M    south

region
regId   name
north   cold region
south   warm region

Fact Constellation
- Multiple fact tables that share many dimension tables
- Booking and Checkout may share many dimension tables in the hotel industry
(Figure: fact tables Booking and Checkout; dimension tables Hotels, Travel Agents, Customer, Promotion, Room Type.)

Hybrid Approach
- If a dimension is very sparse (i.e., most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database, and snowflaking may be appropriate.
- In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.

Partitioning
- Breaking data into several physical units that can be handled separately
- Not a question of whether to do it in data warehouses but how to do it
- Granularity and partitioning are key to effective implementation of a warehouse

Why Partition?
- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring

Criterion for Partitioning
- Typically partitioned by
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of the above

Query Processing
- Indexing
- Parallel query processing
- Pre-computed views/aggregates
- SQL extensions
  - Extended family of aggregate functions
    - rank (top 10 customers)
    - percentile (top 30% of customers)
    - median, mode
  - Reporting features
    - running total, cumulative totals

Metadata Repository
- Administrative metadata
  - source databases and their contents
  - gateway descriptions
  - warehouse schema, view and derived data definitions
  - dimensions, hierarchies
  - pre-defined queries and reports
  - data mart locations and contents
  - data partitions
  - data extraction, cleansing, transformation rules, defaults
  - data refresh and purging rules
  - user profiles, user groups
  - security: user authorization, access control

Metadata Repository (2)
- Business data
  - business terms and definitions
  - ownership of data
  - charging policies
- Operational metadata
  - data lineage: history of migrated data and sequence of transformations applied
  - currency of data: active, archived, purged
  - monitoring information: warehouse usage statistics, error reports, audit trails

References
- W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996.
- W.H. Inmon, J.D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997.
- Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997.

Summary
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure
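
Supplementary sketch for the Query Processing slide: the extended aggregates it lists (rank, percentile, median, mode, running total) can be illustrated in plain Python to show what each one computes. This is only an illustration under stated assumptions, not how a warehouse engine implements them: the function names `top_n`, `top_fraction`, and `running_total` are hypothetical, the rule of keeping at least one customer in `top_fraction` is an assumption, and the input is assumed to be a small in-memory mapping from customer to total sales.

```python
from itertools import accumulate
from statistics import median, mode


def top_n(totals, n):
    """Rank: the n (customer, total) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def top_fraction(totals, fraction):
    """Percentile: customers in the top `fraction` of the ranking.

    Assumption for this sketch: round down, but keep at least one customer.
    """
    ranked = top_n(totals, len(totals))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def running_total(amounts):
    """Running (cumulative) total over amounts in the order given."""
    return list(accumulate(amounts))


if __name__ == "__main__":
    # Hypothetical totals per customer, for illustration only.
    totals = {"joe": 23, "fred": 7, "sally": 50, "ann": 15}
    print(top_n(totals, 2))
    print(top_fraction(totals, 0.5))
    print(median(totals.values()), mode([1, 2, 2, 3]))
    print(running_total([12, 11, 50]))
```

In a warehouse these would typically be evaluated inside the database engine, close to the data, rather than in application code.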