Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An Introduction to Data Warehousing 1 Business Intelligence Now,if the Estimates made before a Battle indicate Victory,it is because careful calculations show that your conditions are more favorable than those of your enemy;if they indicate defeat ,it is because careful calculations show that the favorable conditions for a Battle are fewer.With more careful calculations one can win ; with less one cannot. How much chance of Victory has one who makes no calculations at all !! --- Sun Tzu , The Art of War Business these days are ,war minus shooting. -Anonymous Course Roadmap • Introduction to Datawarehousing • Difference between Operational System and DataWarehouse • Emergence of Decision Support Systems • DataWarehouse Theoretical Architecture • DataWarehouse Technical Architecture • DataWarehouse Bus Architecture • Data Modelling concepts • E-R Modelling for OLTP System • Dimensional Modelling for a Datawarehouse • Scheme generation for Datawarehouse • Star Scheme Design • Snowflake Scheme Design • Key aspects in designing the Dimensional Model • Granularity with respect to the Fact Table in the Schemas • Conformed Facts,Dimensions Course Roadmap • Fact less Fact Tables,Aggregate Fact Tables • Out Trigger Entities in the Schemas • Types of Relationships to be maintained between Facts and Dimensions • Dependencies while generating Physical Scheme for a DataWarehouse • Case Study of design of DataWarehouse for an existing ERmodel Objectives At the end of this session, you will know : – What is Data Warehousing – The evolution of Data Warehousing – Need for Data Warehousing – OLTP Vs Warehouse Applications – Data marts Vs Data Warehouses – Operational Data Stores – Overview of Warehouse Architecture Objectives At the end of this lesson, you will know : – Data Warehouse Architectures – Components of Data Warehousing Architecture – An overview of each of the components – Considerations for Data Warehouse Design – Common mistakes in Warehouse designs – An overview of Warehouse on the web What is a DataWarehouse ? What is a Data Warehouse ? A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. - WH Inmon WH Inmon - Regarded As Father Of Data Warehousing Subject-Oriented- Characteristics of a Data Warehouse Operational Data Warehouse Leads Prospects Customers Products Quotes Orders Regions Time Focus is on Subject Areas rather than Applications Integrated - Characteristics of a Data Warehouse Appl A - m,f Appl B - 1,0 Appl C - male,female Appl A - balance dec fixed (13,2) Appl B - balance pic 9(9)V99 Appl C - balance pic S9(7)V99 comp-3 m,f balance dec fixed (13,2) Appl A - bal-on-hand Appl B - current-balance Appl C - cash-on-hand Current balance Appl A - date (julian) Appl B - date (yymmdd) Appl C - date (absolute) date (julian) Integrated View Is The Essence Of A Data Warehouse Non-volatile - Characteristics of a Data Warehouse insert change Data Warehouse Operational delete insert load replace change Data Warehouse Is Relatively Static In Nature read only access Time Variant - Characteristics of a Data Warehouse Operational Current Value data • time horizon : 60-90 days Data Warehouse Snapshot data • time horizon : 5-10 years •data warehouse stores historical data Data Warehouse Typically Spans Across Time Alternate Definitions A collection of integrated, subject oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time - Imhoff Alternate Definitions Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support - Babcock Evolution of Data Warehousing 1960 - 1985 : MIS Era • Unfriendly • Slow • Dependent on IS programmers • Inflexible • Analysis limited to defined reports Focus on Reporting Evolution of Data Warehousing 1985 - 1990 : Querying Era Queries that are formulated by the user on the spur of the moment • Adhoc, unstructured access to corporate data • SQL as interface not scalable • Cannot handle complex analysis Focus on Online Querying Evolution of Data Warehousing 1990 - 20xx : Analysis Era • Trend Analysis • What If ? • Cross Dimensional Comparisons • Statistical profiles • Automated pattern and rule discovery Focus on Online Analysis Need for Data Warehousing Better business intelligence for end-users Reduction in time to locate, access, and analyze information Consolidation of disparate information sources Strategic advantage over competitors Faster time-to-market for products and services Replacement of older, less-responsive decision support systems Reduction in demand on IS to generate reports Business Queries Typical Business Queries Which product generated maximum revenue over last two quarters in a chosen geographical region, city wise, relative to the previous version of product, compared with the plan What percent of customer procures product A with B in a chosen region, broken down by city, season, and income group OLTP Systems Vs Data Warehouse Remember Between OLTP and Data Warehouse systems users are different data content is different, data structures are different hardware is different Understanding The Differences Is The Key OLTP Vs Warehouse Operational System Data Warehouse Transaction Processing Query Processing Predictable CPU Usage Random CPU Usage Time Sensitive History Oriented Operator View Managerial View Normalized Efficient Denormalized Design for Design for TP Query Processing OLTP Vs Warehouse Operational System Data Warehouse Designed for Atmocity, Consistency, Isolation and Durability Designed for quite or static database Organized by transactions (Order, Input, Inventory) Organized by subject (Customer, Product) Relatively smaller database Large database size Many concurrent users Relatively few concurrent users Volatile Data Non Volatile Data OLTP Vs Warehouse Operational System Data Warehouse Stores all data Stores relevant data Performance Sensitive Less Sensitive to performance Not Flexible Flexible Efficiency Effectiveness Processing Power Capacity Planning Time of day Processing Load Peaks During the Beginning and End of Day Examples Of Some Applications Manufacturers Retailers Target Marketing Market Segmentation Budgeting Credit Rating Agencies Financial Reporting and Consolidation Market Basket Analysis - POS Analysis Fraud Management Profitability Management Event tracking Customers Do we need a separate database ? OLTP and data warehousing require two very differently configured systems Isolation of Production System from Business Intelligence System Significant and highly variable resource demands of the data warehouse Cost of disk space no longer a concern Production systems not designed for query processing Data Marts Enterprise wide data warehousing projects have a very large cycle time Getting consensus between multiple parties may also be difficult Departments may not be satisfied with priority accorded to them Sometimes individual departmental needs may be strong enough to warrant a local implementation Application/database distribution is also an important factor Data Marts Subject or Application Oriented Business View of Warehouse » Finance, Manufacturing, Sales etc. » Smaller amount of data used for Analytic Processing » Address a single business process A Logical Subset of The Complete Data Warehouse Data Warehouse and Data Mart Data Warehouse Data Marts Scope Application Neutral Centralized, Shared Cross LOB/enterprise Data Perspective Historical Detailed data Some summary Specific Application Requirement LOB, department Business Process Oriented Detailed (some history) Summarized Subjects Multiple subject areas Single Partial subject Multiple partial subjects OLTP snapshots Data Warehouse and Data Mart Data Warehouse Data Marts Data Sources Many Operational/ External Data Few Operational, external data OLTP snapshots Implement Time Frame 9-18 months for first stage Multiple stage implementation Flexible, extensible Durable/Strategic Data orientation 4-12 months Characteristics Restrictive, non extensible Short life/tactical Project Orientation Warehouse or Mart First ? Data Warehouse First Data Mart first Expensive Relatively cheap Large development cycle Delivered in < 6 months Change management is difficult Easy to manage change Difficult to obtain continuous corporate support Can lead to independent and incompatible marts Technical challenges in building large databases Cleansing, transformation, modeling techniques may be incompatible Different kinds of Information Needs Current Recent Historical Is this medicine available in stock What are the tests this patient has completed so far Has the incidence of Tuberculosis increased in last 5 years in Southern region Operational Data Store - Definition Can I see credit report from Accounts, Sales from marketing and open order report from order entry for this customer Data from multiple sources is integrated for a subject A subject oriented, integrated, volatile, current valued data store containing only corporate Identical queries may give different results at different times. Supports analysis requiring current data detailed data Data stored only for current period. Old Data is either archived or moved to Data Warehouse Operational Data Store Increasingly becoming integrated with the data warehouse Are nothing but more responsive real time data warehouses Data Mining has anyway forced Data Warehouses to store transactional level data OLTP Vs ODS Vs DWH Characteristic OLTP ODS Data Warehouse Operating Analysts Managers and Personnel analysts Individual records, Individual records, Set of records, Data access transaction driven transaction or analysis driven analysis driven Current, real-time Current and near- Historical Data content current Detailed and lightly Summarized and Data granularity Detailed summarized derived Subject-oriented Subject-oriented Data organization Functional Audience Data quality All application specific detailed data needed to support a business activity All integrated data Data relevant to needed to support a management business activity information needs OLTP Vs ODS Vs DWH Characteristic OLTP ODS Data Warehouse Data redundancy Somewhat redundant with operational databases Managed redundancy Data stability Non-redundant within system; Unmanaged redundancy among systems Dynamic Data update Field by field Field by field Controlled batch Data usage Highly structured, repetitive Somewhat structured, some analytical Database size Moderate Moderate Highly unstructured, heuristic or analytical Large to very large Somewhat stable Dynamic Stable Database structure stability Somewhat dynamic Static OLTP Vs ODS Vs DWH Characteristic OLTP ODS Data Warehouse Development methodology Requirements driven, structured Data driven, evolutionary Operational priorities Performance and availability Data driven, somewhat evolutionary Availability Philosophy Support day-today operation Predictability Stable Response time Sub-second Support day-to-day decisions & operational activities Mostly stable, some Unpredictable unpredictability Seconds to minutes Seconds to minutes Return set Small amount of data Small to medium amount of data Access flexibility and end user autonomy Support managing the enterprise Small to large amount of data Typical Data Warehouse Architecture Data Marts EIS /DSS Select Metadata Query Tools Extract Transform Integrate Maintain Data Warehouse OLAP/ROLAP Web Browsers Operational Systems/Data Data Preparation Middleware/ API Multi-tiered Data Warehouse without ODS Data Mining Typical Data Warehouse Architecture Data Marts Metadata Metadata Select Select Extract Extract Transform Integrate ODS Transform Data Warehouse Load Maintain Operational Systems/Data Data Preparation Data Preparation Multi-tiered Data Warehouse with ODS Benefits of DWH These capabilities empower the corporate... To formulate effective business, marketing and sales strategies. To precisely target promotional activity. To discover and penetrate new markets. To successfully compete in the marketplace from a position of informed strength. To build predictive rather than retrospective models. Warehouse Architecture - 1 EIS /DSS Metadata Query Tools Select Extract Transform Integrate Data Warehouse OLAP/ROLAP Maintain Web Browsers Operational Systems/Data Data Preparation Middleware/ API Data Mining Enterprise Data Warehouse Warehouse Architecture - 2 Metadata EIS /DSS Data Mart Select Metadata Query Tools Extract Transform Data Mart Integrate Maintain OLAP/ROLAP Metadata Web Browsers Operational Systems/Data Data Mart Data Preparation Middleware/ API Data Mining Single Department Data Mart Warehouse Architecture - 3 Data Marts EIS /DSS Metadata Query Tools Select Extract Transform Data Warehouse Integrate OLAP/ROLAP Maintain Web Browsers Operational Systems/Data Data Preparation Operational Data Store Multi-tiered Data Warehouse Middleware/ API Data Mining Data Warehouse Architectures There are three schools of thought about DW architectures – One supports Dimensional Modeling all through (Ralph Kimball) – Second supports ER for Data Warehouse and Star Schemas for Data Marts – Third supports ER model for DW (NCR) Kimball’s View Operational Systems Presentation Server Staging Area Each Star is a Data Mart and has both summary and detail data LAN Data Warehouse Server Processes •Extract •Scrubbing •Transformation •Load Jobs •Aggregation Jobs •Replication •Monitoring •Management •Meta Data Repository •Meta Data Population •Meta Data Maintenance DW is sum total of all Data Marts DW Bus using Conformed Dimensions Multiple Data Marts With Conformed Dimensions Inmon’s View Operational Systems Staging Area Data Warehouse Data Marts LAN Data Warehouse Server Processes •Extract •Scrubbing •Transformation •Load Jobs •Aggregation Jobs •Replication •Monitoring •Management •Meta Data Repository •Meta Data Population •Meta Data Maintenance Detail Data in ER format Summarized Data in Star formats Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema) Components of a Data Warehouse Architecture Source Databases Data extraction/transformation/load (ETL) tool Data warehouse maintenance and administration tools Data modeling tool or interface to external data models Warehouse databases End-user data access and analysis tools Components of a Data Warehouse Architecture Data Cleansing Tools Source Databases Data Modeling Tool ETL Tool Central Metadata ROLAP Engine Data Access and Analysis Tools -Managed Query Central Warehouse (RDBMS) RDBMS -Desktop OLAP -ROLAP -MOLAP Local meta data Warehouse Admin Tool - Data Mining MDDB Architected Datamarts Warehouse Databases Data Warehouse Is Not Just About Data... But Tools Too Source Databases - Characteristics Legacy, relational, text or external sources Designed for high-speed transaction processing Real-time, current, volatile data Fast response for larger numbers of concurrent users Many short transactions Update-intensive; modifications by row Inquiry-oriented; access by keys High integrity, security, recoverability Source data is often inconsistent and poorly modeled Data Cleaning Tools To clean data at the source Clean up source data in-place on the host Business rule discovery tools which analyse the source data and write cleaning rules based on lexical analysis and AI techniques Poorly integrated with data warehousing tools ETL tools have limited yet adequate data cleansing functionality