* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Survey
Document related concepts
Transcript
13 Chapter 13 The Data Warehouse 1 13 The Need for Data Analysis • Managers must be able to track daily transactions to evaluate how the business is performing • By tapping into the operational database, management can develop strategies to meet organizational goals • Data analysis can provide information about short-term tactical evaluations and strategies 2 13 Solving Business Problems and Adding Value with Data Warehouse-Based Solutions 3 13 Solving Business Problems and Adding Value with Data Warehouse-Based Solutions (continued) 4 13 Decision Support Systems • Methodology (or series of methodologies) designed to extract information from data and to use such information as a basis for decision making • Decision support system (DSS): – Arrangement of computerized tools used to assist managerial decision making within a business – Usually requires extensive data “massaging” to produce information – Used at all levels within an organization – Often tailored to focus on specific business areas – Provides ad hoc query tools to retrieve data and to display data in different formats 5 13 Decision Support Systems (continued) • Composed of four main components: – Data store component • Basically a DSS database – Data extraction and filtering component • Used to extract and validate data taken from operational database and external data sources – End-user query tool • Used to create queries that access database – End-user presentation tool • Used to organize and present data 6 13 Main Components of a Decision Support System (DSS) 7 13 Transforming Operational Data Into Decision Support Data 8 13 Contrasting Operational and DSS Data Characteristics 9 13 The Data Warehouse • Integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making 10 13 A Comparison of Data Warehouse and Operational Database Characteristics 11 13 Creating a Data Warehouse 12 13 The ETL Process • • • • Capture/Extract Scrub or data cleansing Transform Load and Index ETL = Extract, transform, and load 13 Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse 13 Figure 11-10: Steps in data reconciliation Static extract = capturing a snapshot of the source data at a point in time Incremental extract = capturing changes that have occurred since the last static extract 14 Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality 13 Figure 11-10: Steps in data reconciliation (cont.) Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data 15 13 Cleanse: • Process to identify erroneous data, not to fix them • Fixes are made at the source Scrubbing: A technique using pattern recognition and other AI techniques to upgrade the quality of data 16 Transform = convert data from format of operational system to format of data warehouse 13 Figure 11-10: Steps in data reconciliation (cont.) Record-level: Selection–data partitioning Joining–data combining Aggregation–data summarization Field-level: single-field–from one field to one field multi-field–from many fields to one, or one field to many 17 Load/Index= place transformed data into the warehouse and create indexes 13 Figure 11-10: Steps in data reconciliation (cont.) Refresh mode: bulk rewriting of target data at periodic intervals Update mode: only changes in source data are written to data warehouse 18 Figure 11-11: Single-field transformation 13 In general–some transformation function translates data from old form to new form Algorithmic transformation uses a formula or logical expression Table lookup–another approach, uses a separate table keyed by source record code 19 Figure 11-12: Multifield transformation 13 M:1–from many source fields to one target field 1:M–from one source field to many target fields 20 Derived Data 13 • Objectives – – – – – Ease of use for decision support applications Fast response to predefined user queries Customized data for particular target audiences Ad-hoc query support Data mining capabilities • Characteristics – Detailed (mostly periodic) data – Aggregate (for summary) – Distributed (to departmental servers) Most common data model = star schema (also called “dimensional model”) 21 13 Star Schemas • Data modeling technique used to map multidimensional decision support data into a relational database • Creates the near equivalent of a multidimensional database schema from the existing relational database • Yield an easily implemented model for multidimensional data analysis, while still preserving the relational structures on which the operational database is built • Has four components: facts, dimensions, attributes, and attribute hierarchies 22 Figure 11-13 Components of a star schema 13 Fact tables contain factual or quantitative data 1:N relationship between dimension tables and fact tables Dimension tables are denormalized to maximize performance Dimension tables contain descriptions about the subjects of the business Excellent for ad-hoc queries, but bad for online transaction processing 23 Figure 11-14 Star schema example 13 Fact table provides statistics for sales broken down by product, period and store dimensions 24 Figure 11-15 Star schema with sample data 13 25 13 Size of fact table assume: Total number of stores=1000 Total number of products=10,000 Total number of periods= 24 (two years) Assume 50% of products record sales Then total rows in fact table 1000stores* 5000 active products)*24 months =120,000,000 rows 26 13 Size of “fact” table Assume there are 6 fields each 4 bytes long, then total size 120,00,000* 6 fields* 4bytes/field =2,880,000,000 (2.88 gigabytes) If instead of monthly, you record daily data Multiply above by 30 (30 days/per month) 27 13 Online Analytical Processing • Advanced data analysis environment that supports decision making, business modeling, and operations research • OLAP systems share four main characteristics: – Use multidimensional data analysis techniques – Provide advanced database support – Provide easy-to-use end-user interfaces – Support client/server architecture 28 13 Operational vs. Multidimensional View of Sales 29 13 Another example: Simple Star Schema 30 13 Possible Attributes for Sales Dimensions 31 13 Three-Dimensional View of Sales 32 13 Slice and Dice View of Sales 33 13 Location Attribute Hierarchy 34 13 Star Schema for Sales 35 13 Orders Star Schema 36 13 Normalized Dimension tables 37 13 Multiple Fact Tables 38 On-Line Analytical Processing (OLAP) Tools 13 • The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques • Relational OLAP (ROLAP) – Traditional relational representation • Multidimensional OLAP (MOLAP) – Cube structure • OLAP Operations – Cube slicing–come up with 2-D view of data – Drill-down–going from summary to more detailed views 39 Figure 11-23 Slicing a data cube 13 40 Figure 11-24 Example of drill-down Starting with summary data, users can obtain details for particular cells Summary report 13 Drill-down with color added 41 13 Implementing a Data Warehouse • Numerous constraints: – Available funding – Management’s view of the role played by an IS department and of the extent and depth of the information requirements – Corporate culture • No single formula can describe perfect data warehouse development 42 13 Factors Common to Data Warehousing • Data warehouse is not a static database • Dynamic framework for decision support that is always a work in progress • Data warehouse data cross departmental lines and geographical boundaries • Must satisfy: – Data integration and loading criteria – Data analysis capabilities with acceptable query performance – End-user data analysis needs • Apply database design procedures 43 13 Data Mining • Tools that: – analyze data – uncover problems or opportunities hidden in data relationships, – form computer models based on their findings, and then – use the models to predict business behavior • Require minimal end-user intervention 44 13 Extraction of Knowledge From Data 45 13 Data-Mining Phases 46 13 A Sample of Current Data Warehousing and Data-Mining Vendors 47 Data Mining and Visualization • • 13 Knowledge discovery using a blend of statistical, AI, and computer graphics techniques Goals: – Explain observed events or conditions – Confirm hypotheses – Explore data for new or unexpected relationships • Techniques – – – – – – – – – • Statistical regression Decision tree induction Clustering and signal processing Affinity Sequence association Case-based reasoning Rule discovery Neural nets Fractals Data visualization–representing data in graphical/multimedia formats for analysis 48 13 Summary • Data analysis is used to derive and interpret information from data • Decision support is a methodology designed to extract information from data and to use such information as a basis for decision making • Decision support system is an arrangement of computerized tools used to assist managerial decision making within a business • Data warehouse is an integrated, subjectoriented, time-variant, nonvolatile database that provides support for decision making 49 13 Summary (continued) • Online analytical processing is an advanced data analysis environment that supports decision making, business modeling, and operations research • Star schema is a data-modeling technique used to map multidimensional decision support data into a relational database • The implementation of any company-wide information system is subject to conflicting organizational and behavioral factors 50 13 Summary (continued) • Data mining automates analysis of operational data with the intention of finding previously unknown data characteristics, relationships, dependencies, and/or trends • Data warehouse is storage location for decision support data 51