COT5230 Data Mining
Week 9
The Data Warehouse (DW) and Business Intelligence (BI)
MONASH AUSTRALIA’S INTERNATIONAL UNIVERSITY

Lecture Outline
Overview of Data Warehousing
Data Warehouse Architecture
Overview of Business Intelligence (BI)
OLAP

What is a DW?
A data store to support data analysis or decision support
  – Decision support:
    » a methodology to extract information from data
  – Decision support system:
    » an arrangement of computerized tools to assist in managerial decision making
Answers questions by combining historical operational data with a business data model that reflects business activity
Data may come from both operational and external sources
  – external data, e.g. industry average salaries

Data Warehouse Definitions - 1
The information in a DW is subject-oriented, nonvolatile, and of an historic nature, and so DWs tend to contain extremely large datasets
The purpose of the DW is to provide the tools and facilities to manage and deliver complete, timely, accurate, and understandable business information to authorized individuals for effective business decision making
DW implementation needs a company-wide effort that requires user involvement and commitment at all levels
A successful DW implementation tracks return on investment

Data Warehouse Definitions - 2
A DW is a concept, not a product
  – It is the compiling, assembling, and consolidating of application data common to user communities at a single logical point
Typical use includes ad hoc queries, “what if”, data matching, trend analysis and other sophisticated information functions
Warehouse data is typically extracted from OLTP systems
A DW can be described as a read-only database that provides users with access to consolidated, historic, or static data extracted
from operational databases, usually augmented with external data

Operational Data vs. the DW - 1
Integration
  – Data found within the DW is ALWAYS integrated, e.g.
    » encoding, measurements of attributes, etc. are standardized
Normalized vs. denormalized
  – Operational data is normalized
Timespan
  – Operational data is current
  – DW data is historical
Granularity
  – Operational data is at transaction level
  – DW data is at an aggregation level

Operational Data vs. the DW - 2
Dimensionality
  – data is clustered according to functional requirements, e.g. all orders to be delivered to a particular suburb
  – data analyst requires access to all dimensions
Use
  – DW is read only

MIS, or Before the DW
MIS: Management Information System
required detailed knowledge of the operational systems
no Business Information Directory
data quality is ad hoc
limited data integration from source systems
integration and querying performed by MIS specialists using 3+GL tools such as SAS or at best performing queries using SQL against images of unintegrated operational databases

Inmon’s 12 Rules - 1
DW and operational environments are separated
Integrated DW data
DW contains historical data
DW is snapshot data captured at particular point in time
DW data is subject-oriented

Inmon’s 12 Rules - 2
No online update
DW SDLC is data-driven
DW contains several levels of data - raw to summarized
Data sources are traced
Meta-data is a critical component
DW contains a charge back mechanism

DW Architecture
[Figure: DW architecture diagram. Labels in extraction order: Load; Authoritative Source; Source Systems; External systems; Extract / Enhance / Transform Layer; Copy mgt; Extract; Transform; Process once; Business rules; Consistency & controls; Data Warehouse; Customise; DataMarts; Separates data from application; Build data for appropriate datamart; Meets specific OLAP requirements; Delivery to user; Fully modelled & documented; Parallel process; Enterprise single image data view; Denormalize for specific use; Industry standard tools; Tailored applications where appropriate; Value add; Business Information Directory]

Source Systems/Authoritative Source
must first identify authoritative source data
Authoritative Source
  – atomic data from the creating/owning source system
data propagation must be subject to a delivery contract
data propagation is asynchronous
  – no reverse propagation
  – no periodic synchronization
delivery must have minimal impact on operational systems

Extract/Enhance/Transform Layer
must create integrated and standardized data
deduping process happens here
denormalize into a format for direct loading into the DW
cleanse
  – must remove semantic and syntactic inconsistencies
  – return invalid data to the source system for repair
requires a data quality process
simple business transformations
addition of surrogate keys and time variance

Handling Inserts/Deltas - 1
Scenarios
  – additions to a (1) New or (2) Existing partition
  – partitions are (1) Atomic or (2) Aggregates
New partition - atomic or aggregate
  – work off-line
  – do summation outside of database and use efficient tools, e.g.
    Syncsort or C
  – then SQL*LOADER

Handling Inserts/Deltas - 2
Updates to an existing partition
  – Atomic Partition
    » Unload, Sort, Reload or
    » Insert directly into DB - concurrency issues
  – Aggregate Partition
    [Diagram labels: R1, R2; X 1, X 2; X 3 - stored in database; R3; X 1]
    » Update directly to DW
    » Unload and update out of the database
    » Keep source data and re-sort and sum

The Data Warehouse
contains atomic data
Star Schema structure
  – contains
    » Facts
    » Dimensions
    » Attributes - Surrogate keys
    » Attribute Hierarchies
Key Issues
  – size
  – data retention period - YTD
  – backup and recovery
  – security

Star Schemas
a data modeling technique used to map decision support data into a relational database
this structure is based on the premise that highly normalized data structures do not serve advanced data analysis requirements well
[Figure: star schema. Fact table SALES (Cust#, Loc#, Prod#, SalesrepID) surrounded by DimA Customer, DimB Product, DimC Salesrep and DimD Location]

Snowflake Schemas
[Figure: snowflake schema. Fact table SALES (Prod#, SalesrepID) with DimA Customer, DimB Product, DimC Salesrep and DimD Location; additional tables Customer State, Customer Category and Customer Address]

Fact Tables
Facts measure something of interest to an enterprise
  – atomic level or transactional data
  – summarization will reduce volume but may lose information

CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000

CUST#  PROD#  SALESREP  DATE  COST
C100   P100   S1        1/12  $510
C100   P100   S2        2/12  $490

Dimensions
drill down to atomic data from dimensions or reference tables
A Query
  – List sales of Product P100 for each State for each Month of 1999?
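
As a rough illustration of how the query above could be run against a star schema, here is a minimal sketch using Python's standard sqlite3 module. The table and column names (sales, location, time_dim, amount and so on) and all sample rows are assumptions made for this example; the slides do not specify a physical schema.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory star schema with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product  (prod_id TEXT PRIMARY KEY, pname TEXT);
        CREATE TABLE location (loc_id INTEGER PRIMARY KEY, state TEXT);
        CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY,
                               year INTEGER, month INTEGER);
        -- Fact table: one row per sale, keyed by the dimension tables.
        CREATE TABLE sales (prod_id TEXT, loc_id INTEGER,
                            time_id INTEGER, amount REAL);
    """)
    conn.executemany("INSERT INTO product VALUES (?, ?)",
                     [("P100", "Nuts"), ("P200", "Bolts")])  # P200 name is hypothetical
    conn.executemany("INSERT INTO location VALUES (?, ?)",
                     [(1, "VIC"), (2, "TAS")])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                     [(1, 1999, 1), (2, 1999, 2), (3, 1998, 1)])
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("P100", 1, 1, 510.0),
        ("P100", 1, 1, 490.0),
        ("P100", 2, 2, 300.0),
        ("P200", 1, 1, 2000.0),
        ("P100", 1, 3, 50.0),   # 1998 sale, excluded by the year filter
    ])
    return conn


def sales_by_state_month(conn, product_id, year):
    """Total sales of one product for each state and month of one year."""
    sql = """
        SELECT l.state, t.month, SUM(s.amount) AS total
        FROM sales AS s
        JOIN location AS l ON l.loc_id = s.loc_id
        JOIN time_dim AS t ON t.time_id = s.time_id
        WHERE s.prod_id = ? AND t.year = ?
        GROUP BY l.state, t.month
        ORDER BY l.state, t.month
    """
    return conn.execute(sql, (product_id, year)).fetchall()


if __name__ == "__main__":
    for state, month, total in sales_by_state_month(build_demo(), "P100", 1999):
        print(state, month, total)
```

The fixed values (product, year) become filters on the dimension keys, while the "each" values (state, month) become the GROUP BY columns.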
[Figure: dimension constraints for the query. Product: P# = P100, PName = Nuts, PCat; Location: State = Each, Region; Time: Year = 1999, Month = Each]

Attributes & Attribute Hierarchies
each dimension table contains attributes
surrogate keys are commonly added to improve performance of joins between Fact tables and their associated Dimensions
attributes are used to search, filter or classify facts
Attribute Hierarchies: classification attributes, e.g. SALES_REGION VIC, TAS

Datamarts/Customization/Cubes
customization - select only the attributes and rows of interest for export to a datamart or data cube
apply coding techniques to the attributes of interest, suitable for the search algorithm to be used
each cell of a cube is a view consisting of an aggregation of interest
  – e.g. TOTAL_SALES
used as a performance improving technique to
  – pre-aggregate group-by cells
  – remove data not required for the problem at hand from the search algorithm

Business Intelligence & The DW
most enterprises have a data repository to allow data analysis to occur
databases provide enabling techniques
  – efficient data storage and access
  – query optimization
80% of knowledge discovery in databases (KDD) is the preparation of the data - this is the data warehouse
the evolution of the desktop, database, networks and AI/search has made it possible to perform KDD in commercial databases

The BI Process - 1
Understand and define the process
Perform data collection and extraction
Perform Data Cleaning and Exploration
Data Engineering
  – select attributes of interest
  – select records of interest
  – map attributes to suit DM algorithms

The BI Process - 2
Algorithm Engineering
  – which algorithm to use
  – ability to deal with
    » quality of input
    » quality of output
    » performance
Run
the data mining algorithm
Preliminary evaluation of the results
Refine the data and the problem
Use the results to implement a business strategy

A BI Model
[Figure: BI model. Labels: Learning, Analysis, Discovery, Pattern Recognition, Variables, Model, Prediction/Verification, Adaptive Modelling, Answer]
Return on Investment = (Profit from targeted customers buying Product X) / (Cost of Producing the Model and Predicting the Answer)

DM Techniques
Verification Driven Data Mining Techniques
  – Naive evaluation - exhaustive search
  – Random walk
  – ad hoc query
  – OLAP
  – Hypothesis testing - statistics
Discovery Driven Data Mining Techniques
  – Statistical Modeling (e.g. linear regression)
  – Visualization
  – Rule-based and inductive learning
  – Neural networks
  – Genetic algorithms (an optimization technique)

OLAP: On-Line Analytical Processing
an environment for the analysis of multidimensional data
  – dice
  – rotate
  – drill-down
  – rollup
OLAP provides advanced database support involving attribute selection, attribute encoding, row sampling, data cleansing and allows the use of multiple different search engines
  – easy to use user-interface
  – open system architecture using local processing power

References
Rob, P. & Coronel, C. Database Systems: Design, Implementation, and Management, 3rd Ed., Nelson, 1997
Inmon, W. H. - numerous. See http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example
Kimball, R. - numerous
Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii International Conference on System Sciences, 1998
Lee, A. J. and Rundensteiner, E. A. Data Warehouse Evolution: Consistent Metadata Management.
Gray, J. et al.
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery 1, pp. 29-53, 1997
Maier, D. et al. Selected Research Issues in Decision Support Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998
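
Supplementary sketch: the roll-up and drill-down operations named on the OLAP slide, and the summarization trade-off from the Fact Tables slide, can be illustrated in a few lines of plain Python. This is only a sketch under stated assumptions: the function names rollup and drill_down are invented for this example, and the P200 row is hypothetical data added so that the totals match the summary table shown on the Fact Tables slide.

```python
from collections import defaultdict

# Dimension names, in the order they appear in each fact row.
DIMS = ("cust", "prod", "salesrep", "date")

# Atomic fact rows: (cust, prod, salesrep, date, cost).
FACTS = [
    ("C100", "P100", "S1", "1/12", 510),
    ("C100", "P100", "S2", "2/12", 490),
    ("C100", "P200", "S1", "1/12", 2000),  # hypothetical row
]


def rollup(facts, keep):
    """Aggregate cost over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the measure is the last field
    return dict(totals)


def drill_down(facts, **fixed):
    """Return the atomic rows behind one aggregate cell."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] == v for d, v in fixed.items())
    ]


if __name__ == "__main__":
    # Roll up to (customer, product): salesrep and date detail is lost.
    print(rollup(FACTS, ("cust", "prod")))
    # Drill back down to the atomic rows behind the C100/P100 total.
    print(drill_down(FACTS, cust="C100", prod="P100"))
```

Rolling up reduces volume but discards the salesrep and date detail, which is why drill-down needs the atomic rows to be retained in the warehouse.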