Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
1/29/2015 What is a Data Warehouse? • Defined in many different ways, but not rigorously. – A decision support database that is maintained separately from the COMP 465 Data Mining organization’s operational database – Support information processing by providing a solid platform of Data Warehousing consolidated, historical data for analysis. • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3rd ed. process.”—W. H. Inmon • Data warehousing: – The process of constructing and using data warehouses 1/28/2015 COMP 465: Data Mining Spring 2015 1 COMP 465: Data Mining Spring 2015 2 Data Warehouse—Integrated Data Warehouse—Subject-Oriented • Constructed by integrating multiple, heterogeneous data sources – relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied. – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • Organized around major subjects, such as customer, product, sales • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision • E.g., Hotel price: currency, tax, breakfast covered, etc. support process 1/29/2015 1/28/2015 – When data is moved to the warehouse, it is converted. COMP 465: Data Mining Spring 2015 3 1/29/2015 COMP 465: Data Mining Spring 2015 4 1 1/29/2015 Data Warehouse—Time Variant Data Warehouse—Nonvolatile • The time horizon for the data warehouse is significantly longer than that of operational systems – Operational database: current value data • A physically separate store of data transformed from the operational environment • Operational update of data does not occur in the data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) warehouse environment – Does not require transaction processing, recovery, and • Every key structure in the data warehouse concurrency control mechanisms – Contains an element of time, explicitly or implicitly – Requires only two operations in data accessing: – But the key of operational data may or may not contain “time element” COMP 465: Data Mining Spring 2015 1/29/2015 • initial loading of data and access of data 5 1/29/2015 OLTP OLAP clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated repetitive historical, summarized, multidimensional integrated, consolidated ad-hoc read/write index/hash on prim. key short, simple transaction lots of scans unit of work # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response usage access 1/29/2015 • • complex query COMP 465: Data Mining Spring 2015 6 Why a Separate Data Warehouse? OLTP vs. OLAP users COMP 465: Data Mining Spring 2015 • 7 High performance for both systems – DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery – Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation Different functions and different data: – missing data: Decision support requires historical data which operational DBs do not typically maintain – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled Note: There are more and more systems which perform OLAP analysis directly on relational databases 1/29/2015 COMP 465: Data Mining Spring 2015 8 2 1/29/2015 Three Data Warehouse Models Data Warehouse: A Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load Refresh Monitor & Integrator OLAP Server Serve Data Warehouse Analysis Query Reports Data mining • Enterprise warehouse – collects all of the information about subjects spanning the entire organization • Data Mart – a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart • Independent vs. dependent (directly from warehouse) data mart Data Marts Data Sources Data Storage OLAP Engine Front-End Tools • Virtual warehouse – A set of views over operational databases – Only some of the possible summary views may be materialized 1/29/2015 Extraction, Transformation, and Loading (ETL) COMP 465: Data Mining Spring 2015 10 Metadata Repository • Data extraction – get data from multiple, heterogeneous, and external sources • Data cleaning – detect errors in the data and rectify them when possible • Data transformation – convert data from legacy or host format to warehouse format • Load – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions • Refresh – propagate the updates from the data sources to the warehouse 1/29/2015 COMP 465: Data Mining Spring 2015 11 • • • • • • • Meta data is the data defining warehouse objects. It stores: Description of the structure of the data warehouse – schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents Operational meta-data – data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) The algorithms used for summarization The mapping from operational environment to the data warehouse Data related to system performance – warehouse schema, view and derived data definitions Business data – business terms and definitions, ownership of data, charging policies 1/29/2015 COMP 465: Data Mining Spring 2015 12 3 1/29/2015 From Tables and Spreadsheets to Data Cubes • Cube: A Lattice of Cuboids A data warehouse is based on a multidimensional data model which views all 0-D (apex) cuboid data in the form of a data cube • A data cube, such as sales, allows data to be modeled and viewed in time item location supplier 1-D cuboids multiple dimensions – Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) time,location time,item – Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables • item,location time,supplier location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids In data warehousing literature, an n-D base cube is called a base cuboid. time,item,location time,item,supplier The top most 0-D cuboid, which holds the highest-level of summarization, item,location,supplier 4-D (base) cuboid is called the apex cuboid. The lattice of cuboids forms a data cube. time, item, location, supplier 1/29/2015 COMP 465: Data Mining Spring 2015 13 COMP 465: Data Mining Spring 2015 1/29/2015 Example of Star Schema Conceptual Modeling of Data Warehouses time • Modeling data warehouses: dimensions & measures – Star schema: A fact table in the middle connected to a set of dimension tables – Snowflake schema: A refinement of star schema where item time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key some dimensional hierarchy is normalized into a set of location branch snowflake location_key branch_key branch_name branch_type – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called units_sold dollars_sold avg_sales galaxy schema or fact constellation COMP 465: Data Mining Spring 2015 item_key item_name brand type supplier_type branch_key smaller dimension tables, forming a shape similar to 1/29/2015 14 location_key street city state_or_province country Measures 15 1/29/2015 COMP 465: Data Mining Spring 2015 16 4 1/29/2015 Example of Fact Constellation Example of Snowflake Schema time time item time_key day day_of_the_week month quarter year item_key item_name brand type supplier_key Sales Fact Table time_key item_key supplier supplier_key supplier_type time_key day day_of_the_week month quarter year location_key branch location_key street city_key units_sold dollars_sold branch_key branch_name branch_type city city_key city state_or_province country avg_sales Measures COMP 465: Data Mining Spring 2015 1/29/2015 17 Europe country city office 1/29/2015 Germany Frankfurt ... ... ... Spain avg_sales 1/29/2015 L. Chan item_key shipper_key location to_location location_key street city province_or_state country dollars_cost COMP 465: Data Mining Spring 2015 units_shipped shipper shipper_key shipper_name location_key 18 shipper_type • Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning North_America Canada Vancouver ... COMP 465: Data Mining Spring 2015 dollars_sold time_key Data Cube Measures: Three Categories all region units_sold Measures A Concept Hierarchy: Dimension (location) all location_key Shipping Fact Table from_location branch_key location branch_key branch_name branch_type time_key item_key item_name brand type supplier_type item_key branch_key branch item Sales Fact Table ... • E.g., count(), sum(), min(), max() Mexico • Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function • E.g., avg(), min_N(), standard_deviation() Toronto • Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. ... M. Wind • E.g., median(), mode(), rank() 19 1/29/2015 COMP 465: Data Mining Spring 2015 20 5 1/29/2015 View of Warehouses and Hierarchies Multidimensional Data • Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Specification of hierarchies • Schema hierarchy Industry Region • Set_grouping hierarchy Category Country Quarter Product day < {month < quarter; week} < year Product {1..10} < inexpensive Month Date 3Qtr 4Qtr sum Day 22 Cuboids Corresponding to the Cube Total annual sales of TVs in U.S.A. all 0-D (apex) cuboid U.S.A Canada Mexico product Country TV PC VCR sum 2Qtr Month Week COMP 465: Data Mining Spring 2015 1/28/2015 1Qtr City Office A Sample Data Cube Year product,date country date 1-D cuboids product,country date, country 2-D cuboids sum 3-D (base) cuboid product, date, country 1/28/2015 COMP 465: Data Mining Spring 2015 23 1/28/2015 COMP 465: Data Mining Spring 2015 24 6 1/29/2015 Typical OLAP Operations • Roll up (drill-up): summarize data • Drill down (roll down): reverse of roll-up – by climbing up hierarchy or by dimension reduction – from higher level summary to lower level summary or detailed data, or introducing new dimensions • Slice and dice: project and select • Pivot (rotate): • Other operations Fig. 3.10 Typical OLAP Operations – reorient the cube, visualization, 3D to series of 2D planes – drill across: involving (across) more than one fact table – drill through: through the bottom level of the cube to its backend relational tables (using SQL) COMP 465: Data Mining Spring 2015 1/29/2015 25 A Star-Net Query Model Browsing a Data Cube Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK Time ANNUALY QTRLY DAILY PRODUCT LINE Product PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION Location Each circle is called a footprint DIVISION Promotion Organization • Visualization • OLAP capabilities • Interactive manipulation 7 1/29/2015 Design of Data Warehouse: A Business Analysis Framework Data Warehouse Design Process • • Four views regarding the design of a data warehouse Top-down, bottom-up approaches or a combination of both – Top-down: Starts with overall design and planning (mature) – Top-down view • allows selection of the relevant information necessary for the data warehouse – Bottom-up: Starts with experiments and prototypes (rapid) • From software engineering point of view – Waterfall: structured and systematic analysis at each step before proceeding to the next – Data source view • exposes the information being captured, stored, and managed by operational systems – Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around – Data warehouse view • Typical data warehouse design process • consists of fact tables and dimension tables – Choose a business process to model, e.g., orders, invoices, etc. – Business query view – Choose the grain (atomic level of data) of the business process – Choose the dimensions that will apply to each fact table record • sees the perspectives of data in the warehouse from the view of end-user COMP 465: Data Mining Spring 2015 1/28/2015 – Choose the measure that will populate each fact table record 29 Data Warehouse Development: A Recommended Approach Distributed Data Marts 30 Data Warehouse Usage • Multi-Tier Data Warehouse COMP 465: Data Mining Spring 2015 1/28/2015 Three kinds of data warehouse applications – Information processing • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs – Analytical processing • multidimensional analysis of data warehouse data Data Mart Data Mart Model refinement Enterprise Data Warehouse • supports basic OLAP operations, slice-dice, drilling, pivoting – Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools Model refinement Define a high-level corporate data model 1/29/2015 COMP 465: Data Mining Spring 2015 32 8 1/29/2015 From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM) Efficient Data Cube Computation • Data cube can be viewed as a lattice of cuboids • Why online analytical mining? – High quality of data in data warehouses • DW contains integrated, consistent, cleaned data – Available information processing structure surrounding data warehouses • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools – OLAP-based exploratory data analysis • Mining with drilling, dicing, pivoting, etc. – On-line selection of data mining functions • Integration and swapping of multiple mining functions, algorithms, and tasks 1/29/2015 COMP 465: Data Mining Spring 2015 – The bottom-most cuboid is the base cuboid – The top-most cuboid (apex) contains only one cell – How many cuboids in an n-dimensional cube with L levels? n T ( Li 1) i 1 • Materialization of data cube – Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) – Selection of which cuboids to materialize • Based on size, sharing, access frequency, etc. 33 1/29/2015 COMP 465: Data Mining Spring 2015 34 Summary • Data warehousing: A multi-dimensional model of a data warehouse – A data cube consists of dimensions & measures – Star schema, snowflake schema, fact constellations – OLAP operations: drilling, rolling, slicing, dicing and pivoting • Data Warehouse Architecture, Design, and Usage – Multi-tiered architecture – Business analysis design framework – Information processing, analytical processing, data mining, OLAM (Online Analytical Mining) • Implementation: Efficient computation of data cubes – Partial vs. full vs. no materialization 40 9