Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 4 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved. 1 May 22, 2017 Data Mining: Concepts and Techniques 2 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 3 What is a Data Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization’s operational database Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon Data warehousing: The process of constructing and using data warehouses 4 Data Warehouse—Subject-Oriented Organized around major subjects, such as customer, supplier, product, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process 5 Data Warehouse—Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied to Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. 6 Data Warehouse—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element” 7 Data Warehouse—Nonvolatile A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data 8 Operational Databases Vs Data Warehouses Online Transaction Processing (OLTP) Operational databases perform online transaction and query processing Day-to-day operations, e.g. purchasing, inventory, manufacturing, banking, payroll Online Analytical Processing (OLAP) Data Warehouse systems Data analysis and decision making Organize and present data in various formats 9 OLTP vs. OLAP 10 Why a Separate Data Warehouse? High performance for both systems Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation Different functions and different data: DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled Note: There are more and more systems which perform OLAP analysis directly on relational databases 11 Data Warehouse: A Multi-Tiered Architecture 12 Data Warehouse: A Multi-Tiered Architecture Bottom-tier Warehouse database server – relational database system Back-end tools and utilities feed data from operational databases – extract, clean, transform, load and refresh Data extracted using APIs called gateways – ODBC, OLEDB, JDBC Metadata repository Middle tier OLAP server – relational OLAP (ROLAP) or multidimensional OLAP (MOLAP) Top tier Front-end client layer – query, reporting tools, analysis tools, data mining tools 13 Three Data Warehouse Models Enterprise warehouse collects all of the information about subjects spanning the entire organization Data Mart a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart Independent vs. dependent (directly from warehouse) data mart Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized 14 Top-down and bottom-up approaches Top-down development of enterprise warehouse Systematic solution, minimizes integration problems Expensive, time consuming, lacks flexibility Bottom-up design, development, deployment of independent data marts Flexibility, low cost, rapid return of investment Problems in integrating disparate data marts into a consistent enterprise data warehouse 15 Data warehouse - Third approach Incremental and evolutionary method of development of warehouse Define high-level corporate data model Provides corporate-wide, consistent, integrated view of data among different subjects and users Independent data marts can be implemented in parallel with enterprise warehouse Distributed data marts can be constructed to integrate different data marts Finally, a multitier data warehouse is constructed – enterprise warehouse custodian of all warehouse data Then distributed to the various dependent data marts 16 Data warehouse construction 17 Extraction, Transformation, and Loading (ETL) Data extraction get data from multiple, heterogeneous, and external sources Data cleaning detect errors in the data and rectify them when possible Data transformation convert data from legacy or host format to warehouse format Load sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh propagate the updates from the data sources to the warehouse 18 Metadata Repository Meta data is the data defining warehouse objects. It stores: Description of the structure of the data warehouse Operational meta-data schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) The algorithms used for summarization Measure and dimension definition, data on granularity, partitions, subject areas, aggregation, summarization, predefined queries and reports 19 Metadata Repository The mapping from operational environment to the data warehouse Source databases and their content, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, security (user authorization and access control) Data related to system performance Indices and profiles that improve data access and retrieval and performance, rules for timing and scheduling of refresh , update and replication cycles Business data May 22, 2017 business terms and definitions, ownership of data, charging policies Data Mining: Concepts and Techniques 20 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 21 From Tables and Spreadsheets to Data Cubes A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables 22 2-D and 3-D Data Cubes 2-D Data: Dimensions – time, item, Measure – dollars_sold, Location – Vancouver 3-D Data: Dimensions – time, item, location, Measure – dollars_sold 23 3-D and 4-D Data cubes 3-D cube Dimensions – time, item, location 4-D cube as a series of 3-D cubes Dimensions – time, item, location, supplier 24 Data cube – multidimensional data storage Actual physical storage may differ from its logical representation Data cubes – n-dimensional In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. May 22, 2017 Data Mining: Concepts and Techniques 25 A Sample Data Cube 2Qtr 3Qtr 4Qtr sum U.S.A Canada Mexico Country TV PC VCR sum 1Qtr Date Total annual sales of TVs in U.S.A. sum 26 Cube: A Lattice of Cuboids all time 0-D (apex) cuboid item time,location time,item location supplier item,location time,supplier 1-D cuboids location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,location time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 27 Cuboids Corresponding to the Cube all 0-D (apex) cuboid product product,date date country product,country 1-D cuboids date, country 2-D cuboids 3-D (base) cuboid product, date, country 28 Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 29 Example of Star Schema time item time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location location_key street city state_or_province country Measures 30 Example of Snowflake Schema time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key branch_key branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key supplier supplier_key supplier_type location location_key street city_key city city_key city state_or_province country 31 Example of Fact Constellation time time_key day day_of_the_week month quarter year item Sales Fact Table time_key item_key item_name brand type supplier_type item_key location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures time_key item_key shipper_key from_location branch_key branch Shipping Fact Table location to_location location_key street city province_or_state country dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type 32 Schemas for Data Warehouse Data warehouses Fact constellation schema commonly used It can model multiple, interrelated subjects Data Mart Department subset of the data warehouse focusing on selected subjects Star or snowflake schema commonly used Geared towards modeling single subjects May 22, 2017 Data Mining: Concepts and Techniques 33 Dimensions – Concept Hierarchies Concept Hierarchies – sequence of mappings from a set of low-level concepts to higher-level concepts Many concept hierarchies are implicit within the database systems Attributes in a dimension may be related by a total order (hierarchy) or partial order (lattice) Total order: attributes of the dimension location Street < city < province_or_state < country Partial order: attributes of dimension time Day < {month < quarter; week} < year May 22, 2017 Data Mining: Concepts and Techniques 34 A Concept Hierarchy: Dimension (location) all all Europe region country city office Germany Frankfurt ... ... ... Spain North_America Canada Vancouver ... L. Chan ... ... Mexico Toronto M. Wind 35 Concept Hierarchy (a) location as a hierarchy (total order) (b) time as a lattice (partial order) Schema hierarchy – a concept hierarchy that is a total or a partial order among attributes in a database schema 36 Multidimensional Data Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region Year Category Country Quarter Product Product City Office Month Week Day Month 37 Concept Hierarchy Set-grouping hierarchy – concept hierarchy defined by discretizing or grouping values for a given dimension or attribute A total or partial order can be defined among groups of values 38 View of Warehouses and Hierarchies Specification of hierarchies Schema hierarchy day < {month < quarter; week} < year Set_grouping hierarchy {1..10} < inexpensive 39 Data Cube Measures: Three Categories Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data (without partitioning), the function can be computed in a distributed manner Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function E.g., count(), sum(), min(), max() E.g., avg(), min_N(), max_N(), standard_deviation() Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank() 40 Typical OLAP Operations Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: select and project Slice – select on one dimension resulting in a subcube Dice – select two or more dimensions to get subcube Pivot (rotate): Rotate the axes in the cube for visualization, transform 3D cube to a series of 2D planes 41 Typical OLAP Operations Other operations drill across: executes queries involving (across) more than one fact table drill through: the bottom level of a data cube to its back-end relational tables (using SQL) Ranking the top N or bottom N items in lists, computing averages, growth rates, interests, internal rates of return, depreciation, currency conversions and statistical functions OLAP offers analytical modeling capabilities Calculation engine for deriving ratios, variance, computing measures across multiple dimensions 42 Fig. 3.10 Typical OLAP Operations 43 OLAP Vs Statistical Databases Statistical database is a database system that is designed to support statistical applications SDBs focus on socieconomic applications Privacy issues regarding concept hierarchies Summarized socioeconomic data – view corresponding lowlevel data is controversial OLAP is targeted for business applications Designed to handle huge amounts of data efficiently 44 A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION Location Each circle is called a footprint DIVISION Promotion Organization 45 Browsing a Data Cube Visualization OLAP capabilities Interactive manipulation 46 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 47 Design of Data Warehouse: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view allows selection of the relevant information necessary for the data warehouse This information matches the current and future business needs Data source view exposes the information being captured, stored, and managed by operational systems Information documented at various level of details and accuracy, from individual data source tables to integrated data source tables 48 Design of Data Warehouse: A Business Analysis Framework Data warehouse view consists of fact tables and dimension tables Information stored inside data warehouse – precalculated totals and counts, information regarding source, data and time of origin for historical context Business query view sees the perspectives of data in the warehouse from the view of end-user 49 Data Warehouse Design Process Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record 50 Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart Data Mart Model refinement Enterprise Data Warehouse Model refinement Define a high-level corporate data model 51 Data Warehouse Design Process Goals of Data warehouse implementation – specific, achievable, measurable Determine the time and budget allocations, subset of organization to be modeled, number of data sources selected, number and types of departments to be served Initial deployment Initial installation, roll-out planning, training, and orientation, platform upgrades and maintenance Data warehouse administration Data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, performance 52 Data Warehouse Usage Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery by finding hidden patterns , associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools 53 From Online Analytical Processing (OLAP) to Online Analytical Mining (OLAM) Why online analytical mining? High quality of data in data warehouses DW contains integrated, consistent, cleaned data DW constructed by such preprocessing serves as valuable source of high-quality data for OLAP and data mining Available information processing structure surrounding data warehouses Information processing and data analysis infrastructure surrounding DW Accessing, integration, consolidation, transformation of multiple heterogeneous databases ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP analysis tools 54 From Online Analytical Processing (OLAP) to Online Analytical Mining (OLAM) OLAP-based exploration of multidimensional data Multidimensional data mining provides facilities for mining on different subsets of data and at varying levels of abstraction, by drilling, filtering, dicing, pivoting, etc. Data/knowledge visualization tools enhance the power of data mining On-line selection of data mining functions Integration of OLAP with data mining functions allows to select desired data mining functions and swap data mining tasks dynamically 55 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 56 The “Compute Cube” Operator Cube definition and computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) () SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys (city) (city, item) (city, item, year), (city,item),(city, year), (item,year), (city), (item), (year) () (item) (city, year) (year) (item, year) (city, item, year) 57 The “Compute Cube” Operator The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation A cube operator on n dimensions is equivalent to a collection of group by statements, one for each subset of the n dimensions Precomputation – compute all or some cuboids in advance Leads to fast response time, avoids some redundant computation Storage space may explode if all of the cuboids in a data cube are precomputed Curse of dimensionality – storage requirements excessive when many dimensions have associated concept hierarchies, each with multiple levels 58 Efficient Data Cube Computation Data cube can be viewed as a lattice of cuboids The bottom-most cuboid is the base cuboid The top-most cuboid (apex) contains only one cell How many cuboids in an n-dimensional cube with Li levels for dimension i? n T ( Li 1) i 1 If cube has 10 dimensions, each dimension has 5 levels (including all), total cuboids is 510 9.8x106 Size of each cuboid also depends on the cardinality (i.e. #of distinct values) of each dimension As #of dimensions, #of concept hierarchies, cardinality increases, storage space will exceed size of input 59 Efficient Data Cube Computation Materialization of data cube No materialization: do not precompute any of the “nonbase” cuboids Full materialization: precompute all of the cuboids Leads to computing expensive multidimensional aggregates on the fly, which is extremely slow Requires huge amounts of memory space Partial materialization: selectively compute a proper subset of the whole set of possible cuboids Subcube – only some of the cells may be precomputed for various cuboids 60 Partial materialization of cuboids Partial materialization of cuboids or subcubes should consider three factors: Identify the subset of cuboids or subcubes to materialize Exploit the materialized cuboids or subcubes during query processing Efficiently update the materialized cuboids or subcubes during load and refresh The selection of the subset of cuboids or subcubes should take into account Queries in the workload, their frequencies, their accessing costs, workload characteristics, cost of incremental updates, total storage requirements 61 Partial materialization of cuboids A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based Iceberg cube – data cube that stores only those cube cells whose aggregate value (e.g. count) above some minimum support threshold Shell cube – precomputing the cuboids for only a small number of dimensions (3 to 5) of a data cube Queries on additional combinations of the dimensions can be computed on-the-fly 62 Indexing OLAP Data: Bitmap Index Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06] Base table Cust C1 C2 C3 C4 C5 Region Asia Europe Asia America Europe Index on Region Index on Type Type RecIDAsia Europe America RecID Retail Dealer Retail 1 1 0 1 1 0 0 Dealer 2 2 0 1 0 1 0 Dealer 3 1 0 0 3 0 1 4 0 0 1 4 1 0 Retail 0 1 0 5 0 1 Dealer 5 63 Compression of Bitmap Indices Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage—majority of the bits are 0’s Two compression schemes: Byte-aligned Bitmap Code (BBC) Word-Aligned Hybrid (WAH) code Time and space required to operate on compressed bitmap is proportional to the total size of the bitmap Optimal on attributes of low cardinality as well as those of high cardinality. WAH out performs BBC by about a factor of two 64 Indexing OLAP Data: Join Indices Join index: JI(R-id, S-id) where R (R-id, …) S (S-id, …) Traditional indices map the values to a list of record ids It materializes relational join in JI file and speeds up relational join In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table. E.g. fact table: Sales and two dimensions location and item A join index on location maintains for each distinct location a list of R-IDs of the tuples recording the Sales in the location Join indices can span multiple dimensions to form composite join indices 65 Indexing OLAP Data: Join Indices Linkages between a sales fact table and dimension tables for location and item Join index tables based on linkages between the sales fact table and dimension tables for location and item 66 Efficient Processing OLAP Queries Determine which operations should be performed on the available cuboids Transform any selection, projection, drill-down, roll-up (group-by), etc. into corresponding SQL and/or OLAP operations, e.g., slice, dice = selection and/or projection operations on a materialized cuboid Determine which materialized cuboid(s) should be selected for OLAP op. This involves identifying all of the materialized cuboids that may potentially be used to answer the query Pruning the above set using knowledge of “dominance” relationships among the cuboids Estimating the costs of using the remaining materialized cuboids Selecting the cuboid with the least cost 67 Efficient Processing OLAP Queries Example Let the query to be processed be on {brand, province_or_state} with the condition “year = 2010”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2010 Which should be selected to process the query? Finer-granularity data cannot be generated from coarser-granularity data So, cuboid 2 cannot be used as country is more general concept than province_or_state 68 Efficient Processing OLAP Queries Cuboids 1, 3 and 4 can be used to process the query as they have the same set or superset of the dimensions in the query the selection clause in the query can imply the selection in the cuboid The abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province_or_state, respectively Cuboid 1 would cost most – item_name and city at lower level than brand and province_or_state If few year values associated with items in the cube, but there are several item_names for each brand, then cuboid 3 is smaller than cuboid 4 If efficient indices available for cuboid 4 then cuboid 4 is a better choice Explore indexing structures and compressed vs. dense array structs in MOLAP 69 OLAP Server Architectures Relational OLAP (ROLAP) Intermediate servers between a relational back-end server and client front-end tools Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware Include optimization for each DBMS backend, implementation of aggregation navigation logic, and additional tools and services Greater scalability than MOLAP technology E.g. DSS server of Microstrategy Multidimensional OLAP (MOLAP) Support multidimensional views of data through array-based multidimensional storage engine Map multidimensional views directly to data cube array structures Fast indexing to pre-computed summarized data 70 OLAP Server Architectures with multidimensional data stores, the storage utilization may be low if the data set is sparse Two-level storage representation: Sparse subcubes employ compression technology for efficient storage utilization Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer) Combine ROLAP and MOLAP Greater scalability of ROLAP and faster computation of MOLAP Denser subcubes are identified and stored as array structures Large volumes of detail data stored in relational database, while aggregations kept in MOLAP store Specialized SQL servers (e.g., Redbricks) Specialized support for SQL queries over star/snowflake schemas in a read-only environment 71 Storage of ROLAP ROLAP uses relational tables to store data for OLTP Base fact table – fact table associated with base cuboid Base fact table stores data at the abstraction level indicated by the join keys in the schema for given data cube Summary fact tables - Aggregated data stored in fact tables Some summary fact tables store both base fact table data and aggregated data Summary fact table :Single table for base and summary facts 72 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 73 Data Generalization Data generalization summarizes data by replacing relatively low-level values (numeric values for age) with higher-level concepts (young, middle, senior) or by reducing the number of dimensions (removing birth-date and telephone numbers for group of students) Ex: individual customer transactions prefer to view data summarized by customer groups according to geographic regions, frequency of purchases per group, customer income Concept description – form of data generalization, refers to a collection of data e.g. frequent_buyers, graduate_students generates descriptions for the characterization and comparison of the data Characterization provides a concise and succinct summarization of the given collection of data Concept or class comparison (discrimination) provides descriptions comparing two or more collections of data 74 Concept Description for Large Data Sets Complex data types and aggregation Many OLAP systems confine dimensions (attributes) to nonnumeric data and measures (aggregate functions) to numeric data Existing data types: numeric, non-numeric, spatial, text, image Aggregation of attributes in a database – sophisticated data types, e.g. collection of non-numeric data, merging of spatial regions, composition of images, integration of texts, grouping of object pointers User-control Vs. automation OLAP operations (drill-down, roll-up, slice, dice) are user controlled Desirable to have automated process to help users determine which dimensions should be included in the analysis, and the degree of generalization to obtain interesting summarization 75 Attribute-Oriented Induction (AOI) for Data Characterization AOI approach to concept description Proposed in 1989 (KDD ‘89 workshop) Data cube based materialized views of the data, which is precomputed in a data warehouse AOI – query-oriented, generalization-based, online data analysis technique Collect the task-relevant data (initial relation) using a relational database query Perform generalization by attribute removal or attribute generalization Apply aggregation by merging identical, generalized tuples and accumulating their respective counts Different forms of presentation – charts, rules, etc. 76 Attribute-Oriented Induction: An Example Example: Describe general characteristics of graduate students in the Big University database Data Mining query in data mining query language DMQL: use Big_University_DB mine characteristics as “Science_students” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in “graduate” Step 1: Data focusing should be performed before attribute-oriented induction, i.e. specification of the taskrelevant data 77 Attribute-Oriented Induction: An Example Specifying the set of relevant attributes may be difficult for the user User selects few attributes Ex: dimension birth_place defined by attributes city, province_or_state, country User specifies only city To allow generalization on birth_place dimension, other attributes should also be included i.e. system automatically includes province_or_state, country; helps generalize city to higher conceptual level User selects too many attributes E.g. in relevance to * (all attributes specified by from included) Relevance analysis: correlation-based, entropy-based methods, attribute subset selection where status in “graduate”: concept hierarchy exists for status Organize primitive level data into higher conceptual levels M.Sc., M.A., M.B.A., Ph.D., B.Sc., B.A. – graduate and undergraduate 78 Attribute-Oriented Induction: An Example Step 1. Data mining query transformed into relational query for collection of task-relevant set of data use Big_University_DB Select name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in {“M.Sc.”, “M.A.”, “M.B.A.”, “Ph.D.” } Each tuple in figure (next slide) is a conjunction of attribute-value pairs Step 2. Perform attribute-oriented induction on initial working relation by (1) attribute removal (2) attribute generalization Step 3. Present results in generalized relation, cross-tab, or rule forms 79 Class Characterization: An Example Name Initial Working Relation (taskrelevant) Gender Jim Woodman Scott Lachance Laura Lee … M F … Removed Retained Major Residence Phone # GPA M M F … Birth_date Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Sci,Eng, Bus City Removed Excl, VG,.. CS Gender Major Prime Generalized Relation Birth-Place Science Science … Country Age range Birth_region Age_range Residence GPA Canada Foreign … 20-25 25-30 … Richmond Burnaby … Very-good Excellent … Count 16 22 … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 80 Basic Principles of Attribute-Oriented Induction Data focusing: task-relevant data, including dimensions, and the result is the initial relation Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A How large is a “large set of distinct values for an attribute” Attribute generalization threshold control Generalized relation threshold control 81 Attribute-Oriented Induction – generalization control Attribute generalization threshold control: (1) One generalization threshold for all attributes (2) One threshold for each attribute If #of distinct values in an attribute > attribute threshold => perform further attribute removal or attribute generalization Attribute threshold typically 2-8, specified/default Generalized relation threshold control: Sets a threshold for the generalized relation If #of distinct tuples in the generalized relation > threshold => perform further generalization Either preset within a range 10-30, or set by an expert 82 Attribute-Oriented Induction – quantitative/statistical information Important to accumulate count and other aggregate values in the induction process Aggregate function count is associated with each database tuple Initialize count = 1 for each tuple in initial working relation Through attribute removal and attribute generalization, tuples may be generalized => groups of identical tuples Identical tuples are merged to form one tuple Count of new tuple = #of tuples merged Other popular aggregate functions to associate with each tuple – sum and avg 83 Attribute-Oriented Induction - Example Example: Big_University_DB name: large #of distinct values and no generalization operation defined on it => remove this attribute gender: only 2 distinct values => retained, no generalization major: concept hierarchy defined as {arts&science, engineering, business}. Attribute generalization threshold = 5 #of distinct values > 20 major is generalized by climbing the concept hierarchy birth_place: large #of distinct value => generalize Concept hierarchy defined as city < province_or_state < country 84 Attribute-Oriented Induction - Example If #of distinct values for country > threshold => remove birth_place Because even though a generalization operator exists for it, generalization threshold would not be satisfied If #of distinct values for country < threshold => generalize to birth_country birth_date: suppose a hierarchy exists that can generalize birth_date to age and age to age_range Let #of age ranges is small with respect to the attribute generalization threshold Generalization should take place residence: defined by attributes number, street, residence_city, residence_province_or_state, residence_country 85 Attribute-Oriented Induction - Example #of distinct values for number and street will be high Attributes number and street should be removed Generalize to residence_city which contains fewer distinct values phone#: too many distinct values, so remove gpa: suppose concept hierarchy exists that groups values into numeric intervals {3.75-4.00, 3.5-3.75,…}, which are grouped into descriptive values {excellent, good,…} The attribute can therefore be generalized 86 87 Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 88 Presentation of Generalized Results Generalized relation: Cross tabulation: Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Mapping results into cross tabulation form (similar to contingency tables). Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad( x) male( x) birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%]. 89 Mining Class Comparisons Comparison: Comparing two or more classes (or concepts) Target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes E.g. three classes person, address, item are not comparable whereas sales in the last three years are comparable classes Method: Data collection: collect relevant data by query processing and partition the set into the target class and the contrasting classes Dimension relevance analysis: if many dimensions, select only the highly relevant dimensions for further analysis. Correlation and entropy-based measures can be used for this step Synchronous generalization: Generalization is performed on the target class to the level controlled by a user/expert specified dimension threshold, resulting in prime target class relation Generalize contrasting class(es) to the same high level concepts 90 Mining Class Comparisons Compare tuples with the same high level descriptions Presentation of the derived comparisons: The resulting class comparison description can be visualized in the form of tables, graphs, and rules Presentation includes “contrasting” measure such as count% The user can adjust the comparison description by applying drilldown, roll-up, and other OLAP operations to the target and contrasting classes 91 Mining Class Comparisons Example: compare general properties between graduate students and undergraduate students at Big_University_DB use Big_University_DB mine comparison as “grad_vs_undergrad_students” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa for “graduate_student” where status in “graduate” versus “undergraduate_students” where status in “undergraduate” analyze count% from student 92 Mining Class Comparisons First, the query is transformed into two relational queries that collect two sets of task-relevant data – one for the initial target class working relation and the other for the initial contrasting class working relation. Initial working relations: the target class (graduate students) Initial working relations: the contrasting class (undergraduate students) 93 Mining Class Comparisons Second, dimension relevance analysis can be performed, when necessary, on the two classes of data. After this analysis, irrelevant or weakly relevant dimensions, such as name, gender, birth_place, residence, and phone# are removed from the resulting classes. Third, synchronous generalization is performed: Generalization is performed on the target class to the levels controlled by user/expert-specified dimension thresholds, forming the prime target class relation. The contrasting class is generalized to the same levels as those in the prime target class relation, forming the prime contrasting class(es) relation. 94 Mining Class Comparisons Finally, the resulting class comparison is presented in the form of tables, graphs, and/or rules Contrasting measures such as count% compare between target class and contrasting class Prime generalized relation for the target class (graduate students) Prime generalized relation for the contrasting class (undergraduate students) 95 Concept Description vs. Cube-Based OLAP Similarity: Data generalization Presentation of data summarization at multiple levels of abstraction Interactive drilling, pivoting, slicing and dicing Differences: OLAP has systematic preprocessing, query independent, and can drill down to rather low level AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions AOI works on the data which are not in relational forms 96 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented Induction Summary 97 Summary Data warehousing: A multi-dimensional model of a data warehouse A data cube consists of dimensions & measures Star schema, snowflake schema, fact constellations OLAP operations: drilling, rolling, slicing, dicing and pivoting Data Warehouse Architecture, Design, and Usage Multi-tiered architecture Business analysis design framework Information processing, analytical processing, data mining, OLAM (Online Analytical Mining) Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing OLAP servers: ROLAP, MOLAP, HOLAP Data generalization: Attribute-oriented induction 98 References (I) S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97 S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997 E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993. J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997. A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999. J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998. V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96 99 References (II) C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003 W. H. Inmon. Building the Data Warehouse. John Wiley, 1996 R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002 P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97 Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998 A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00. S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987. J. Widom. Research problems in data warehousing. CIKM’95. K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1), 2006, pp. 1-38. 100 May 22, 2017 Data Mining: Concepts and Techniques 101 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts (a) What Is a Data Warehouse? (b) Data Warehouse: A Multi-Tiered Architecture (c) Three Data Warehouse Models: Enterprise Warehouse, Data Mart, ad Virtual Warehouse (d) Extraction, Transformation and Loading (e) Metadata Repository Data Warehouse Modeling: Data Cube and OLAP (a) Cube: A Lattice of Cuboids (b) Conceptual Modeling of Data Warehouses (c) Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases (d) Dimensions: The Role of Concept Hierarchy (e) Measures: Their Categorization and Computation (f) Cube Definitions in Database systems (g) Typical OLAP Operations (h) A Starnet Query Model for Querying Multidimensional Databases Data Warehouse Design and Usage (a) Design of Data Warehouses: A Business Analysis Framework (b) Data Warehouses Design Processes (c) Data Warehouse Usage (d) From On-Line Analytical Processing to On-Line Analytical Mining Data Warehouse Implementation (a) Efficient Data Cube Computation: Cube Operation, Materialization of Data Cubes, and Iceberg Cubes (b) Indexing OLAP Data: Bitmap Index and Join Index (c) Efficient Processing of OLAP Queries (d) OLAP Server Architectures: ROLAP vs. MOLAP vs. HOLAP Data Generalization by Attribute-Oriented Induction (a) Attribute-Oriented Induction for Data Characterization (b) Efficient Implementation of Attribute-Oriented Induction (c) Attribute-Oriented Induction for Class Comparisons (d) Attribute-Oriented Induction vs. Cube-Based OLAP Summary 102