Principles of Knowledge Discovery in Data
Fall 2004
Chapter 2: Data Warehousing and OLAP
Dr. Osmar R. Zaïane, University of Alberta, 1999-2004

Summary of Last Chapter
• What kind of information are we collecting?
• What are Data Mining and Knowledge Discovery?
• What kind of data can be mined?
• What can be discovered?
• Is all that is discovered interesting and useful?
• How do we categorize data mining systems?
• What are the issues in Data Mining?
• Are there application examples?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 2 Objectives
Realize the purpose of data warehousing.
Comprehend the data structures behind data warehouses and understand the OLAP technology.
Get an overview of the schemas used for multi-dimensional data.

Data Warehouse and OLAP Outline
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?

Incentive for a Data Warehouse
• Businesses have a lot of data, operational data and facts.
• This data is usually in different databases and in different physical places.
• Data is available (or archived), but in different formats and locations (heterogeneous and distributed).
• Decision makers need to access information (data that has been summarized) virtually on one single site.
• This access needs to be fast regardless of the size of the data, and how old the data is.

Evolution of Decision Support Systems
• 1960s: Batch and manual reporting. Users: statistician, computer scientist. Difficult and limited queries, highly specific to some distinctive needs.
• 1970s: Terminal-based decision support systems. User: data analyst. Inflexible and non-integrated tools.
• 1980s: Desktop data analysis tools. User: data analyst. Flexible integrated spreadsheets; slow access to operational data.
• 1990s: Data warehousing and on-line analytical processing. User: executive. Integrated tools; data mining.

What Is a Data Warehouse?
• A data warehouse consolidates different data sources.
• A data warehouse is a database that is different and maintained separately from an operational database.
• A data warehouse combines and merges information in a consistent database (not necessarily up-to-date) to help decision support.
Decision support systems access the data warehouse and do not need to access operational databases → they do not unnecessarily overload operational databases.

Definitions
A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process. (W.H. Inmon)
Data Warehousing is the process of constructing and using data warehouses.
Definitions (con't)
A corporate data warehouse collects data about subjects spanning the whole organization. Data Marts are specialized, single-line-of-business warehouses. They collect data for a department or a specific group of people.
• Subject-oriented: oriented to the major subject areas of the corporation that have been defined in the data model.
• Integrated: data collected in a data warehouse originates from different heterogeneous data sources.
• Time-variant: the dimension "time" is all-pervading in a data warehouse. The data stored is not the current value, but an evolution of the value in time.
• Non-volatile: update of data does not occur frequently in the data warehouse. The data is loaded and accessed.

Building a Data Warehouse
Option 1: Consolidate data marts.
Option 2: Build from scratch.
[Figure: in Option 1, several data marts are combined into a corporate data warehouse; Option 2 starts from corporate data.]

Construction of Data Warehouse Based on Multi-dimensional Model
• Think of it as a cube with labels on each edge of the cube.
• The cube doesn't just have 3 dimensions, but may have many dimensions (N).
• Any point inside the cube is at the intersection of the coordinates defined by the edges of the cube.
• A point in the cube may store values (measurements) relative to the combination of the labeled dimensions.

Describing the Organization
Business Manager: "We sell products in various markets, and we measure our performance over time."
Data Warehouse Designer: "We sell Products in various Markets, and we measure our performance over Time."
[Figure: a cube with axes Products, Markets and Time.]

Concept-Hierarchies
Dimensions are hierarchical by nature: total orders or partial orders.
Example:
Location(continent → country → province → city)
Time(year → quarter → (month, week) → day)
Dimensions: Product, Region, Week
Hierarchical summarization paths:
Industry → Category → Product
Country → Region → City → Office
Year → Quarter → (Month, Week) → Day

On-Line Analytical Processing
• On-line analytical processing (OLAP) is essential for decision support.
• OLAP is supported by data warehouses.
• Data warehouses consolidate operational databases.
• The key structure of the data warehouse always contains some element of time.
• Owing to the hierarchical nature of the dimensions, OLAP operations view the data flexibly from different perspectives (different levels of abstraction).
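As a rough illustration of how a hierarchical dimension lets the same data be viewed at different levels of abstraction, the Python sketch below stores cube cells in a dictionary keyed by their dimension coordinates, then rolls the Location dimension up from city to province and slices on one category. The cell values, the `CITY_TO_PROVINCE` mapping and the function names are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical cube cells: (category, city, quarter) -> units sold.
# The numbers are made up purely for illustration.
cube = {
    ("Compact", "Edmonton", "Q1"): 10,
    ("Compact", "Calgary", "Q1"): 7,
    ("Mid-size", "Edmonton", "Q1"): 4,
    ("Mid-size", "Toronto", "Q2"): 9,
    ("Compact", "Toronto", "Q2"): 3,
}

# One step of the Location concept hierarchy: city -> province.
CITY_TO_PROVINCE = {
    "Edmonton": "Alberta",
    "Calgary": "Alberta",
    "Toronto": "Ontario",
}


def roll_up_location(cells, city_to_province):
    """Roll up the Location dimension from city to province by summing."""
    result = defaultdict(int)
    for (category, city, quarter), units in cells.items():
        result[(category, city_to_province[city], quarter)] += units
    return dict(result)


def slice_category(cells, wanted):
    """Slice: fix Category to one value and drop that dimension."""
    return {
        (city, quarter): units
        for (category, city, quarter), units in cells.items()
        if category == wanted
    }


by_province = roll_up_location(cube, CITY_TO_PROVINCE)
mid_size = slice_category(cube, "Mid-size")
```

Rolling up never changes the grand total; it only merges cells whose coordinates become equal at the coarser level. Drill-down is the reverse direction and needs the finer-grained cells to still be available.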
On-Line Transaction Processing
• Database management systems are typically used for on-line transaction processing (OLTP).
• OLTP applications normally automate clerical data processing tasks of an organization, like data entry and enquiry, transaction handling, etc. (access, read, update).
• Database is current, and consistency and recoverability are critical. Records are accessed one at a time.
– OLTP operations are structured and repetitive
– OLTP operations require detailed and up-to-date data
– OLTP operations are short, atomic and isolated transactions
Databases tend to be hundreds of MB to GB. Data warehouses tend to be in the order of TB.

OLAP operations:
• roll-up (increase the level of abstraction)
• drill-down (decrease the level of abstraction)
• slice and dice (selection and projection)
• pivot (re-orient the multi-dimensional view)
• drill-through (links to the raw data)

Data Warehouse OLAP Example
[Figure: a cube with dimensions Time (quarters Q1 to Q4), Location (cities in Alberta: Edmonton, Calgary, Lethbridge, Red Deer) and Category (Compact, Mid-size, Family, Minivan). Panels show drill-down on Q3 (months Jul, Aug, Sep), roll-up on Location (provinces of Canada: Maritimes, Quebec, Ontario, Prairies, Western Pr), slice on Category = mid-size, and pivot.] (Source: JH)

OLTP vs OLAP
                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day to day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response

Why Do We Separate DW From DB?
• Performance reasons:
– OLAP necessitates special data organization that supports multidimensional views.
– OLAP queries would degrade operational DB.
– OLAP is read only.
– No concurrency control and recovery.
• Decision support requires historical data.
• Decision support requires consolidated data.
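The contrast between the two workloads, and one reason analytical queries are kept away from the operational database, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `orders` table, its fields and both function names are hypothetical, and a dictionary lookup stands in for an index on the primary key.

```python
# Hypothetical operational table: one record per order, keyed by order id.
orders = {
    1001: {"customer": "A", "city": "Edmonton", "amount": 120.0},
    1002: {"customer": "B", "city": "Calgary", "amount": 80.0},
    1003: {"customer": "A", "city": "Edmonton", "amount": 50.0},
    1004: {"customer": "C", "city": "Red Deer", "amount": 200.0},
}


def oltp_update_amount(table, order_id, new_amount):
    """OLTP-style unit of work: touch one record found by its primary key."""
    table[order_id]["amount"] = new_amount
    return 1  # number of records accessed


def olap_total_by_city(table):
    """OLAP-style query: scan every record and summarize by city."""
    totals = {}
    scanned = 0
    for record in table.values():
        scanned += 1
        totals[record["city"]] = totals.get(record["city"], 0.0) + record["amount"]
    return totals, scanned


touched = oltp_update_amount(orders, 1002, 95.0)
totals, scanned = olap_total_by_city(orders)
```

The update touches a single record regardless of table size, while the summary has to read every record. On a table with millions of rows, running such scans against the operational database would compete with the short transactions it is meant to serve.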
Three-tier Architecture
[Figure: data sources (operational DBs and external sources) feed the data warehouse and data marts through extract, transform, load and refresh steps, supported by a monitor and integrator and by metadata; an OLAP server serves the client tools for analysis, query, reports and data mining.]

Data Sources
• Data sources are often the operational systems, providing the lowest level of data.
• Data sources are designed for operational use, not for decision support, and the data reflect this fact.
• Multiple data sources are often from different systems run on a wide range of hardware, and much of the software is built in-house or highly customized.
• Multiple data sources introduce a large number of issues, such as semantic conflicts.

Data Cleaning
• Data cleaning is important to the warehouse.
– Operational data from multiple sources are often noisy (may contain data that is unnecessary for DS).
• Three classes of tools.
– Data migration: allows simple data transformation.
– Data scrubbing: uses domain-specific knowledge to scrub data.
– Data auditing: discovers rules and relationships by scanning data (detect outliers).

Load and Refresh
• Loading the warehouse includes some other processing tasks:
– Checking integrity constraints, sorting, summarizing, building indices, etc.
• Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.
– When to refresh.
  • Determined by usage, types of data source, etc.
– How to refresh.
  • Data shipping: using triggers to update a snapshot log table and propagate the updated data to the warehouse.
  • Transaction shipping: shipping the updates in the transaction log.

Monitor
• Detect changes to an information source that are of interest to the warehouse.
– Define triggers in a full-functionality DBMS.
– Examine the updates in the log file.
– Write programs for legacy systems.
• Propagate the change in a generic form to the integrator.

Integrator
• Receive changes from the monitors.
– Make the data conform to the conceptual schema used by the warehouse.
• Integrate the changes into the warehouse.
– Merge the data with existing data already present.
– Resolve possible update anomalies.
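The monitor and integrator roles can be sketched as two small Python functions. This is only an illustration under assumptions that are not in the slides: changes are detected by comparing two snapshots of a source (one possible stand-in for the "write programs for legacy systems" case), the source fields `town` and `amt` and the warehouse fields `city`, `sales` and `active` are hypothetical, and a deleted source row is kept in the warehouse but flagged inactive.

```python
def monitor(previous, current):
    """Detect source changes by comparing two snapshots keyed by record id.

    Returns generic change records of the form (operation, key, row).
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


def conform(row):
    """Map a source row onto the (hypothetical) warehouse schema."""
    return {"city": row["town"].title(), "sales": float(row["amt"]), "active": True}


def integrate(warehouse, changes):
    """Merge the monitor's changes into the data already in the warehouse."""
    for operation, key, row in changes:
        if operation == "delete":
            # Assumption: history is kept, so the row is only flagged.
            if key in warehouse:
                warehouse[key]["active"] = False
        else:
            warehouse[key] = conform(row)
    return warehouse


old_snapshot = {1: {"town": "edmonton", "amt": "10"}, 2: {"town": "calgary", "amt": "5"}}
new_snapshot = {1: {"town": "edmonton", "amt": "12"}, 3: {"town": "red deer", "amt": "7"}}

warehouse = {key: conform(row) for key, row in old_snapshot.items()}
changes = monitor(old_snapshot, new_snapshot)
integrate(warehouse, changes)
```

Keeping the change records generic (an operation, a key and a row) is what allows one integrator to serve several monitors, each of which may detect changes in a different way.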
Metadata Repository
• Administrative metadata
– Source databases and their contents
– Gateway descriptions
– Warehouse schema, view and derived data definitions
– Dimensions and hierarchies
– Pre-defined queries and reports
– Data mart locations and contents
– Data partitions
– Data extraction, cleansing, transformation rules, defaults
– Data refresh and purge rules
– User profiles, user groups
– Security: user authorization, access control
• Business data
– Business terms and definitions
– Ownership of data
– Charging policies
• Operational metadata
– Data lineage: history of migrated data and sequence of transformations applied
– Currency of data: active, archived, purged
– Monitoring information: warehouse usage statistics, error reports, audit trails

Data Warehouse Design
Most data warehouses use a star schema to represent the multidimensional model.
Each dimension is represented by a dimension-table that describes it.
A fact-table connects to all dimension-tables with a multiple join. Each tuple in the fact-table consists of a pointer to each of the dimension-tables that provide its multi-dimensional coordinates, and stores measures for those coordinates.
The links between the fact-table in the centre and the dimension-tables in the extremities form a shape like a star. (Star Schema)

Example of Star Schema
Sales fact table: Date, Product, Store, Cust, unit_sales, dollar_sales
Dimension tables:
– Date: Date, Month, Year
– Product: ProductNo, ProdName, ProdDesc, Category
– Store: StoreID, City, State, Country, Region
– Customer: CustId, CustName, CustCity, CustCountry
(Source: JH)

Data Warehouses Design (con't)
• Modeling data warehouses: dimensions & measurements
• Star schema: a single object (fact table) in the middle connected to a number of objects (dimension tables).
– Each dimension is represented by one table → un-normalized (introduces redundancy).
– Ex: (Edmonton, Alberta, Canada, North America), (Calgary, Alberta, Canada, North America)
– Normalize dimension tables → snowflake schema.
• Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
• Fact constellations: multiple fact tables share dimension tables.

Example of Snowflake Schema
Sales fact table: Date, Product, Store, Cust, unit_sales, dollar_sales
Dimension tables:
– Date: Date, Month → Month: Month, Year → Year: Year
– Product: ProductNo, ProdName, ProdDesc, Category
– Store: StoreID, City → City: City, State → State: State, Country → Country: Country, Region
– Customer: CustId, CustName, CustCity, CustCountry
(Source: JH)

What Is the Best Design?
Performance benchmarking can be used to determine what is the best design.
Snowflake schema: easier to maintain dimension tables when the dimension tables are very large (reduces overall space).
Star schema: more effective for data cube browsing (fewer joins), which can affect performance.

Aggregation in Data Warehouses
Multidimensional view of data in the warehouse: stress on aggregation of measures by one or more dimensions.
[Figure: The Data Cube and the Sub-Space Aggregates. A group-by aggregates by Category (Drama, Comedy, Horror); a cross tab adds Time (Q1 to Q4) with sums by time and by category; the cube adds City (Red Deer, Lethbridge, Calgary, Edmonton) with aggregates by time, by category, by city, by time & city, by category & city, by time & category, and the overall sum.]

Construction of Multi-dimensional Data Cube
[Figure: a cube with dimensions Province (B.C., Prairies, Ontario, sum), Amount (0-20K, 20-40K, 40-60K, 60K-, sum) and Discipline (Algorithms, Database, ..., sum); one highlighted cell is labeled "All Amount, Algorithms, B.C.".]

A Star-Net Query Model
[Figure: radial lines, one per dimension, each marked with its levels of abstraction: Customer Orders (Contracts, Order), Shipping Method (Air-Express, Truck), Customer, Time (Annually, Quarterly, Daily), Product (Product Line, Product Group, Product Item), Geography (City, Country, Region), Organization (Sales Person, District, Division) and Promotion.]
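To tie the star schema to the sub-space aggregates of the data cube, the Python sketch below joins a tiny fact table to its dimension tables and then sums the measure for every subset of the dimensions. It is a minimal illustration only: the table contents, the `Quarter` attribute stored directly in the fact rows, and the function names `join_fact` and `cube_aggregates` are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical dimension tables, keyed by their identifiers.
store_dim = {
    1: {"City": "Edmonton", "State": "Alberta"},
    2: {"City": "Calgary", "State": "Alberta"},
}
product_dim = {
    10: {"ProdName": "Film A", "Category": "Drama"},
    11: {"ProdName": "Film B", "Category": "Comedy"},
}

# Fact table: each tuple points into the dimension tables and stores a measure.
sales_fact = [
    {"Store": 1, "Product": 10, "Quarter": "Q1", "unit_sales": 5},
    {"Store": 1, "Product": 11, "Quarter": "Q1", "unit_sales": 3},
    {"Store": 2, "Product": 10, "Quarter": "Q2", "unit_sales": 4},
    {"Store": 2, "Product": 10, "Quarter": "Q1", "unit_sales": 2},
]


def join_fact(fact_rows):
    """Resolve the fact table's pointers into dimension attributes."""
    return [
        {
            "City": store_dim[row["Store"]]["City"],
            "Category": product_dim[row["Product"]]["Category"],
            "Quarter": row["Quarter"],
            "unit_sales": row["unit_sales"],
        }
        for row in fact_rows
    ]


def cube_aggregates(rows, dimensions, measure):
    """Sum the measure for every subset of the dimensions (every group-by)."""
    result = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = {}
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] = totals.get(key, 0) + row[measure]
            result[group] = totals
    return result


rows = join_fact(sales_fact)
aggregates = cube_aggregates(rows, ("City", "Category", "Quarter"), "unit_sales")
```

With three dimensions this yields eight group-bys, from the empty grouping (the overall sum) to the full three-dimensional one. In a snowflake schema, `join_fact` would need extra lookups to walk from Store to City to State, which is the additional join cost weighed against easier maintenance.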
Implementation of the OLAP Server
ROLAP: Relational OLAP. Data is stored in tables in relational or extended-relational databases. An RDBMS manages the warehouse data and aggregations, often using a star schema.
• They support extensions to SQL.
• A cell in the multi-dimensional structure is represented by a tuple.
Advantage: scalable (no empty cells for a sparse cube).
Disadvantage: no direct access to cells.
Ex: Microstrategy, Metacube (Informix)

MOLAP: Multidimensional OLAP. Implements the multidimensional view by storing data in special multidimensional data structures (MDDS).
Advantage: fast indexing to pre-computed aggregations. Only values are stored.
Disadvantage: not very scalable; the cube can be sparse.
Ex: Essbase of Arbor

HOLAP: Hybrid OLAP. Combines ROLAP and MOLAP technology (scalability of ROLAP and faster computation of MOLAP).

View of Warehouses and Hierarchies with DBMiner
• Importing data
• Table browsing
• Dimension creation
• Dimension browsing
• Cube building
• Cube browsing

DBMiner Cube Visualization
[Figure: DBMiner cube visualization.]

Example DIVE-ON Project
[Figure: four numbered stages labeled Construction, Communication, Visualization and Interaction. Distributed databases form federated N-dimensional data warehouses (the virtual warehouse), linked through CORBA+XML to a local 3-dimensional data cube that an immersed user explores in the CAVE; components labeled VCU and UIM.]
[Figure: Data Source 1 and Data Source 2 each expose a schema, a DBMS and a query engine through a DBMS interface (Java API), with an ORB server and a SOAP server; the DCC shell contains a query distributor and query executer with an ORB client and a SOAP client.]

Issues
• Scalability
• Sparseness
• Curse of dimensionality
• Materialization of the multidimensional data cube (total, virtual, partial)
• Efficient computation of aggregations
• Indexing

Data Mining
Data mining requires integrated, consistent and cleaned data, which data warehouses can provide.
Data mining tools can interface with the OLAP engine to take advantage of the integrated and aggregated data, as well as the navigation power. This enables interactive and exploratory mining.
OLAP-based mining is referred to as OLAP mining or OLAM (on-line analytical mining).
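Returning to the data cube issues listed in this chapter (sparseness, curse of dimensionality, materialization), a back-of-the-envelope calculation in Python shows why they matter. The dimension sizes and the number of populated cells are invented for illustration, and concept hierarchies are ignored, so the cuboid count is a lower bound.

```python
def cuboid_count(num_dimensions):
    """Number of group-bys if every subset of the dimensions is materialized
    (concept hierarchies are ignored)."""
    return 2 ** num_dimensions


def dense_cell_count(cardinalities):
    """Cells a dense multidimensional array needs for the base cuboid."""
    cells = 1
    for size in cardinalities:
        cells *= size
    return cells


# Hypothetical dimension sizes: 1,000 products, 100 stores, 365 days.
cardinalities = [1000, 100, 365]
populated = 2_000_000  # assumed number of fact tuples that actually exist

dense = dense_cell_count(cardinalities)
density = populated / dense
```

Under these assumed numbers only about 5% of the dense array would hold a value. A tuple-per-cell relational layout stores just the populated cells, whereas a dense array reserves space for all of them. The number of group-bys doubles with every added dimension, which is why total materialization quickly becomes impractical.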