Week 3 lecture slides
❙ Topics
  ❙ Data Warehouses
  ❙ Online Analytical Processing
  ❙ Introduction to Data Cubes
❙ Textbook reference: Chapter 3

Data Warehouses
❙ A data warehouse is a collection of data specifically designed for data mining activities.
❙ One such data mining activity is Online Analytical Processing (OLAP).
❙ OLAP is interactive analysis of multidimensional data stored as data cubes.

OLAP: Online Analytical Processing
OLAP is a decision support system commonly associated with Data Mining
❙ OLAP supports interactive complex queries on aggregations of data: e.g. sum, average, count
❙ Conceptually, aggregate data is stored as data cubes, also known as a Multidimensional Database.
❙ Data Cubes are pre-calculated from individual data records
❙ Pre-calculation solves the slow response time normally expected from complex query execution

Information Reporting Systems vs OLAP
❙ Routine or on-demand operational reports are fixed. An ad-hoc report would require software development.
❙ Database query systems provide ad-hoc reporting but are inefficient (slow) for complex querying.
❙ OLAP queries on precalculated data cubes handle a range of complex queries in reasonable response time.

Data Mining and OLAP
❙ Data Mining investigates data in order to discover actionable information
❙ OLAP provides different summarized views of data and is therefore a data mining tool, aiding the discovery of useful information
Data Mining/OLAP synergy examples:
❙ OLAP may identify anomalies for further investigation by other data mining techniques
❙ The attributes used in the upper nodes of decision trees are chosen because they are the most predictive. Hence they would be good choices for Data Cube dimensions
❙ The break points used on continuous data type attributes (e.g. age < 20) indicate effective bin sizes for continuous OLAP dimensions

Normalized Database Schema
❙ Normalized data is the standard Relational Database schema technique
❙ Normalization is designed to eliminate duplicated (redundant) information rather than for speed of access
❙ It is not suited to OLAP

Star Schema
❙ OLAP systems favour schemas that optimize access time.
❙ The star schema has a fact table and connected dimension tables.
❙ The dimensions are chosen according to anticipated OLAP queries, e.g.
    For movies seen more than 5 times, how often was each seen?
    For what movies is the average viewer aged over 30?
    How many people of each gender have been to each cinema?
❙ The fact table contains whatever key attributes make a fact (record) unique.

Other OLAP Schemas
❙ The snowflake schema uses fact and dimension tables, but normalizes the dimensions. That is, dimension tables are split into separate tables.
❙ The fact constellation schema has multiple fact tables that share dimension tables.

Aggregation
❙ As well as “fact” data, dimension tables also require aggregate data
❙ Aggregations are chosen in anticipation of queries and may be partially precalculated.
❙ Partial aggregations are stored by dimensions, in data cubes

Data Cubes
❙ Conceptually, aggregated data is in the form of an n-Cube, where n is the number of dimensions. The size of each dimension determines the number of subcubes.
❙ E.g. 2 cinemas, 2 genders and 8 movies make 2 by 2 by 8 = 32 subcubes

Data Cubes
❙ Each subcube contains specific information. For example, the subcube at the bottom left front is:
    cID=1 (cname=Belgrave), mID=1 (mname=Moulin Rouge), Gender=Male
❙ It contains aggregated data: # of viewers=56, Sum of ages of viewers=1456

OLAP Aggregate Queries
❙ Data Cubes support aggregate queries by dimension, e.g. How many people of each gender have been to each cinema?
Could be answered by a table such as…

OLAP Aggregate Queries
❙ Tables are calculated by taking sections of the cube along the required dimensions, then calculating using the information in the subcubes.

Multiple Views
❙ To solve a query like "How many people have seen Moulin Rouge?" we view the data cube by movie and calculate using the 4 relevant subcubes
❙ To solve a query like "What is the ratio of attendance at the same movie at the two different cinemas?" we view the data cube by Movie and by Cinema.
❙ Thus storing partial aggregate data in subcubes supports multiple views of the aggregate data.

Concept Hierarchies
❙ Concept Hierarchies provide summarization at different levels of a dimension. For example:
    The Time dimension might be: Year -> Quarter -> Month -> Week -> Day
    A location dimension might be: Country -> Region -> City
  So we could view data summarised at each of these levels of detail.

OLAP Operations
❙ Rollup – View data in more summarised form
❙ Drilldown – View data in more detailed form
❙ Slice – View data along part of one dimension
❙ Dice – View data along parts of two or more dimensions
❙ Pivot – View data from different orientations

Rollup
❙ Rollup is an OLAP tool that effectively summarizes the view by combining subcubes.
    Example: We are currently viewing the data by cinema and by gender.
    A summarization is to view the data regardless of gender. That is, gender is combined.
❙ Rollup decreases the level of detail provided in the view.

Drilldown
❙ Drilldown is an OLAP tool that expands the view by splitting along a dimension. It is the opposite of rollup.
    Example: We are currently viewing the data by cinema.
    We can drill down to a view of the data that includes gender. That is, the data cube is split along the gender dimension.
❙ Drilldown increases the level of detail provided in the view. The most detailed view is individual records (the fact table).
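
The cube, rollup and drilldown ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's own implementation: the fact records, the second cinema name "Cinema2", and the helper names build_cube and view are all hypothetical. Each subcube stores a viewer count and a sum of ages, mirroring the "# of viewers" and "Sum of ages of viewers" aggregates in the subcube example.

```python
from collections import defaultdict

# Hypothetical fact records: (cinema, movie, gender, viewer_age).
# The values are invented for illustration; they are not the lecture's data.
FACTS = [
    ("Belgrave", "Moulin Rouge", "Male", 25),
    ("Belgrave", "Moulin Rouge", "Male", 31),
    ("Belgrave", "Moulin Rouge", "Female", 22),
    ("Belgrave", "The Well", "Female", 40),
    ("Cinema2", "Moulin Rouge", "Male", 19),
    ("Cinema2", "Moulin Rouge", "Female", 28),
    ("Cinema2", "The Well", "Male", 35),
]
DIMENSIONS = ("cinema", "movie", "gender")


def build_cube(facts):
    """Pre-calculate one subcube per (cinema, movie, gender) combination.

    Each subcube holds partial aggregates: [viewer count, sum of ages].
    """
    cube = defaultdict(lambda: [0, 0])
    for cinema, movie, gender, age in facts:
        cell = cube[(cinema, movie, gender)]
        cell[0] += 1
        cell[1] += age
    return dict(cube)


def view(cube, keep):
    """Combine subcubes so that only the dimensions named in `keep` remain.

    Keeping fewer dimensions is a rollup; keeping more is a drilldown.
    """
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(lambda: [0, 0])
    for key, (count, age_sum) in cube.items():
        cell = out[tuple(key[i] for i in idx)]
        cell[0] += count
        cell[1] += age_sum
    return dict(out)


cube = build_cube(FACTS)
by_cinema_gender = view(cube, ("cinema", "gender"))  # more detailed view
by_cinema = view(cube, ("cinema",))                  # rollup: gender combined

for (cinema,), (count, age_sum) in sorted(by_cinema.items()):
    # Averages are derived at query time from the stored count and sum.
    print(cinema, count, round(age_sum / count, 1))
```

Storing the count and the sum, rather than a pre-computed average, is what lets subcubes be combined: counts and sums add up across subcubes, whereas averages do not.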
Rollup / Drilldown Through Concept Hierarchies
❙ Rollup can also collapse data to higher levels of the concept hierarchy.
❙ Drilldown expands to lower levels in the concept hierarchy.

Slice
❙ A slice is a selection on one dimension of a cube
    Example: The cube with Cinema=Belgrave

Dice
❙ A dice is a section of the cube, for example:
    Total people who have been to see Moulin Rouge and The Well.

Pivot
❙ A pivot is the same cube viewed from a different orientation.
    Example: Cinema by Gender or Gender by Cinema

Data Warehouse Tools
Data Warehouse Systems may include tools to support warehouse setup, maintenance and usage (data mining)
❙ Back-end Tools: for extraction, cleaning, transformation, refreshing etc.
❙ Front-end Tools:
  ❙ To support specific tasks, e.g.
      multidimensional views (aggregations by attributes)
      rollup (summarize)
      drilldown (detail)
  ❙ Extended SQL queries, e.g.
      statistical analysis (mean, standard deviation, ...)
      time window operations (moving average, ...)
      comparison operations

Data Hierarchy
❙ Metadata is the logical view
❙ The Database Schema is the way the data is physically stored
❙ Transformed data is the data configured for data mining purposes
❙ Source data is the corporate operational data

Metadata
Metadata is data about data. The metadata contains the
❙ Business Model. The description of the data that is presented to the users: entities, relationships, attributes. The users’ view may be different for different user groups
❙ Administration Model. How the data is derived: the source, extraction method, required transformation, when it should be updated, the current status (current, out-of-date, ...), user authorization and access control, where the data is stored.
❙ Operational Model. Information about the usage of data: usage statistics, error reports, audit trails.
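
Returning to the slice, dice and pivot operations defined earlier, the following Python sketch shows one way they might look on a small pre-aggregated cube. It is an illustration only: apart from the Belgrave / Moulin Rouge / Male count of 56 taken from the subcube example, every count is invented, and "Cinema2" and the helper names select and crosstab are hypothetical.

```python
# Hypothetical pre-aggregated cube: (cinema, movie, gender) -> viewer count.
# Only the first count (56) comes from the lecture; the rest are invented.
CUBE = {
    ("Belgrave", "Moulin Rouge", "Male"): 56,
    ("Belgrave", "Moulin Rouge", "Female"): 60,
    ("Belgrave", "The Well", "Male"): 10,
    ("Belgrave", "The Well", "Female"): 14,
    ("Cinema2", "Moulin Rouge", "Male"): 30,
    ("Cinema2", "Moulin Rouge", "Female"): 45,
    ("Cinema2", "The Well", "Male"): 8,
    ("Cinema2", "The Well", "Female"): 12,
}
DIMENSIONS = ("cinema", "movie", "gender")


def select(cube, **allowed):
    """Keep the subcubes whose coordinates fall in the allowed value sets.

    Restricting one dimension gives a slice; restricting two or more
    dimensions gives a dice.
    """
    def matches(key):
        return all(key[DIMENSIONS.index(dim)] in values
                   for dim, values in allowed.items())
    return {key: n for key, n in cube.items() if matches(key)}


def crosstab(cube, row_dim, col_dim):
    """Sum counts into a two-way table; swapping the arguments pivots it."""
    r, c = DIMENSIONS.index(row_dim), DIMENSIONS.index(col_dim)
    table = {}
    for key, n in cube.items():
        row = table.setdefault(key[r], {})
        row[key[c]] = row.get(key[c], 0) + n
    return table


belgrave_slice = select(CUBE, cinema={"Belgrave"})                    # slice
belgrave_female = select(CUBE, cinema={"Belgrave"}, gender={"Female"})  # dice
cinema_by_gender = crosstab(CUBE, "cinema", "gender")
gender_by_cinema = crosstab(CUBE, "gender", "cinema")                 # pivot

print(sum(belgrave_female.values()))
print(cinema_by_gender)
print(gender_by_cinema)
```

The pivoted table holds exactly the same totals as the original one; only the roles of rows and columns are exchanged.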
Data Warehouse Architectures
❙ Middleware is an interfacing system to allow user access to disparate source systems
❙ The Data Warehouse system may be a large Database System with an extended query language to support data mining

ROLAP versus MOLAP
❙ ROLAP (Relational Online Analytical Processing) is OLAP on a standard relational database platform
  ❙ Star schemas are used to support OLAP operations
  ❙ Standard SQL is used to generate views
  ❙ The Decision Support Environment takes advantage of the well-developed security, concurrency and maintenance features of relational database technology
  ❙ BUT relational databases come with overhead designed to support OLTP, which can make OLAP activities inefficient.
❙ MOLAP (Multidimensional OLAP) is designed specifically for OLAP and is not based on a relational database
  ❙ Aggregated data is stored in multidimensional array structures.
  ❙ Specialized tools are used to generate views

Data Marts
❙ Data Marts are mini data warehouses designed for specific groups of users (e.g. Departments).
❙ Supports different logical views among different user groups
❙ Keeps each data warehouse of manageable proportions
❙ Increased efficiency/response time as operations are on smaller data warehouses
❙ Not all source system data is relevant to all user groups.

Multilayered Architecture
❙ A multilayered architecture incorporates most architectural features
❙ Source extraction is a regular process (as new data is gathered)
❙ Data mining also creates data that can be stored for future use.
❙ The Central Repository is ideally scalable so as to grow as data and user demand require.
It may use tertiary storage technology and parallel processing units to serve data to the data marts.

Example: Indonesian Ministry of Education
❙ Subgroups of the Ministry with overlapping interests
  ❙ Directorate of Primary Education
  ❙ Directorate of General Secondary Education
  ❙ Directorate of Private Schooling
  ❙ Directorate of Vocational Education
  ❙ Directorate of Higher Education
  ❙ Coordinating Body for Private Universities
  ❙ General Directorate of Informal Education, Youth and Sport
❙ Recommended architecture: Integrated Data Marts forming a Distributed Data Warehouse.
❙ Standardised client-server interface for all Data Marts

Deborah Wyburn, “Decision Support Systems in the MOEC, Indonesia”, Master of Computing Studies Thesis, 2001