Download data cubes

Week 3 lecture slides
❙ Topics
❙ Data Warehouses
❙ Online Analytical Processing
❙ Introduction to Data Cubes
❙ Textbook reference: Chapter 3
Data Warehouses
❙ A data warehouse is a collection of data specifically
designed for data mining activities.
❙ One such data mining activity is Online Analytical
Processing (OLAP).
❙ OLAP is interactive analysis of multidimensional data
stored as data cubes.
OLAP: Online Analytical Processing
❙ OLAP is a decision support system commonly associated
with Data Mining
❙ OLAP supports interactive complex queries on aggregations
of data, e.g. sum, average, count
❙ Conceptually, aggregate data is stored as data cubes - also
known as a Multidimensional Database.
❙ Data Cubes are pre-calculated from individual data records
❙ Pre-calculation solves the slow response time normally
expected from complex query execution
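The pre-calculation idea can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any particular OLAP product: the records, the cell layout (count and sum of ages per cinema and gender) and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical individual records: (cinema, gender, age). Not from the lecture.
records = [
    ("Belgrave", "Male", 25),
    ("Belgrave", "Female", 31),
    ("Belgrave", "Male", 19),
    ("Cinema B", "Female", 40),
]


def precalculate(rows):
    """Scan the records once and build (cinema, gender) -> [count, sum of ages]."""
    cells = defaultdict(lambda: [0, 0])
    for cinema, gender, age in rows:
        cell = cells[(cinema, gender)]
        cell[0] += 1    # count
        cell[1] += age  # sum, from which the average can be derived
    return dict(cells)


# Done ahead of time, before any query arrives.
cube = precalculate(records)


def average_age(cinema, gender):
    """Answer a query from the stored aggregates without rescanning the records."""
    count, total = cube[(cinema, gender)]
    return total / count


print(average_age("Belgrave", "Male"))  # 22.0
```

The query touches one stored cell rather than every record, which is where the response-time saving comes from.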
Information Reporting Systems vs OLAP
❙ Routine or on-demand operational reports are fixed. An
ad-hoc report would require software development
❙ Database query systems provide ad-hoc reporting but are
inefficient (slow) for complex querying
❙ OLAP queries on pre-calculated data cubes handle a range
of complex queries in reasonable response time.
Data Mining and OLAP
❙ Data Mining investigates data in order to discover
actionable information
❙ OLAP provides different summarized views of data and is
therefore a data mining tool, aiding the discovery of useful
information
Data Mining/OLAP synergy examples:
❙ OLAP may identify anomalies for further investigation by
other data mining techniques
❙ The attributes used in the upper nodes of decision trees are
chosen because they are the most predictive. Hence they
would be good choices for Data Cube dimensions
❙ The break points used on continuous data type attributes
(e.g. age < 20) indicate effective bin sizes for continuous OLAP
dimensions
Normalized Database Schema
❙ Normalized data is the standard Relational Database
schema technique
❙ Normalization is designed to eliminate duplicated
(redundant) information rather than for speed of access
❙ It is not suited to OLAP
Star Schema
❙ OLAP systems favour schemas that optimize access time.
❙ The star schema has a fact table and connected dimension
tables
❙ The dimensions are chosen according to anticipated OLAP
queries, e.g.
  For movies seen more than 5 times, how often was each seen?
  For what movies is the average viewer aged over 30?
  How many people of each gender have been to each cinema?
❙ The fact table contains whatever key attributes make a fact
(record) unique.
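A star schema can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption about how the cinema example might be laid out: the table and column names (other than cID, cname, mID and mname, which appear later in these slides), the second cinema and all rows are hypothetical. It answers the third example query, "How many people of each gender have been to each cinema?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (layout assumed for illustration).
CREATE TABLE cinema (cID INTEGER PRIMARY KEY, cname TEXT);
CREATE TABLE movie  (mID INTEGER PRIMARY KEY, mname TEXT);
CREATE TABLE viewer (vID INTEGER PRIMARY KEY, gender TEXT, age INTEGER);

-- Fact table: one row per viewing, holding only keys into the dimensions.
CREATE TABLE viewing (
    cID INTEGER REFERENCES cinema(cID),
    mID INTEGER REFERENCES movie(mID),
    vID INTEGER REFERENCES viewer(vID)
);

-- Hypothetical sample data.
INSERT INTO cinema VALUES (1, 'Belgrave'), (2, 'Cinema B');
INSERT INTO movie  VALUES (1, 'Moulin Rouge'), (2, 'The Well');
INSERT INTO viewer VALUES (1, 'Male', 25), (2, 'Female', 31), (3, 'Male', 19);
INSERT INTO viewing VALUES (1, 1, 1), (1, 2, 1), (1, 1, 2), (2, 1, 3), (1, 2, 3);
""")

# Join the fact table to the two dimensions the query needs, then aggregate.
rows = conn.execute("""
    SELECT c.cname, v.gender, COUNT(DISTINCT f.vID)
    FROM viewing AS f
    JOIN cinema AS c ON c.cID = f.cID
    JOIN viewer AS v ON v.vID = f.vID
    GROUP BY c.cname, v.gender
    ORDER BY c.cname, v.gender
""").fetchall()

for cname, gender, people in rows:
    print(cname, gender, people)
```

Every dimension is one join away from the fact table, which is the property that gives the schema its star shape.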
Other OLAP Schemas
❙ The snowflake schema uses fact and dimension tables, but
normalizes the dimensions. That is, dimension tables are split into
separate tables.
❙ The fact constellation schema has multiple fact tables that share
dimension tables.
Aggregation
❙ As well as “fact” data, dimension tables also require
aggregate data
❙ Aggregations are chosen in anticipation of queries and may
be partially pre-calculated.
❙ Partial aggregations are stored by dimensions, in data cubes
Data Cubes
❙ Conceptually, aggregated data is in the form of an n-Cube,
where n is the number of dimensions.
❙ The size of each dimension determines the number of
subcubes.
E.g. 2 cinemas, 2 genders and 8 movies makes
2 by 2 by 8 = 32 subcubes
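The subcube count is simply the product of the dimension sizes. The short check below uses the sizes from the example; the dictionary name and the use of integer positions for each dimension value are assumptions for illustration.

```python
from itertools import product
from math import prod

# Dimension sizes from the example: 2 cinemas, 2 genders, 8 movies.
dimension_sizes = {"cinema": 2, "gender": 2, "movie": 8}

# Number of subcubes is the product of the sizes.
n_subcubes = prod(dimension_sizes.values())

# Enumerating every (cinema, gender, movie) position gives the same count.
positions = list(product(*(range(size) for size in dimension_sizes.values())))

print(n_subcubes)      # 32
print(len(positions))  # 32
```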
Data Cubes
❙ Each subcube contains specific information. For example, the
subcube at the bottom left front is:
  cID=1,(cname=Belgrave)
  mID=1,(mname=Moulin Rouge)
  Gender=Male
  contains aggregated data : # of viewers=56, Sum of ages of viewers=1456
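Storing a count and a sum, rather than an average, lets subcubes be combined later without losing accuracy. The sketch below uses the figures from the subcube above; the Subcube class, its method names and the second (Female) cell are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcube:
    """Aggregates held in one cell: number of viewers and sum of their ages."""
    viewers: int
    age_sum: int

    def average_age(self):
        return self.age_sum / self.viewers

    def combine(self, other):
        # Counts and sums add directly; averages would not.
        return Subcube(self.viewers + other.viewers, self.age_sum + other.age_sum)


# Belgrave / Moulin Rouge / Male, figures from the slide.
male = Subcube(viewers=56, age_sum=1456)
# Hypothetical neighbouring cell for Gender=Female.
female = Subcube(viewers=40, age_sum=1200)

print(male.average_age())  # 26.0
both = male.combine(female)
print(both.viewers, both.age_sum)
```

The average age for the slide's subcube is 1456 / 56 = 26, derived on demand from the two stored values.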
OLAP Aggregate Queries
❙ Data Cubes support aggregate queries by dimension, e.g.
How many people of each gender have been to each cinema?
Could be answered by a table such as…
OLAP Aggregate Queries
❙ Tables are calculated by taking sections of the cube along
the required dimensions, then calculating with the
information in the subcubes.
Multiple Views
❙ To solve a query like - How many people have seen Moulin
Rouge? We view the data cube by movie and calculate using the 4
relevant subcubes
❙ To solve a query like - What is the ratio of attendance at the same
movie at the two different cinemas? We view the data cube by
Movie and by Cinema.
❙ Thus storing partial aggregate data in subcubes supports multiple
views of the aggregate data.
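Both queries above can be sketched as one helper that keeps the requested dimensions and sums over the rest. Only the count of 56 comes from the slides; the other counts, the name "Cinema B" and the view function are assumptions for illustration.

```python
from collections import defaultdict

DIMS = ("cinema", "gender", "movie")

# (cinema, gender, movie) -> number of viewers
cube = {
    ("Belgrave", "Male", "Moulin Rouge"): 56,    # from the slides
    ("Belgrave", "Female", "Moulin Rouge"): 40,  # hypothetical
    ("Cinema B", "Male", "Moulin Rouge"): 30,    # hypothetical
    ("Cinema B", "Female", "Moulin Rouge"): 34,  # hypothetical
    ("Belgrave", "Male", "The Well"): 12,        # hypothetical
}


def view(cube, *keep):
    """View the cube by the named dimensions, summing over all the others."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, viewers in cube.items():
        totals[tuple(key[p] for p in positions)] += viewers
    return dict(totals)


# How many people have seen Moulin Rouge? Sum of the 4 relevant subcubes.
by_movie = view(cube, "movie")
print(by_movie[("Moulin Rouge",)])  # 160

# Ratio of attendance at the same movie at the two cinemas.
by_movie_cinema = view(cube, "movie", "cinema")
ratio = (by_movie_cinema[("Moulin Rouge", "Belgrave")]
         / by_movie_cinema[("Moulin Rouge", "Cinema B")])
print(ratio)  # 1.5
```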
Concept Hierarchies
❙ Concept Hierarchies provide summarization at different
levels of a dimension.
For example: The Time dimension might be:
Year -> Quarter -> Month -> Week -> Day
Example two: A location dimension might be:
Country -> Region -> City
So that we could view data summarised at each of these
levels of detail.
OLAP Operations
❙ Rollup – View data in more summarised form
❙ Drilldown – View data in more detailed form
❙ Slice – View data along part of one dimension
❙ Dice – View data along parts of two or more dimensions
❙ Pivot – View data from different orientations
Rollup
❙ Rollup is an OLAP tool that effectively summarizes the view
by combining subcubes.
Example: We are currently viewing the data by cinema and by
gender. A summarization is to view the data regardless of
gender. That is, gender is combined.
❙ Rollup decreases the level of detail provided in the view.
Drilldown
❙ Drilldown is an OLAP tool that expands the view by splitting
along a dimension. It is the opposite of rollup.
Example: We are currently viewing the data by cinema. We can
drill down to a view of the data that includes gender. That is,
the data cube is split along the gender dimension.
❙ Drilldown increases the level of detail provided in the view.
The most detailed view is individual records (the fact table).
Rollup / Drilldown Through Concept Hierarchies
❙ Rollup can also collapse data to higher levels of the
concept hierarchy.
❙ Drilldown expands to lower levels in the concept hierarchy.
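Moving up and down a concept hierarchy can be sketched with a Time dimension. The daily counts below are hypothetical, the Week level is left out to keep the example short, and the LEVELS table and summarise function are names made up for this illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical viewer counts at the most detailed level (Day).
daily = {
    date(2001, 1, 15): 10,
    date(2001, 2, 3): 5,
    date(2001, 4, 20): 7,
    date(2002, 1, 1): 3,
}

# Each level of the hierarchy maps a day to the group it belongs to.
LEVELS = {
    "year": lambda d: (d.year,),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}


def summarise(daily, level):
    """Total the daily counts at the requested level of the hierarchy."""
    totals = Counter()
    for day, viewers in daily.items():
        totals[LEVELS[level](day)] += viewers
    return dict(totals)


print(summarise(daily, "year"))     # rollup to the top of the hierarchy
print(summarise(daily, "quarter"))  # drilldown one level
```

Here the summaries are recomputed from the day-level data; a real system could equally store the higher levels as pre-calculated aggregates.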
Slice
❙ A slice is a selection on one dimension of a cube
Example: The cube with Cinema=Belgrave
Dice
❙ A Dice is a section of the cube, for example:
Total people who have been to see Moulin Rouge and
The Well.
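Slice and dice can both be sketched as a selection over cube cells. In the example below the select function, the name "Cinema B", the third movie and every count except 56 are assumptions for illustration.

```python
DIMS = ("cinema", "gender", "movie")

# (cinema, gender, movie) -> number of viewers
cube = {
    ("Belgrave", "Male", "Moulin Rouge"): 56,      # from the slides
    ("Belgrave", "Female", "Moulin Rouge"): 40,    # hypothetical
    ("Belgrave", "Male", "The Well"): 12,          # hypothetical
    ("Cinema B", "Male", "Moulin Rouge"): 30,      # hypothetical
    ("Cinema B", "Female", "The Well"): 8,         # hypothetical
    ("Cinema B", "Male", "Some Other Movie"): 5,   # hypothetical
}


def select(cube, **allowed):
    """Keep the cells whose value on each named dimension is in the allowed set."""
    return {
        key: viewers
        for key, viewers in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }


# Slice: a selection on one dimension (Cinema=Belgrave).
belgrave = select(cube, cinema={"Belgrave"})

# Dice from the slide: people who have been to see Moulin Rouge and The Well.
dice = select(cube, movie={"Moulin Rouge", "The Well"})
print(sum(dice.values()))  # 146

# A dice can also restrict several dimensions at once.
small = select(cube, cinema={"Belgrave"}, movie={"The Well"})
print(small)
```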
Pivot
❙ A pivot is the same cube viewed from a different
orientation.
Example:
Cinema by Gender
or
Gender by Cinema
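A pivot changes only the orientation of a view, not its values. The sketch below swaps the rows and columns of a small Cinema by Gender table; the counts, the name "Cinema B" and the pivot function are hypothetical.

```python
# Cinema by Gender: outer key is the row, inner key is the column.
cinema_by_gender = {
    "Belgrave": {"Male": 68, "Female": 40},  # hypothetical counts
    "Cinema B": {"Male": 35, "Female": 8},   # hypothetical counts
}


def pivot(table):
    """Swap rows and columns of a nested-dictionary table."""
    turned = {}
    for row, columns in table.items():
        for column, value in columns.items():
            turned.setdefault(column, {})[row] = value
    return turned


gender_by_cinema = pivot(cinema_by_gender)
print(gender_by_cinema["Male"]["Belgrave"])  # 68
```

Pivoting twice returns the original table, which confirms that no data is added or lost.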
Data Warehouse Tools
Data Warehouse Systems may include tools to support
warehouse setup, maintenance and usage (data mining)
❙ Back-end Tools: for extraction, cleaning, transformation,
refreshing etc
❙ Front-end Tools:
  ❙ To support specific tasks, e.g.
    multidimensional views (aggregations by attributes)
    rollup (summarize)
    drilldown (detail)
  ❙ Extended SQL queries, e.g.
    statistical analysis (mean, standard deviation..)
    time window operations (moving average..)
    comparison operations
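As a rough idea of what the statistical and time window operations compute, the sketch below works out a mean, a standard deviation and a moving average in plain Python. It does not show any product's extended SQL syntax; the weekly figures and the moving_average function are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical weekly attendance figures.
attendance = [10, 20, 30, 40]


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


print(mean(attendance))               # statistical analysis: mean
print(stdev(attendance))              # statistical analysis: sample standard deviation
print(moving_average(attendance, 2))  # time window operation
```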
Data Hierarchy
❙ Metadata is the logical view
❙ The Database Schema is the way the data is physically
stored
❙ Transformed data is the data configured for data mining
purposes
❙ Source data is the corporate operational data
Metadata
Metadata is data about data. The metadata contains the
❙ Business Model. The description of the data that is
presented to the users - entities, relationships, attributes.
The users’ view may be different for different user groups
❙ Administration Model. How the data is derived - the
source, extraction method, required transformation, when
it should be updated, the current status (current,
out-of-date..), user authorization and access control, where
the data is stored.
❙ Operational Model. Information about the usage of data -
usage statistics, error reports, audit trails.
Data Warehouse Architectures
❙ Middleware is an interfacing system to allow user access to
disparate source systems
❙ The Data Warehouse system may be a large Database
System with an extended query language to support data
mining
ROLAP versus MOLAP
❙ ROLAP (Relational Online Analytical Processing) is OLAP on
a standard relational database platform
  ❙ Star schemas are used to support OLAP operations
  ❙ Standard SQL is used to generate views
  ❙ The Decision Support Environment takes advantage of the
  well developed security, concurrency and maintenance
  features of relational database technology
  ❙ BUT.. relational databases come with overhead designed to
  support OLTP, which can make OLAP activities inefficient.
❙ MOLAP (Multidimensional OLAP) is designed specifically
for OLAP and is not based on a relational database
  ❙ Aggregated data is stored in multidimensional array
  structures.
  ❙ Specialized tools are used to generate views
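The multidimensional array storage used by MOLAP can be pictured with nested Python lists, where each dimension value maps to an array position. This is only a conceptual sketch: real MOLAP engines use their own storage formats, and the dimension values (apart from Belgrave, Moulin Rouge and The Well) and function names are assumptions.

```python
# Dimension values in a fixed order; a value's index is its array position.
cinemas = ["Belgrave", "Cinema B"]        # "Cinema B" is hypothetical
genders = ["Male", "Female"]
movies = ["Moulin Rouge", "The Well"]

# 2 x 2 x 2 array of viewer counts, initially empty.
array = [[[0 for _ in movies] for _ in genders] for _ in cinemas]


def position(cinema, gender, movie):
    """Translate dimension values into array indices."""
    return cinemas.index(cinema), genders.index(gender), movies.index(movie)


def store(cinema, gender, movie, viewers):
    c, g, m = position(cinema, gender, movie)
    array[c][g][m] = viewers


def lookup(cinema, gender, movie):
    c, g, m = position(cinema, gender, movie)
    return array[c][g][m]


store("Belgrave", "Male", "Moulin Rouge", 56)  # count from the slides
print(lookup("Belgrave", "Male", "Moulin Rouge"))  # 56
```

A cell is reached directly by its indices, with no join or search over rows, which is the contrast with the ROLAP approach.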
Data Marts
❙ Data Marts are mini data warehouses designed for specific
groups of users (e.g. Departments).
❙ Supports different logical views among different user
groups
❙ Keeps each data warehouse of manageable proportions
❙ Increased efficiency/response time as operations are on
smaller data warehouses
❙ Not all source system data is relevant to all user groups.
Multilayered Architecture
❙ A multilayered architecture incorporates most architectural
features
❙ Source extraction is a regular process (as new data is
gathered)
❙ Data mining also creates data that can be stored for future
use.
❙ The Central Repository is ideally scalable so as to grow as
data and user demand requires. It may use tertiary storage
technology and parallel processing units to serve data to
the data marts
Example: Indonesian Ministry of Education
❙ Subgroups of the Ministry with overlapping interests
  ❙ Directorate of Primary Education
  ❙ Directorate of General Secondary Education
  ❙ Directorate of Private Schooling
  ❙ Directorate of Vocational Education
  ❙ Directorate of Higher Education
  ❙ Coordinating Body for Private Universities
  ❙ General Directorate of Informal Education, Youth and Sport
❙ Recommended architecture: Integrated Data Marts forming
a Distributed Data Warehouse. Standardised client-server
interface for all Data Marts
Deborah Wyburn “Decision Support Systems in the MOEC,
Indonesia”, Master of Computing Studies Thesis, 2001