LECTURE 4:
DATA WAREHOUSING
4.1 DATA WAREHOUSES
Most common definition:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making process.” - W. H. Inmon
● Corporate focused, assumes a lot of data, and typically sales related
● Data for “Decision Support System” or “Management Support System”
● 1996 survey: Return on Investment of 400+%
Data Warehousing: Process of constructing (and using) a data warehouse
Subject-oriented:
● Focused on important subjects, not transactions
● Concise view with only useful data for decision making
Integrated:
● Constructed from multiple, heterogeneous data sources. Normally distributed
relational databases, not necessarily with the same schema.
● Cleaning, pre-processing techniques applied for missing data, noisy data, inconsistent
data (sounds familiar, I hope)
Time-variant:
● Has different values for the same fields over time.
● Operational database only has current value. Data Warehouse offers historical values.
Non-volatile:
● Physically separate store
● Updates not online, but in offline batch mode only
● Read only access required, so no concurrency issues
Data Warehouses are distinct from:
Distributed DB:
● A distributed DB is integrated via wrappers/mediators at run time.
● That is far too slow for analysis, and semantic integration at run time is much more
complicated.
● In a data warehouse, integration is done before loading, not at run time.
Operational DB:
● Only records the current value, and holds lots of extra information that is not useful
for decision making, such as HR data.
● Different schemas/models, access patterns, users and functions, even though the
warehouse data is derived from an operational DB.
@ St. Paul’s University
OLAP vs OLTP
OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)
OLAP typically:
● Works on historical, consolidated, and multi-dimensional data (eg: product, time, location).
● Involves lots of full database scans, across terabytes or more of data.
● Uses aggregation and summarisation functions.
OLTP is different, and runs on the operational database.

● Data is normally multi-dimensional, and can be thought of as a cube.
● Often: 3 dimensions of time, location and product.
● No need to have just 3 dimensions: could have one for cars with make, colour, price,
location, and time for example.
4.2 DATA CUBES
Many 'cuboids' can be constructed from the full cube by excluding dimensions. In an N
dimensional data cube, the cuboid with all N dimensions is the 'base cuboid'. The 0 dimensional
cuboid, which aggregates over every dimension into a single summary cell, is called the
'apex cuboid'. Together the 2^N cuboids can be thought of as a lattice of cuboids.
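As a rough illustration of the lattice, the sketch below enumerates every cuboid of a three dimensional cube by choosing which dimensions to keep. The dimension names and the helper `cuboid_lattice` are made up for this example and are not part of the lecture.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group all cuboids of a cube by how many dimensions they keep.

    Level 0 holds the apex cuboid, level N holds the base cuboid.
    """
    n = len(dimensions)
    return {k: list(combinations(dimensions, k)) for k in range(n + 1)}

# Hypothetical dimensions, matching the time/location/product example.
dims = ("time", "location", "product")
lattice = cuboid_lattice(dims)

for level in sorted(lattice):
    print(level, lattice[level])

# An N dimensional cube has 2**N cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

Each level of the dictionary corresponds to one row of the lattice: the apex cuboid is the empty tuple at level 0, and the base cuboid is the single tuple holding all the dimensions.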
Figure: Lattice of Cuboids
Multi-dimensional Units
Each dimension can also be thought of in terms of different units.
● Time: decade, year, quarter, month, day, hour
(also week, which isn't strictly hierarchical with the others!)
● Location: continent, country, state, city, store
● Product: electronics, computer, laptop, Dell, Inspiron
This is called a “Star-Net” model in data warehousing, and allows for various operations on
the dimensions and the resulting cuboids.
Figure: Star-Net Model
Data Cube Operations
1. Roll Up: Summarise data by climbing up a hierarchy. Eg. From monthly to quarterly,
from Liverpool to England
2. Drill Down: Opposite of Roll Up, moving down to more detailed data. Eg. From computer
to laptop, from £100-999 to £100-199
3. Slice: Remove a dimension by setting a single value for it. Eg. location/product where
time is Q1, 2007
4. Dice: Restrict the cube by setting values for multiple dimensions. Eg. Q1,Q2 / North
American cities / 3 products sub cube
5. Pivot: Rotate the cube (mostly for visualisation)
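The operations above can be sketched on a tiny cube held as a Python dictionary. This is only an illustration: the sales figures, the city to country mapping and the helper names (`roll_up`, `slice_cube`, `dice`) are all invented for the example, and a real OLAP server would not store a cube this way.

```python
from collections import defaultdict

# Hypothetical base cuboid: (month, city, product) -> units sold.
base = {
    ("2007-01", "Liverpool", "laptop"): 5,
    ("2007-02", "Liverpool", "laptop"): 7,
    ("2007-02", "London", "desktop"): 3,
    ("2007-04", "Toronto", "laptop"): 4,
}

# Assumed location hierarchy (city -> country), for illustration only.
CITY_TO_COUNTRY = {"Liverpool": "England", "London": "England", "Toronto": "Canada"}

def month_to_quarter(month):
    """Map 'YYYY-MM' to 'YYYY-Qn'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(cube, axis, parent_of):
    """Climb one level of the hierarchy on one axis, summing merged cells."""
    out = defaultdict(int)
    for key, value in cube.items():
        coarse = key[:axis] + (parent_of(key[axis]),) + key[axis + 1:]
        out[coarse] += value
    return dict(out)

def slice_cube(cube, axis, wanted):
    """Fix one value on an axis and drop that axis from the result."""
    return {key[:axis] + key[axis + 1:]: value
            for key, value in cube.items() if key[axis] == wanted}

def dice(cube, allowed):
    """Keep cells whose value on each listed axis is in the allowed set."""
    return {key: value for key, value in cube.items()
            if all(key[axis] in values for axis, values in allowed.items())}

by_quarter = roll_up(base, 0, month_to_quarter)            # monthly -> quarterly
by_country = roll_up(by_quarter, 1, CITY_TO_COUNTRY.get)   # city -> country
q1_slice = slice_cube(by_quarter, 0, "2007-Q1")            # location/product for Q1
sub_cube = dice(base, {0: {"2007-01", "2007-02"}, 2: {"laptop"}})
```

Drill down is not shown as a function because it cannot be computed from the rolled up result alone: it means going back to a more detailed cuboid, here from `by_quarter` back to `base`.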
Data Cube Schemas
● Star Schema: Single fact table in the middle, with a connected set of dimension tables
(Hence a star)
● Snowflake Schema: Some of the dimension tables further refined into smaller
dimension tables (Hence looks like a snow flake)
● Fact Constellation: Multiple fact tables can share dimension tables (Hence looks
like a collection of star schemas. Also called Galaxy Schema)
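A star schema can be sketched with Python's built-in `sqlite3` module. The table names, column names and rows below are assumptions made up for the example, not taken from the lecture. It shows one fact table referencing three dimension tables, and a typical query that joins the fact table to its dimensions and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table plus three dimension tables (hypothetical names).
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);
INSERT INTO dim_time VALUES (1, 'Q1', 2007), (2, 'Q2', 2007);
INSERT INTO dim_location VALUES (1, 'Liverpool', 'England'), (2, 'Toronto', 'Canada');
INSERT INTO dim_product VALUES (1, 'laptop', 'computer'), (2, 'desktop', 'computer');
INSERT INTO fact_sales VALUES (1, 1, 1, 12), (1, 1, 2, 3), (2, 2, 1, 4);
""")

# Units sold per quarter and country: join the fact table to two dimensions.
rows = conn.execute("""
SELECT t.quarter, l.country, SUM(f.units_sold)
FROM fact_sales AS f
JOIN dim_time AS t ON t.time_id = f.time_id
JOIN dim_location AS l ON l.location_id = f.location_id
GROUP BY t.quarter, l.country
ORDER BY t.quarter, l.country
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake schema, `dim_location` might itself be split further, for example into a separate country table referenced by a key. In a fact constellation, a second fact table could reuse the same dimension tables.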
Figure: Star Schema
Figure: Snowflake Schema
Figure: Fact Constellation
4.4 OLAP SERVER TYPES
ROLAP:
● Relational OLAP
● Uses relational DBMS to store and manage the warehouse data
● Optimised for non traditional access patterns
● Lots of research into RDBMS to make use of!
MOLAP:
● Multidimensional OLAP
● Sparse array based storage engine
● Fast access to precomputed data
HOLAP:
● Hybrid OLAP
● Mixture of both MOLAP and ROLAP
Figure: Data Warehouse Architecture
4.5 MATERIALISATION
In order to compute OLAP queries efficiently, some of the cuboids need to be materialised
(precomputed and stored) from the data.
● None: Very slow, as the entire cube needs to be computed at run time
● Full: Very fast, but requires a LOT of storage space and time to compute all
possible cuboids
● Partial: Materialise only some of the cuboids. But which ones to materialise?
The result of partial materialisation is called an 'iceberg cube', as only part is
materialised and the rest is "below water".
Many cells in a cuboid will be empty, so only materialise sections that contain more values than
a minimum threshold.
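The thresholding idea can be sketched as follows. This is a naive illustration under stated assumptions, not an efficient algorithm: it counts rows per cell for every cuboid and then keeps only the cells that reach the threshold. The measure (a row count), the inclusive comparison against `min_count`, the sample rows and the function name `iceberg_cube` are all assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dimensions, min_count):
    """Materialise only the cells whose row count reaches min_count.

    Returns {cuboid dimensions: {cell: count}}; cuboids with no surviving
    cell are left out entirely.
    """
    cube = {}
    indexes = range(len(dimensions))
    for k in range(len(dimensions) + 1):
        for dims in combinations(indexes, k):
            # Count the rows falling into each cell of this cuboid.
            counts = Counter(tuple(row[i] for i in dims) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            if kept:
                cube[tuple(dimensions[i] for i in dims)] = kept
    return cube

# Hypothetical fact rows: (quarter, city, product).
rows = [
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "London", "desktop"),
    ("Q2", "Liverpool", "laptop"),
]

cube = iceberg_cube(rows, ("time", "location", "product"), min_count=2)
for cuboid, cells in cube.items():
    print(cuboid, cells)
```

With `min_count=2`, sparse cells such as the single Q2 sale are never stored. A practical implementation would avoid computing the cells it is going to discard, which this sketch does not attempt.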