
LECTURE 4: DATA WAREHOUSING

4.1 DATA WAREHOUSES

Most common definition:

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.” - W. H. Inmon

● Corporate focused, assumes a lot of data, and typically sales related
● Data for a “Decision Support System” or “Management Support System”
● 1996 survey: Return on Investment of 400+%

Data Warehousing: the process of constructing (and using) a data warehouse.

Subject-oriented:
● Focused on important subjects, not transactions
● Concise view with only the data that is useful for decision making

Integrated:
● Constructed from multiple, heterogeneous data sources. Normally these are distributed relational databases, not necessarily with the same schema.
● Cleaning and pre-processing techniques are applied for missing data, noisy data and inconsistent data (sounds familiar, I hope)

Time-variant:
● Has different values for the same fields over time.
● An operational database only has the current value. A data warehouse offers historical values.

Non-volatile:
● Physically separate store
● Updates are not online, but in offline batch mode only
● Only read access is required, so there are no concurrency issues

Data warehouses are distinct from:

Distributed DB:
● Integrated via wrappers/mediators.
● Far too slow, and semantic integration is much more complicated.
● In a data warehouse, integration is done before loading, not at run time.

Operational DB:
● Only records the current value, and holds lots of extra information that is not useful for decision making, such as HR data.
● Different schemas/models, access patterns, users and functions, even though the warehouse data is derived from an operational DB.

@ St. Paul’s University

OLAP vs OLTP

OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)

OLAP typically:
● Uses data that is historical, consolidated, and multi-dimensional (eg: product, time, location).
● Involves lots of full database scans, across terabytes or more of data.
● Runs aggregation and summarisation functions.
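The difference in access pattern between OLTP and OLAP can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `sales` table, its columns and the toy rows are hypothetical and not part of the lecture, and a real warehouse would hold far more data in a dedicated system.

```python
import sqlite3

# Hypothetical toy data: (product, quarter, city, amount).
# All names and values are illustrative assumptions.
SALES = [
    ("laptop", "2007-Q1", "Liverpool", 900.0),
    ("laptop", "2007-Q2", "Liverpool", 1100.0),
    ("laptop", "2007-Q1", "Toronto", 1000.0),
    ("phone", "2007-Q1", "Liverpool", 300.0),
    ("phone", "2007-Q2", "Toronto", 500.0),
]


def build_db():
    """Create an in-memory database holding the toy sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (product TEXT, quarter TEXT, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SALES)
    return conn


def oltp_lookup(conn, product, quarter, city):
    """OLTP-style access: fetch one specific record."""
    row = conn.execute(
        "SELECT amount FROM sales WHERE product = ? AND quarter = ? AND city = ?",
        (product, quarter, city),
    ).fetchone()
    return row[0] if row else None


def olap_totals_by_product(conn):
    """OLAP-style access: scan every row and summarise per product."""
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    ).fetchall()
    return dict(rows)


conn = build_db()
print(oltp_lookup(conn, "phone", "2007-Q1", "Liverpool"))
print(olap_totals_by_product(conn))
```

The lookup touches a single row, while the summary has to read the whole table, which is why the two workloads are usually kept on separate systems.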
OLTP, by contrast, runs on the operational database.

● Warehouse data is normally multi-dimensional, and can be thought of as a cube.
● Often there are 3 dimensions: time, location and product.
● There is no need to have just 3 dimensions: a cube for cars could have make, colour, price, location and time, for example.

4.2 DATA CUBES

Many 'cuboids' can be constructed from the full cube by excluding dimensions. In an N-dimensional data cube, the cuboid with all N dimensions is the 'base cuboid'. The 0-dimensional cuboid (which is not empty: it holds the single overall summary) is called the 'apex cuboid'. The cuboids can be thought of as a lattice...

[Figure: Lattice of Cuboids]

Multi-dimensional Units

Each dimension can also be thought of in terms of different units.
● Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
● Location: continent, country, state, city, store
● Product: electronics, computer, laptop, dell, inspiron

This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.

[Figure: Star-Net Model]

Data Cube Operations

1. Roll Up: Summarise data by climbing up the hierarchy. Eg. from monthly to quarterly, from Liverpool to England
2. Drill Down: Opposite of Roll Up. Eg. from computer to laptop, from £100-999 to £100-199
3. Slice: Remove a dimension by setting a value for it. Eg. location/product where time is Q1 2007
4. Dice: Restrict the cube by setting values for multiple dimensions. Eg. the sub-cube of Q1 and Q2 / North American cities / 3 products
5. Pivot: Rotate the cube (mostly for visualisation)

Data Cube Schemas

● Star Schema: Single fact table in the middle, with a connected set of dimension tables (hence a star)
● Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables (hence it looks like a snowflake)
● Fact Constellation: Multiple fact tables can share dimension tables (hence it looks like a collection of star schemas.
Also called a Galaxy Schema)

[Figures: Star Schema; Snowflake Schema; Fact Constellation]

4.4 OLAP SERVER TYPES

ROLAP:
● Relational OLAP
● Uses a relational DBMS to store and manage the warehouse data
● Optimised for non-traditional access patterns
● Lots of existing RDBMS research to make use of!

MOLAP:
● Multidimensional OLAP
● Sparse array based storage engine
● Fast access to precomputed data

HOLAP:
● Hybrid OLAP
● Mixture of both MOLAP and ROLAP

[Figure: Data Warehouse Architecture]

4.5 MATERIALISATION

In order to compute OLAP queries efficiently, some of the cuboids need to be materialised from the data.
● None: Very slow, as the entire cube needs to be computed at run time
● Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids
● Partial: But which ones to materialise?

A partially materialised cube is called an 'iceberg cube', as only part of it is materialised and the rest is "below water". Many cells in a cuboid will be empty, so only materialise the sections that contain more values than a minimum threshold.
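The trade-off between full and partial materialisation can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: the dimension names, the toy fact rows and the helpers `cuboid`, `lattice` and `iceberg` are all hypothetical. It also assumes that each cell simply counts its facts, and that a cell is kept when its count is at least `min_count`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dimensions and fact rows (one value per dimension).
DIMENSIONS = ("time", "location", "product")
FACTS = [
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "phone"),
    ("Q2", "Liverpool", "laptop"),
    ("Q2", "Toronto", "phone"),
]


def cuboid(facts, dims):
    """Count the facts in each cell of the cuboid that keeps only `dims`."""
    idx = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in idx) for row in facts)


def lattice(facts):
    """Full materialisation: all 2^N cuboids, from the apex () to the base."""
    return {
        dims: cuboid(facts, dims)
        for k in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, k)
    }


def iceberg(facts, min_count):
    """Partial materialisation: keep only cells with at least `min_count` facts.

    Assumption: for clarity this filters the full lattice after computing it.
    A real iceberg-cube algorithm would avoid computing the discarded cells.
    """
    return {
        dims: {cell: n for cell, n in cells.items() if n >= min_count}
        for dims, cells in lattice(facts).items()
    }


full = lattice(FACTS)
partial = iceberg(FACTS, min_count=2)
print(len(full))  # number of cuboids in the lattice
print(sum(len(c) for c in full.values()))     # cells stored when fully materialised
print(sum(len(c) for c in partial.values()))  # cells stored in the iceberg cube
```

Even on five fact rows the iceberg cube stores noticeably fewer cells than the full lattice, and the gap grows quickly as dimensions and distinct values are added.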
