LECTURE 4: DATA WAREHOUSING

4.1 DATA WAREHOUSES

Most common definition:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.” - W. H. Inmon

Corporate focused, assumes a lot of data, and typically sales related.
Data for “Decision Support System” or “Management Support System”.
1996 survey: Return on Investment of 400+%.
Data Warehousing: Process of constructing (and using) a data warehouse.

Subject-oriented:
● Focused on important subjects, not transactions
● Concise view with only useful data for decision making

Integrated:
● Constructed from multiple, heterogeneous data sources. Normally distributed relational databases, not necessarily with the same schema.
● Cleaning and pre-processing techniques applied for missing data, noisy data, inconsistent data (sounds familiar, I hope)

Time-variant:
● Has different values for the same fields over time.
● Operational database only has the current value. Data Warehouse offers historical values.

Non-volatile:
● Physically separate store
● Updates not online, but in offline batch mode only
● Read only access required, so no concurrency issues

Data Warehouses are distinct from:

Distributed DB:
● Integrated via wrappers/mediators.
● Far too slow, and semantic integration is much more complicated.
● In a data warehouse, integration is done before loading, not at run time.

Operational DB:
● Only records the current value, with lots of extra non useful information such as HR.
● Different schemas/models, access patterns, users, functions, even though the warehouse data is derived from an operational db.

@ St. Paul’s University

OLAP vs OLTP

OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)

OLAP data is typically historical, consolidated, and multi-dimensional (eg: product, time, location). It involves lots of full database scans, across terabytes or more of data. Typically aggregation and summarisation functions.
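The aggregation and summarisation pattern described above can be sketched in a few lines of Python. This is only an illustration and not part of the lecture material: the `sales` rows, the dimension names and the `total_by` helper are all hypothetical, and a real warehouse would run the equivalent query over far more data.

```python
from collections import defaultdict

# Hypothetical historical sales facts: (product, quarter, city, amount).
# A warehouse keeps every period; an operational DB would hold only current values.
sales = [
    ("laptop", "2007-Q1", "Liverpool", 1200.0),
    ("laptop", "2007-Q2", "Liverpool", 900.0),
    ("phone", "2007-Q1", "Liverpool", 300.0),
    ("phone", "2007-Q1", "Chicago", 450.0),
    ("laptop", "2007-Q2", "Chicago", 1100.0),
]

# Column position of each (assumed) dimension within a row.
DIMENSIONS = {"product": 0, "quarter": 1, "city": 2}


def total_by(rows, *dims):
    """Scan every row and sum the amount, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


by_product = total_by(sales, "product")
by_quarter_city = total_by(sales, "quarter", "city")
print(by_product)
print(by_quarter_city)
```

Every call scans the full list of rows, which mirrors the full-scan access pattern noted above.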
OLTP is different, and runs on the operational database.

Data is normally multi-dimensional, and can be thought of as a cube. Often: 3 dimensions of time, location and product. There is no need to have just 3 dimensions: a cube for cars could have make, colour, price, location, and time, for example.

4.2 DATA CUBES

Can construct many 'cuboids' from the full cube by excluding dimensions.
In an N dimensional data cube, the cuboid with N dimensions is the 'base cuboid'.
The 0 dimensional cuboid, which aggregates over every dimension into a single total (rather than being non existent!), is called the 'apex cuboid'.
Can think of this as a lattice of cuboids...

[Figure: Lattice of Cuboids]

Multi-dimensional Units

Each dimension can also be thought of in terms of different units.
Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
Location: continent, country, state, city, store
Product: electronics, computer, laptop, dell, inspiron

This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.

[Figure: Star-Net Model]

Data Cube Operations

1. Roll Up: Summarise data by climbing up the hierarchy.
   Eg. From monthly to quarterly, from Liverpool to England
2. Drill Down: Opposite of Roll Up.
   Eg. From computer to laptop, from £100-999 to £100-199
3. Slice: Remove a dimension by setting a value for it
   Eg. location/product where time is Q1,2007
4. Dice: Restrict the cube by setting values for multiple dimensions
   Eg. Q1,Q2 / North American cities / 3 products sub cube
5. Pivot: Rotate the cube (mostly for visualisation)

Data Cube Schemas

Star Schema: Single fact table in the middle, with a connected set of dimension tables (Hence a star)
Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables (Hence looks like a snow flake)
Fact Constellation: Multiple fact tables can share dimension tables (Hence looks like a collection of star schemas. Also called Galaxy Schema)

[Figure: Star Schema]
[Figure: Snowflake Schema]
[Figure: Fact Constellation]

4.4 OLAP SERVER TYPES

ROLAP: Relational OLAP
● Uses relational DBMS to store and manage the warehouse data
● Optimised for non traditional access patterns
● Lots of research into RDBMS to make use of!

MOLAP: Multidimensional OLAP
● Sparse array based storage engine
● Fast access to precomputed data

HOLAP: Hybrid OLAP
● Mixture of both MOLAP and ROLAP

[Figure: Data Warehouse Architecture]

4.5 MATERIALISATION

In order to compute OLAP queries efficiently, need to materialise some of the cuboids from the data.

None: Very slow, as need to compute the entire cube at run time
Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids
Partial: But which ones to materialise?

One answer is the 'iceberg cube', so called because only part is materialised and the rest is "below water". Many cells in a cuboid will be empty, so only materialise the cells that contain more values than a minimum threshold.
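The difference between full and iceberg materialisation can be sketched as below. This is a rough illustration under stated assumptions rather than a production algorithm: the `facts` rows, the `DIMS` names and the helper functions are hypothetical, the aggregate is a simple row count, and the threshold is read as "at least `min_count`".

```python
from collections import Counter
from itertools import combinations

# Hypothetical fact rows: (time, location, product). Each row is one sale.
DIMS = ("time", "location", "product")
facts = [
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Chicago", "laptop"),
    ("Q2", "Liverpool", "phone"),
    ("Q2", "Chicago", "laptop"),
]


def cuboid(rows, dims):
    """Count rows per cell of the cuboid that keeps only `dims`."""
    idx = [DIMS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in idx) for row in rows)


def full_cube(rows):
    """Full materialisation: one cuboid per subset of dimensions (2**N in total)."""
    return {
        dims: cuboid(rows, dims)
        for k in range(len(DIMS) + 1)
        for dims in combinations(DIMS, k)
    }


def iceberg_cube(rows, min_count):
    """Partial materialisation: keep only cells with at least `min_count` rows.

    Assumption: the threshold is inclusive (>=); a strict '>' would work the same way.
    """
    return {
        dims: {cell: n for cell, n in cells.items() if n >= min_count}
        for dims, cells in full_cube(rows).items()
    }


cube = full_cube(facts)
iceberg = iceberg_cube(facts, min_count=2)
print(len(cube))                      # number of cuboids in the lattice
print(cube[()])                       # apex cuboid: a single grand-total cell
print(iceberg[("time", "location")])  # only the well-populated cells survive
```

Note that this sketch computes every cell first and filters afterwards, so it saves storage but not computation; practical iceberg cube algorithms avoid computing the sparsely populated cells in the first place.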