Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
COMP207: Data Mining COMP207: Data Mining Dr. M. Sulaiman Khan ([email protected]) Dept. of Computer Science University of Liverpool 2010 Data Warehousing COMP207: Data Mining Today's Topics Data Warehouses Data Cubes Warehouse Schemas OLAP Materialisation Data Warehousing COMP207: Data Mining What is a Data Warehouse? Most common definition: “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon Corporate focused, assumes a lot of data, and typically sales related Data for “Decision Support System” or “Management Support System” 1996 survey: Return on Investment of 400+% Data Warehousing: Process of constructing (and using) a data warehouse Data Warehousing COMP207: Data Mining Subject-oriented: Focused on important subjects, not transactions Data Warehouse Concise view with only useful data for decision making Integrated: Constructed from multiple, heterogeneous data sources. Normally distributed relational databases, not necessarily same schema. Cleaning, pre-processing techniques applied for missing data, noisy data, inconsistent data (sounds familiar, I hope) Data Warehousing COMP207: Data Mining Data Warehouse Time-variant: Has different values for the same fields over time. Operational database only has current value. Data Warehouse offers historical values. Nonvolatile: Physically separate store Updates not online, but in offline batch mode only Read only access required, so no concurrency issues Data Warehousing COMP207: Data Mining Data Warehouse Data Warehouses are distinct from: Distributed DB: Integrated via wrappers/mediators. Far too slow, semantic integration much more complicated. Integration done before loading, not at run time. Operational DB: Only records current value, lots of extra non useful information. Different schemas/models, access patterns, users, functions, even though the data is derived from an operational db. Data Warehousing COMP207: Data Mining OLAP vs OLTP OLAP: Online Analytical Processing (Data Warehouse) OLTP: Online Transaction Processing (Traditional DBMS) OLAP data typically: historical, consolidated, and multidimensional (eg: product, time, location). Involves lots of full database scans, across terabytes or more of data. Typically aggregation and summarisation functions. Distinctly different uses to OLTP on the operational database. Data Warehousing COMP207: Data Mining Data Cubes Data is normally Multi-Dimensional, and can be thought of as a cube. Often: 3 dimensions of time, location and product. No need to have just 3 dimensions -- could have one for cars with make, colour, price, location, and time for example. Image courtesy of IBM OLAP Miner documentation Data Warehousing COMP207: Data Mining Data Cubes Can construct many 'cuboids' from the full cube by excluding dimensions. In an N dimensional data cube, the cuboid with N dimensions is the 'base cuboid'. A 0 dimensional cuboid (other than non existent!) is called the 'apex cuboid'. Can think of this as a lattice of cuboids... (Following lattice courtesy of Han & Kamber) Data Warehousing COMP207: Data Mining Lattice of Cuboids all time item 0-D(apex) cuboid location supplier 1-D cuboids time,item time,location item,location time,supplier time,item,location location,supplier 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplier Data Warehousing COMP207: Data Mining Multi-dimensional Units Each dimension can also be thought of in terms of different units. Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!) Location: continent, country, state, city, store Product: electronics, computer, laptop, dell, inspiron This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids. Data Warehousing COMP207: Data Mining Star-Net Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS TRUCK ORDER PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP DISTRICT SALES PERSON REGION DISTRICT COUNTRY DIVISION Geography Promotion Data Warehousing Organization COMP207: Data Mining Data Cube Operations Roll Up: Summarise data by climbing up hierarchy. Eg. From monthly to quarterly, from Liverpool to England Drill Down: Opposite of Roll Up Eg. From computer to laptop, from £100-999 to £100-199 Slice: Remove a dimension by setting a value for it Eg. location/product where time is Q1,2007 Dice: Restrict cube by setting values for multiple dimensions Eg. Q1,Q2 / North American cities / 3 products sub cube Pivot: Rotate the cube (mostly for visualisation) Data Warehousing COMP207: Data Mining Data Cube Schemas Star Schema: Single fact table in the middle, with connected set of dimension tables (Hence a star) Snowflake Schema: Some of the dimension tables further refined into smaller dimension tables (Hence looks like a snow flake) Fact Constellation: Multiple fact tables can share dimension tables (Hence looks like a collection of star schemas. Also called Galaxy Schema) Data Warehousing COMP207: Data Mining Star Schema Time Dimension Item Dimension time_key day day_of_week month quarter year item_key name brand type supplier_type Sales Fact Table time_key item_key location_key units_sold Loc.n Dimension location_key street city state country continent Measure (value) Data Warehousing COMP207: Data Mining Snowflake Schema Time Dimension Item Dimension time_key day day_of_week month quarter year item_key name brand type supplier_key Sales Fact Table time_key item_key location_key units_sold Loc Dimension location_key street city_key City Dimension city_key city state country Measure (value) Data Warehousing COMP207: Data Mining Fact Constellation Time Dimension Item Dimension time_key day day_of_week month quarter year item_key name brand type supplier_key Sales Fact Table time_key item_key Shipping Table time_key item_key from_key location_key units_shipped units_sold Loc Dimension location_key street city_key City Dimension city_key city state country Measure (value) Data Warehousing COMP207: Data Mining OLAP Server Types ROLAP: Relational OLAP Uses relational DBMS to store and manage the warehouse data Optimised for non traditional access patterns Lots of research into RDBMS to make use of! MOLAP: Multidimensional OLAP Sparse array based storage engine Fast access to precomputed data HOLAP: Hybrid OLAP Mixture of both MOLAP and ROLAP Data Warehousing COMP207: Data Mining Data Warehouse Architecture (also courtesy of Han & Kamber) Other sources Operational DBs Metadata Extract Transform Load Refresh Monitor & Integrator Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Data Warehousing Front-End Tools COMP207: Data Mining Materialisation In order to compute OLAP queries efficiently, need to materialise some of the cuboids from the data. None: Very slow, as need to compute entire cube at run time Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids Partial: But which ones to materialise? Called an 'iceberg cube', as only partially materialised and the rest is "below water". Many cells in a cuboid will be empty, only materialise sections that contain more values than a minimum threshold. Data Warehousing COMP207: Data Mining Further Reading http://en.wikipedia.org/wiki/Data_warehouse and subsequent links Data Warehousing