COMP527: Data Mining

M. Sulaiman Khan ([email protected])
Dept. of Computer Science, University of Liverpool, 2009

These are the full course notes, but they are not quite complete. You should come to the lectures anyway. Really.

Data Warehousing, February 04, 2009

Lectures in the course:

  Introduction to the Course
  Introduction to Data Mining
  Introduction to Text Mining
  General Data Mining Issues
  Data Warehousing
  Classification: Challenges, Basics
  Classification: Rules
  Classification: Trees
  Classification: Trees 2
  Classification: Bayes
  Classification: Neural Networks
  Classification: SVM
  Classification: Evaluation
  Classification: Evaluation 2
  Regression, Prediction
  Input Preprocessing
  Attribute Selection
  Association Rule Mining
  ARM: A Priori and Data Structures
  ARM: Improvements
  ARM: Advanced Techniques
  Clustering: Challenges, Basics
  Clustering: Improvements
  Clustering: Advanced Algorithms
  Hybrid Approaches
  Graph Mining, Web Mining
  Text Mining: Challenges, Basics
  Text Mining: Text-as-Data
  Text Mining: Text-as-Language
  Revision for Exam

Today's Topics

  Data Warehouses
  Data Cubes
  Warehouse Schemas
  OLAP
  Materialisation

What is a Data Warehouse?

Most common definition:

  “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon

  Corporate focused, assumes a lot of data, and typically sales related
  Data for a “Decision Support System” or “Management Support System”
  1996 survey: Return on Investment of 400+%

Data Warehousing: the process of constructing (and using) a data warehouse.

Data Warehouse

Subject-oriented:
  Focused on important subjects, not transactions
  Concise view with only useful data for decision making

Integrated:
  Constructed from multiple, heterogeneous data sources
  Typically distributed relational databases, not necessarily with the same schema
  Cleaning and pre-processing techniques applied for missing data, noisy data, inconsistent data (sounds familiar, I hope)

Time-variant:
  Has different values for the same fields over time
  An operational database only has the current value; a data warehouse offers historical values

Nonvolatile:
  Physically separate store
  Updates not online, but in offline batch mode only
  Read-only access required, so no concurrency issues

Data Warehouses are distinct from:

Distributed DB:
  Integrated via wrappers/mediators
  Far too slow, semantic integration much more complicated
  In a warehouse, integration is done before loading, not at run time

Operational DB:
  Only records the current value, and holds lots of extra non-useful information such as HR
  Different schemas/models, access patterns, users and functions, even though the warehouse data is derived from an operational DB

OLAP vs OLTP

OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)

OLAP data is typically historical, consolidated, and multidimensional (e.g. product, time, location).
Involves lots of full database scans, across terabytes or more of data.
Typically aggregation and summarisation functions.
Distinctly different uses to OLTP on the operational database.

Data Cubes

Data is normally multi-dimensional, and can be thought of as a cube.
Often: 3 dimensions of time, location and product.
No need to have just 3 dimensions: could have one for cars with make, colour, price, location, and time, for example.

[Figure: example data cube. Image courtesy of IBM OLAP Miner documentation]

Can construct many 'cuboids' from the full cube by excluding dimensions.
In an N-dimensional data cube, the cuboid with N dimensions is the 'base cuboid'.
A 0-dimensional cuboid (other than non-existent!) is called the 'apex cuboid'.
Can think of this as a lattice of cuboids... (Following lattice courtesy of Han & Kamber)

Lattice of Cuboids

  0-D (apex) cuboid:  all
  1-D cuboids:        time; item; location; supplier
  2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid:  time, item, location, supplier

Multi-dimensional Units

Each dimension can also be thought of in terms of different units.

  Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
  Location: continent, country, state, city, store
  Product: electronics, computer, laptop, dell, inspiron

This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.
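
Worked example: enumerating the lattice of cuboids

The lattice above can be generated mechanically: a cuboid is just a choice of which dimensions to keep. The following is a minimal sketch, assuming no hierarchies within the dimensions (so an N-dimensional cube has 2^N cuboids). The function name `cuboid_lattice` is made up for illustration and does not come from any OLAP product.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map k -> list of all cuboids that keep exactly k of the dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


# The four dimensions used in the lattice slide.
dimensions = ("time", "item", "location", "supplier")
lattice = cuboid_lattice(dimensions)

for k, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, labelled 'all' on the slide.
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {names}")

# Without hierarchies there are 2**N cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With the four dimensions above this gives 1 + 4 + 6 + 4 + 1 = 16 cuboids. Once each dimension also has several levels (year, quarter, month, ...) the count grows further, which is part of why full materialisation becomes expensive.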
Star-Net Model

[Figure: star-net diagram with one radial line per dimension (Customer Orders, Shipping Method, Customer, Time, Product, Geography, Promotion, Organization). Level labels on the lines include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP, SALES PERSON, DISTRICT, REGION, COUNTRY and DIVISION.]

Data Cube Operations

Roll Up: Summarise data by climbing up a hierarchy.
  E.g. from monthly to quarterly, from Liverpool to England
Drill Down: Opposite of Roll Up, moving to a more detailed level.
  E.g. from computer to laptop, from £100-999 to £100-199
Slice: Remove a dimension by setting a value for it.
  E.g. location/product where time is Q1 2007
Dice: Restrict the cube by setting values for multiple dimensions.
  E.g. the sub-cube for Q1, Q2 / North American cities / 3 products
Pivot: Rotate the cube (mostly for visualisation).

Data Cube Schemas

Star Schema: Single fact table in the middle, with a connected set of dimension tables (hence a star).
Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables (hence it looks like a snowflake).
Fact Constellation: Multiple fact tables can share dimension tables (hence it looks like a collection of star schemas; also called a Galaxy Schema).

Star Schema

  Sales Fact Table:    time_key, item_key, location_key, units_sold (measure)
  Time Dimension:      time_key, day, day_of_week, month, quarter, year
  Item Dimension:      item_key, name, brand, type, supplier_type
  Location Dimension:  location_key, street, city, state, country, continent

Snowflake Schema

  Sales Fact Table:    time_key, item_key, location_key, units_sold (measure)
  Time Dimension:      time_key, day, day_of_week, month, quarter, year
  Item Dimension:      item_key, name, brand, type, supplier_key
  Location Dimension:  location_key, street, city_key
  City Dimension:      city_key, city, state, country

Fact Constellation

  Sales Fact Table:    time_key, item_key, location_key, units_sold (measure)
  Shipping Fact Table: time_key, item_key, from_key, units_shipped (measure)
  Time Dimension:      time_key, day, day_of_week, month, quarter, year
  Item Dimension:      item_key, name, brand, type, supplier_key
  Location Dimension:  location_key, street, city_key
  City Dimension:      city_key, city, state, country

OLAP Server Types

ROLAP: Relational OLAP
  Uses a relational DBMS to store and manage the warehouse data
  Optimised for non-traditional access patterns
  Lots of research into RDBMS to make use of!
MOLAP: Multidimensional OLAP
  Sparse array based storage engine
  Fast access to precomputed data
HOLAP: Hybrid OLAP
  Mixture of both MOLAP and ROLAP

Data Warehouse Architecture

[Figure, also courtesy of Han & Kamber: Data Sources (Operational DBs, other sources) pass through Extract / Transform / Load / Refresh under a Monitor & Integrator, with Metadata, into Data Storage (Data Warehouse, Data Marts). The OLAP Engine (OLAP Server) serves the Front-End Tools (Analysis, Query, Reports, Data mining).]

Materialisation

In order to compute OLAP queries efficiently, need to materialise some of the cuboids from the data.

  None: Very slow, as need to compute the entire cube at run time
  Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids
  Partial: But which ones to materialise?

Partial materialisation is called an 'iceberg cube', as it is only partially materialised and the rest is "below water". Many cells in a cuboid will be empty, so only materialise sections that contain more values than a minimum threshold.

Further Reading

  Han, Chapters 3, 4
  Dunham, Sections 2.1, 2.6, 2.7
  Berry and Linoff, Chapter 15
  Inmon, Building the Data Warehouse
  Inmon, Managing the Data Warehouse
  http://en.wikipedia.org/wiki/Data_warehouse and subsequent links
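
Worked example: materialising a cuboid with an iceberg threshold

As a small illustration of the Materialisation slide, here is a minimal sketch of computing one cuboid from fact rows and keeping only the cells that pass a threshold. Everything in it is an assumption made for the example: the fact rows are invented, `materialise` is a hypothetical helper, and the threshold is interpreted as a minimum number of fact rows per cell (real systems may use other conditions).

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, city, product, units_sold).
FACTS = [
    ("Q1", "Liverpool", "laptop", 5),
    ("Q1", "Liverpool", "laptop", 3),
    ("Q1", "Chester", "laptop", 2),
    ("Q2", "Liverpool", "desktop", 4),
    ("Q2", "Liverpool", "laptop", 1),
    ("Q2", "Chester", "desktop", 7),
]
DIMENSIONS = ("quarter", "city", "product")


def materialise(facts, dims, min_rows=1):
    """Sum units_sold per cell of the cuboid over `dims`.

    Only cells backed by at least `min_rows` fact rows are kept
    (the iceberg condition assumed for this sketch).
    """
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in facts:
        cell = tuple(row[p] for p in positions)
        totals[cell] += row[3]
        counts[cell] += 1
    return {cell: totals[cell] for cell in totals if counts[cell] >= min_rows}


# Roll up away from product: the 2-D cuboid (quarter, city).
by_quarter_city = materialise(FACTS, ("quarter", "city"))

# Same cuboid, but only cells with at least 2 supporting rows are stored.
iceberg = materialise(FACTS, ("quarter", "city"), min_rows=2)

# The apex cuboid has no dimensions: one cell holding the grand total.
apex = materialise(FACTS, ())

print(by_quarter_city)
print(iceberg)
print(apex)
```

The iceberg version stores two cells instead of four; on a real warehouse, where most cells are empty or nearly so, the saving is what makes partial materialisation worthwhile.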