COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2009
These are the full course notes, though they are not quite complete. You should come to the lectures anyway. Really.
Data Warehousing
February 04, 2009
Slide 1
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation
What is a Data Warehouse?
Most common definition:
“A data warehouse is a subject-oriented, integrated, time-variant
and nonvolatile collection of data in support of management's
decision-making process.” - W. H. Inmon

Corporate-focused, assumes a large volume of data, and is typically sales-related.

Supplies the data for a “Decision Support System” or “Management Support System”.

A 1996 survey reported a return on investment of over 400%.

Data Warehousing: the process of constructing (and using) a data warehouse.
Data Warehouse

Subject-oriented:
- Focused on important subjects, not transactions
- Concise view with only the data useful for decision making

Integrated:
- Constructed from multiple, heterogeneous data sources: typically distributed relational databases that do not necessarily share a schema.
- Cleaning and pre-processing techniques are applied to handle missing, noisy and inconsistent data (sounds familiar, I hope)
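As a rough illustration of the "integrated" property, the sketch below maps records from two sources with different schemas onto one warehouse schema and repairs simple inconsistencies. The source layouts, the field names and the choice to fill missing quantities with 0 are all assumptions made for the example, not part of the course material.

```python
# Two hypothetical sources that describe sales with different schemas.
source_a = [
    {"prod": "laptop", "qty": 2, "city": "Liverpool"},
    {"prod": "phone", "qty": None, "city": "liverpool "},  # missing + inconsistent
]
source_b = [
    {"product_name": "Laptop", "units": 1, "location": "Leeds"},
]


def clean(product, units, city):
    """Map one record onto the warehouse schema, repairing simple problems."""
    return {
        "product": product.strip().lower(),           # consistent casing
        "units": 0 if units is None else int(units),  # assumed default for missing data
        "city": city.strip().title(),                 # consistent spelling
    }


# Integration happens here, before loading, not at query time.
integrated = (
    [clean(r["prod"], r["qty"], r["city"]) for r in source_a]
    + [clean(r["product_name"], r["units"], r["location"]) for r in source_b]
)

for row in integrated:
    print(row)
```

Real pipelines would handle missing values more carefully than a fixed default; the point is only that every source is translated to one schema before it reaches the warehouse.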
Data Warehouse

Time-variant:
- Has different values for the same fields over time.
- An operational database holds only the current value; a data warehouse offers historical values.

Nonvolatile:
- Physically separate store
- Updates are not made online, only in offline batch mode
- Only read access is required, so there are no concurrency issues
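The difference between the two stores can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `widget_price` field and a snapshot-style batch load; real warehouses use far more elaborate loading schemes.

```python
from datetime import date

# Operational store: one current value per key, overwritten in place.
operational = {}

# Warehouse store: append-only history of (key, as_of, value) rows.
warehouse = []


def update_operational(key, value):
    """Online update: the previous value is lost."""
    operational[key] = value


def batch_load(as_of):
    """Offline batch load: snapshot every current operational value."""
    for key, value in sorted(operational.items()):
        warehouse.append((key, as_of, value))


update_operational("widget_price", 10)
batch_load(date(2009, 1, 1))
update_operational("widget_price", 12)
batch_load(date(2009, 2, 1))

# The warehouse can answer "what was the value over time?"
history = [(d, v) for k, d, v in warehouse if k == "widget_price"]
print(operational["widget_price"], history)
```

Because rows are only ever appended during the batch load, readers never see a row change underneath them, which is why concurrency control is not needed.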
Data Warehouse
Data Warehouses are distinct from:

Distributed DB: Integrated at run time via wrappers/mediators, which is far too slow and makes semantic integration much more complicated. In a warehouse, integration is done before loading, not at run time.

Operational DB: Records only the current value, and holds lots of information that is not useful for analysis, such as HR data. The two differ in schemas/models, access patterns, users and functions, even though the warehouse data is derived from an operational DB.
OLAP vs OLTP
OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)
OLAP data is typically historical, consolidated and multidimensional (e.g. product, time, location).
Queries involve lots of full database scans, across terabytes or more of data.
They typically use aggregation and summarisation functions.
This is a distinctly different use from OLTP on the operational database.
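The contrast in access patterns can be seen on a toy table. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, its columns and its rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, product TEXT, "
    "quarter TEXT, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "laptop", "Q1", "Liverpool", 500.0),
        (2, "laptop", "Q1", "Leeds", 450.0),
        (3, "phone", "Q1", "Liverpool", 200.0),
        (4, "laptop", "Q2", "Liverpool", 550.0),
    ],
)

# OLTP-style: touch a single row identified by its key.
oltp_row = conn.execute(
    "SELECT product, amount FROM sales WHERE order_id = ?", (3,)
).fetchone()

# OLAP-style: scan every row and aggregate over dimensions.
olap_rows = conn.execute(
    "SELECT product, quarter, SUM(amount) FROM sales "
    "GROUP BY product, quarter ORDER BY product, quarter"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query is answered from one index lookup; the second has to read the whole table, which is why the two workloads are usually kept on separate systems.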
Data Cubes
Data is normally multi-dimensional, and can be thought of as a cube.
Often: 3 dimensions of time, location and product.
No need to have just 3 dimensions: a cube for cars could have make, colour, price, location and time, for example.
[Figure: example data cube. Image courtesy of IBM OLAP Miner documentation]
Data Cubes
Can construct many 'cuboids' from the full cube by excluding dimensions.
In an N-dimensional data cube, the cuboid with all N dimensions is the 'base cuboid'. The 0-dimensional cuboid, which aggregates over every dimension into a single total, is called the 'apex cuboid'.
Can think of this as a lattice of cuboids...
(Following lattice courtesy of Han & Kamber)
Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
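The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions, so N dimensions (ignoring hierarchies within a dimension) give 2^N cuboids. A minimal sketch using the four dimensions of the lattice; the function name `cuboid_lattice` is made up for the example.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]


def cuboid_lattice(dims):
    """Return {k: [cuboids with k dimensions]} for every k from 0 to len(dims)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(dimensions)
total = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    # The empty combination is the apex cuboid, printed as "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in level])
print("total cuboids:", total)
```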
Multi-dimensional Units
Each dimension can also be thought of in terms of different units.
- Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
- Location: continent, country, state, city, store
- Product: electronics, computer, laptop, Dell, Inspiron
This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.
Star-Net Model
[Figure: star-net diagram with one radial line per dimension and one marker per level. Dimensions shown: Customer Orders, Shipping Method, Customer, Time, Product, Geography, Promotion, Organization. Level labels shown: CONTRACTS, ORDER; AIR-EXPRESS, TRUCK; ANNUALLY, QTRLY, DAILY; PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM; DISTRICT, REGION, COUNTRY; SALES PERSON, DISTRICT, DIVISION.]
Data Cube Operations
Roll Up: Summarise data by climbing up a hierarchy.
Eg. From monthly to quarterly, from Liverpool to England

Drill Down: Opposite of Roll Up, moving to a finer level of detail
Eg. From computer to laptop, from £100-999 to £100-199

Slice: Remove a dimension by fixing a single value for it
Eg. location/product where time is Q1,2007

Dice: Restrict the cube by setting values for multiple dimensions
Eg. Q1,Q2 / North American cities / 3 products sub cube

Pivot: Rotate the cube (mostly for visualisation)
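Roll up, slice and dice can be sketched on a tiny cube held as a dictionary from (time, city, product) coordinates to a measure. The data and the function names are invented for the illustration; real OLAP engines implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical base cells: (month, city, product) -> units sold.
cells = {
    ("2007-01", "Liverpool", "laptop"): 5,
    ("2007-02", "Liverpool", "laptop"): 3,
    ("2007-02", "Toronto", "phone"): 4,
    ("2007-04", "Toronto", "laptop"): 2,
}


def roll_up_time(cube):
    """Roll up the time dimension from month to quarter, summing the measure."""
    out = defaultdict(int)
    for (month, city, product), units in cube.items():
        year, mm = month.split("-")
        quarter = f"Q{(int(mm) - 1) // 3 + 1},{year}"
        out[(quarter, city, product)] += units
    return dict(out)


def slice_time(cube, period):
    """Slice: fix one time value, leaving a (city, product) sub-cube."""
    return {(c, p): u for (t, c, p), u in cube.items() if t == period}


def dice(cube, periods, cities, products):
    """Dice: keep cells whose value on every dimension is in the allowed set."""
    return {
        key: u
        for key, u in cube.items()
        if key[0] in periods and key[1] in cities and key[2] in products
    }


quarterly = roll_up_time(cells)
q1_slice = slice_time(quarterly, "Q1,2007")
sub_cube = dice(quarterly, {"Q1,2007", "Q2,2007"}, {"Toronto"}, {"laptop", "phone"})

print(quarterly)
print(q1_slice)
print(sub_cube)
```

Drill down is the reverse of `roll_up_time`: going from `quarterly` back to `cells`, which is only possible because the finer-grained data has been kept.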
Data Cube Schemas
Star Schema: Single fact table in the middle, connected to a set of dimension tables
(Hence a star)
Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables
(Hence looks like a snowflake)
Fact Constellation: Multiple fact tables share dimension tables
(Hence looks like a collection of star schemas. Also called Galaxy Schema)
Star Schema
Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_type
Location Dimension: location_key, street, city, state, country, continent
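The star schema above translates fairly directly into relational tables. The sketch below builds it with Python's built-in `sqlite3` module; the table names (`time_dim`, `item_dim`, `location_dim`, `sales_fact`) and all the rows are assumptions made for the example, while the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           state TEXT, country TEXT, continent TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim,
                         item_key INTEGER REFERENCES item_dim,
                         location_key INTEGER REFERENCES location_dim,
                         units_sold INTEGER);
INSERT INTO time_dim VALUES (1, 4, 'Wed', 2, 'Q1', 2009), (2, 6, 'Mon', 4, 'Q2', 2009);
INSERT INTO item_dim VALUES (1, 'Inspiron', 'Dell', 'laptop', 'wholesale');
INSERT INTO location_dim VALUES (1, 'Bold St', 'Liverpool', 'Merseyside', 'UK', 'Europe');
INSERT INTO sales_fact VALUES (1, 1, 1, 3), (1, 1, 1, 2), (2, 1, 1, 4);
""")

# Units sold per quarter and city: one join per dimension that the query uses.
rows = conn.execute("""
SELECT t.quarter, l.city, SUM(f.units_sold)
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY t.quarter, l.city
ORDER BY t.quarter
""").fetchall()
print(rows)
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple.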
Snowflake Schema
Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_key
Location Dimension: location_key, street, city_key
City Dimension: city_key, city, state, country
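The practical effect of snowflaking is an extra join for each refined dimension. The sketch below keeps only the location branch of the schema (the time and item dimensions are left out for brevity) and uses Python's built-in `sqlite3` module; table names and rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
-- The location dimension no longer repeats city/state/country on every row.
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES city_dim);
CREATE TABLE sales_fact (location_key INTEGER REFERENCES location_dim,
                         units_sold INTEGER);
INSERT INTO city_dim VALUES (1, 'Liverpool', 'Merseyside', 'UK');
INSERT INTO location_dim VALUES (1, 'Bold St', 1), (2, 'Lark Lane', 1);
INSERT INTO sales_fact VALUES (1, 3), (2, 4);
""")

# Reaching the city now takes two joins instead of one.
rows = conn.execute("""
SELECT c.city, SUM(f.units_sold)
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
JOIN city_dim c ON l.city_key = c.city_key
GROUP BY c.city
""").fetchall()
print(rows)
```

The trade-off is less redundancy in the dimension tables against more joins at query time.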
Fact Constellation
Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Shipping Table: time_key, item_key, from_key, units_shipped (measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_key
Location Dimension: location_key, street, city_key
City Dimension: city_key, city, state, country
OLAP Server Types
ROLAP: Relational OLAP
- Uses a relational DBMS to store and manage the warehouse data
- Optimised for non-traditional access patterns
- Can draw on the large body of existing RDBMS research
MOLAP: Multidimensional OLAP
- Sparse array based storage engine
- Fast access to precomputed data
HOLAP: Hybrid OLAP
- Mixture of both MOLAP and ROLAP
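To make "sparse array based storage" a little more concrete, the sketch below stores the same facts once as relational rows and once as a sparse array keyed by coordinate indices. It is only an illustration under simplifying assumptions: production MOLAP engines use chunked and compressed array layouts rather than a Python dictionary, and all names and data here are invented.

```python
# Dimension members, in a fixed order so that each one gets an array index.
times = ["Q1", "Q2"]
cities = ["Liverpool", "Leeds", "Toronto"]
products = ["laptop", "phone"]

# ROLAP-style: facts kept as relational rows.
rows = [("Q1", "Liverpool", "laptop", 5), ("Q2", "Toronto", "phone", 4)]

# MOLAP-style: a sparse array that stores only the non-empty cells,
# keyed by the tuple of dimension indices.
sparse = {}
for t, c, p, units in rows:
    sparse[(times.index(t), cities.index(c), products.index(p))] = units


def lookup(t, c, p):
    """Direct cell access by coordinates; empty cells read as 0."""
    return sparse.get((times.index(t), cities.index(c), products.index(p)), 0)


capacity = len(times) * len(cities) * len(products)
print(f"{len(sparse)} of {capacity} cells stored")
print(lookup("Q1", "Liverpool", "laptop"), lookup("Q1", "Leeds", "phone"))
```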
Data Warehouse Architecture
(also courtesy of Han & Kamber)
[Figure: multi-tier warehouse architecture, left to right.
Data Sources: operational DBs and other sources, fed in via Extract, Transform, Load, Refresh.
Data Storage: the data warehouse and data marts, with metadata and a monitor & integrator.
OLAP Engine: OLAP server, which serves the front end.
Front-End Tools: analysis, query, reports, data mining.]
Materialisation
To answer OLAP queries efficiently, some of the cuboids need to be materialised (precomputed) from the data.

None: Very slow, as the entire cube must be computed at run time

Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids

Partial: But which ones to materialise? One answer is an 'iceberg cube', so called because it is only partially materialised and the rest is "below water". Many cells in a cuboid will be empty, so only materialise the cells that contain at least a minimum threshold number of values.
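The iceberg idea can be sketched naively: count the facts in every cell of every cuboid, then keep only the cells that reach the threshold. The data, the function name `iceberg_cube` and the use of "count at least `min_count`" as the condition are assumptions for the illustration. Note that this version computes all counts before filtering, which is exactly the cost a real iceberg-cube algorithm tries to avoid.

```python
from collections import Counter
from itertools import combinations

dims = ("time", "item", "location")
facts = [
    ("Q1", "laptop", "Liverpool"),
    ("Q1", "laptop", "Liverpool"),
    ("Q1", "phone", "Leeds"),
    ("Q2", "laptop", "Liverpool"),
]


def iceberg_cube(facts, dims, min_count):
    """Count every cell of every cuboid, keeping cells with count >= min_count."""
    cube = {}
    for k in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            # Project each fact onto the chosen dimensions and count duplicates.
            counts = Counter(tuple(row[i] for i in idx) for row in facts)
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            cube[tuple(dims[i] for i in idx)] = kept
    return cube


cube = iceberg_cube(facts, dims, min_count=2)
for cuboid, kept_cells in cube.items():
    print(cuboid or "all", kept_cells)
```

With a threshold of 2, the base cuboid keeps one cell out of the three that occur in the data, and every cuboid above it shrinks in the same way.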
Further Reading
Han, Chapters 3, 4
Dunham, Sections 2.1, 2.6, 2.7
Berry and Linoff, Chapter 15
Inmon, Building the Data Warehouse
Inmon, Managing the Data Warehouse
http://en.wikipedia.org/wiki/Data_warehouse and subsequent links