Download Conceptual Modeling of Data Warehouses

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining and Data Warehousing:
Concepts and Techniques
Course
outlines
Conceptual Modeling of Data
Warehouses
Defining a Snowflake Schema in
Data Mining Query Language
DMQL
A multi-dimensional data model
Conceptual Modeling of Data Warehouses
Modeling data warehouses:
dimensions & measurements
Star schema:
A single object (fact table) in the middle connected to a number of objects
(dimension tables)
Snowflake schema:
A refinement of star schema where the dimensional hierarchy is represented
explicitly by normalizing the dimension tables.
Fact constellations:
Multiple fact tables share dimension tables.
2
Knowledge
Conceptual Modeling of Data Warehouses
Pattern
Evaluation
Data
Mining
Taskrelevant
Data
Example of Star Schema
Data
Warehouse
Data
Cleanin
g
Date
Day
Month
Year
Selection
Data Integration
Databases
Product
Sales Fact Table
Date
Store
Product
StoreID
City
State
Country
Region
Store
Customer
unit_sales
dollar_sales
Yen_sales
ProductNo
ProdName
ProdDesc
Category
QOH
Cust
CustId
CustName
CustCity
CustCountry
Measurements
3
Knowledge
Conceptual Modeling of Data Warehouses
Example of Snowflake
Pattern
Evaluation
Data
Mining
Taskrelevant
Data
Schema
Data
Warehouse
Data
Cleanin
g
Selection
Data Integration
Databases
Year
Year
Product
Month
Month
Year
Date
Day
Month
Store
City
City
State
State
Country
Country
Region
StoreID
City
State
Country
Sales Fact Table
Date
Product
Store
Customer
unit_sales
dollar_sales
Yen_sales
ProductNo
ProdName
ProdDesc
Category
QOH
Cust
CustId
CustName
CustCity
CustCountry
Measurements
4
Example of Fact Constellation
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
item_name
brand
type
supplier_type
item_key
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
time_key
item_key
shipper_key
from_location
branch_key
branch
Shipping Fact Table
location
to_location
location_key
street
city
province_or_street
country
dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type
5
Conceptual Modeling of Data Warehouses
A Data Mining Query Language:
DMQL Language Primitives
Cube Definition ( Fact Table )
Knowledge
Pattern
Evaluation
Data
Mining
Taskrelevant
Data
Data
Warehouse
Data
Cleanin
g
Selection
Data Integration
Databases
define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition ( Dimension Table )
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as “cube definition”
define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>
6
Conceptual Modeling of Data Warehouses
Defining a Snowflake Schema in DMQL
define cube sales [date, product, store, customer]:
dollar_sales = sum(sales_in_dollars),
yen_sales = sum(sales_in_yens), unit_sales = count(*)
define dimension product as
(product_no, prod_name,prod_desc, category, QOH)
define dimension cust as (custID, cust_name, cust_city, cust_country)
define dimension date as (day, month (month_key, year (year_key) ) )
define dimension store as ( storeID, city ( city_key,
state( state_key,
country(country_key, region)
)))
7
A concept hierarchy: dimension
(location)
all
all
Europe
region
country
city
office
Germany
Frankfurt
...
...
...
Spain
North_America
Canada
Vancouver ...
L. Chan
...
...
Mexico
Toronto
M. Young
8
View of Warehouses and Hierarchies
 Importing data
 Table Browsing
 Dimension creation
 Dimension browsing
 Cube building
 Cube browsing
9
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region
Year
Product
Category Country Quarter
Product
City
Office
Month Week
Day
Month
10
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data model which views
data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in multiple
dimensions
 Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables
 In data warehousing literature, an n-D base cube is called a base cuboid.
The top most 0-D cuboid, which holds the highest-level of summarization, is
called the apex cuboid. The lattice of cuboids forms a data cube.
11
Cube: A Lattice of Cuboids
all
0-D(apex) cuboid
time
item
location
supplier
1-D cuboids
time,item
time,location
item,location
time,supplier
time,item,location
location,supplier
item,supplier
2-D cuboids
time,location,supplier
3-D cuboids
time,item,supplier
item,location,supplier
time, item, location, supplier
4-D(base) cuboid
12
A multi-dimensional data model
A Sample Data Cube
2Qtr
3Qtr
4Qtr
sum
U.S.A
Canada
Mexico
Country
TV
PC
VCR
sum
1Qtr
Date
Total annual sales
of TV in U.S.A.
sum
13
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
product
product,date
date
country
product,country
1-D cuboids
date, country
2-D cuboids
3-D(base) cuboid
product, date, country
14
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
15
Data warehousing
Typical OLAP Operations
 Roll up (drill-up): summarize data; by climbing up hierarchy or by
dimension reduction
 Drill down (roll down): reverse of roll-up; from higher level
summary to lower level summary or detailed data, or introducing
new dimensions
 Slice and dice: project and select
 Pivot (rotate): reorient the cube, visualization, 3D to series of 2D
planes.
 Other operations
 drill across: involving (across) more than one fact table.
 drill through: through the bottom level to its back-end relational tables.
16
OLAP Operations
Roll Up
Drill Down
Single Cell
Multiple Cells
Slice
Dice
17
Drill-up, drill-down
Roll up
Alim.
Roll up
05-07
05
06
07
496
520
255
05
06
07
Dimension
Temps
1S05
2S05
1S06
2S06
1S07
Fruits
623
Fruits
221
263
139
Fruits
100
121
111
152
139
Viande
648
Viande
275
257
116
Viande
134
141
120
137
116
05
06
07
Pomme
20
19
22
…
…
…
…
Boeuf
40
43
48
Dimension
Produit
Drill down
Roll down
Drill down
Roll down
18
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time
Product
ANNUALY QTRLY
DAILY
PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location
Promotion
Organization
19
Design of a Data Warehouse
Design of a Data Warehouse: A Business Analysis Framework
Four different views regarding the design of a data warehouse must
be considered
1. Top-down view: Allows selection of the relevant information
necessary for the data warehouse.
2. Data source view: exposes the information being captured,
stored, and managed by operational systems
3. Data warehouse view: fact tables and dimension tables
4. Business query view: perspectives of data in the warehouse
from the view of end-user.
21
The Process of Data Warehouse Design
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step before
proceeding to the next.
 Spiral: rapid generation of increasingly functional systems, short turn
around time, quick turn around.
 Typical data warehouse design process:
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record.
22
Data warehouse software architecture
Multi-Tiered Architecture
Other
sources
Operational
DBs
Metadata
Extract
Transform
Load ETL
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
23
Data warehouse software architecture
Three-Tier Data Warehouse Architecture
 Enterprise warehouse:
collects all of the information about subjects
spanning the entire organization.
 Data Mart:
Other
sourc
es
Operational
DBs
Metadata
Extract
Transform
Load ETL
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serv
e
Analysis
Query
Reports
Data mining
Data Marts
a subset of corporate-wide data that is of value to a specific groups of
users.
 Its scope is confined to specific, selected groups, such as marketing data mart.
 Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse:
 A set of views over operational databases.
 Only some of the possible summary views may be materialized.
24
Data Warehouse Development:
A Recommended Approach
Multi-Tier
Data
Warehouse
Distributed
Data Marts
Data
Mart
Data
Mart
Model refinement
Enterprise
Data
Warehouse
Model refinement
Define a high-level corporate data model
25
Building Data Warehouses
Approaches to Building Warehouses
… OLAP Server Architectures
Relational OLAP (ROLAP):
relational database system tuned for star schemas, e.g., using special index structures such as:
 “Bitmap indexes” (for each key of a dimension table, e.g., bar name, a bit-vector telling which tuples of the fact table
have that value).
 Materialized views = answers to general queries from which more specific queries can be answered with less work
than if we had to work from the raw data.
Use relational or extended-relational DBMS to store and manage warehouse data and OLAP
middle ware to support missing pieces.
Include optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
Multidimensional OLAP (MOLAP) - A specialized model based on a “cube”
view of data.
Array-based multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) - User flexibility,
e.g., low level: relational, high-level: array.
Specialized SQL servers - specialized support for SQL queries over star, snowflake
schemas
26
Building Data Warehouses
Metadata Repository
Meta data are the data
defining warehouse objects
 Description the structure of the warehouse - schema, view, dimensions,
hierarchies, derived data defn, data mart locations and contents
 Operational meta-data - data lineage (history of migrated data and
transformation path), currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Data related to system performance - warehouse schema, view and derived data
definitions
 Business data - business terms and definitions, ownership of data, charging
policies
27
Building Data Warehouses
Data Warehouse Back-End Tools and Utilities
Other
sourc
es
Operational
DBs
Metadata
Extract
Transform
Load ETL
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serv
e
Analysis
Query
Reports
Data mining
Data Marts
Data Extraction: get data from multiple, heterogeneous, and
external sources
Data cleaning: detect errors in the data and rectify them when
possible
Data Transformation: convert data from legacy or host format to
warehouse format
Load: sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the
warehouse
28
From data warehousing to data mining
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing - supports querying, basic statistical
analysis, and reporting using crosstabs, tables, charts and
graphs
Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
29
From data warehousing to data mining
From On-Line Analytical Processing OLAP
to On Line Analytical Mining OLAM
 Why online analytical mining?
 High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and
tasks.
 Architecture of OLAM
30
From data warehousing to data mining
An OLAM Architecture
Mining query
On Line Analytical Mining
Mining result
Layer4
User Interface
User GUI API
OLAM
Engine
Data Cube API
OLAP
Engine
MDDB
Meta Data
Database API
Filtering &
Integration
Databases
Data cleaning
Data integration
Filtering
Data
Warehouse
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data Repository
31