Download Principles of Knowledge Discovery in Data Summary of Last

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Entity–attribute–value model wikipedia , lookup

Clusterpoint wikipedia , lookup

Big data wikipedia , lookup

Database model wikipedia , lookup

Functional Database Model wikipedia , lookup

Transcript
Summary of Last Chapter
Principles of Knowledge
Discovery in Data
• What kind of information are we collecting?
• What are Data Mining and Knowledge Discovery?
Fall 2004
• What kind of data can be mined?
Chapter 2: Data Warehousing and OLAP
• What can be discovered?
Dr. Osmar R. Zaïane
• Is all that is discovered interesting and useful?
• How do we categorize data mining systems?
• What are the issues in Data Mining?
University of Alberta
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
• Are there application examples?
University of Alberta
1
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
2
Course Content
Chapter 2 Objectives
•
•
•
•
•
•
•
•
•
•
Introduction to Data Mining
Data warehousing and OLAP
Data cleaning
Data mining operations
Data summarization
Association analysis
Classification and prediction
Clustering
Web Mining
Similarity Search
•
Other topics if time permits
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
Realize the purpose of data warehousing.
Comprehend the data structures behind data
warehouses and understand the OLAP
technology.
Get an overview of the schemas used for
multi-dimensional data.
University of Alberta
3
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
4
Data Warehouse and OLAP
Outline
Incentive for a Data Warehouse
• Businesses have a lot of data, operational data and facts.
• This data is usually in different databases and in different
physical places.
• Data is available (or archived), but in different formats and
locations. (heterogeneous and distributed).
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Decision makers need to access information (data that has been
summarized) virtually on one single site.
• This access needs to be fast regardless of the size of the data, and
how old the data is.
• Are there issues related to data cube technology?
• Can we mine data warehouses?
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
5
1970s
ys
nal
ls
oo
is T
•Data Analyst
Flexible integrated
spreadsheets.
Slow access to
operational data
Principles of Knowledge Discovery in Data
g
nd
ssin
g a roce
P
sin
hou cal
are alyti
ta W An
Da -Line
On
A
ata
•Data Analyst
Inflexible and
non-integrated
tools
D
op
skt
De
al
nu
 Dr. Osmar R. Zaïane, 1999-2004
s
em
yst
d
ase rt S
l-b po
ina up
rm n S
Te cisio
De
a
dM
an
tch ng
Ba porti
Re
•Statistician
•Computer scientist
Difficult and limited
queries highly
specific to some
distinctive needs
•Executive
Integrated tools
Data Mining
University of Alberta
University of Alberta
6
• A data warehouse consolidates different data sources.
• A data warehouse is a database that is different and maintained
separately from an operational database.
• A data warehouse combines and merges information in a consistent
database (not necessarily up-to-date) to help decision support.
1990s
1980s
Principles of Knowledge Discovery in Data
What Is Data Warehouse?
Evolution of Decision Support Systems
1960s
 Dr. Osmar R. Zaïane, 1999-2004
7
Decision support systems access data warehouse and
do not need to access operational databases Î do
not unnecessarily over-load operational databases.
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
8
Definitions
Definitions (con’t)
Data Warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management’s
decision making process. (W.H. Inmon)
Data Warehousing is the process of constructing and using
data warehouses.
Subject oriented: oriented to the major subject areas of the corporation that
have been defined in the data model.
A corporate data warehouse collects data about subjects
spanning the whole organization. Data Marts are specialized,
single-line of business warehouses. They collect data for a
department or a specific group of people.
Integrated: data collected in a data warehouse originates from different
heterogeneous data sources.
Time-variant: The dimension “time” is all-pervading in a data warehouse.
The data stored is not the current value, but an evolution of the value in time.
Non-volatile: update of data does not occur frequently in the data
warehouse. The data is loaded and accessed.
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
9
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
10
Data Warehouse and OLAP
Outline
Building a Data Warehouse
• What is a data warehouse and what is it for?
Option 1:
Consolidate Data Marts
Data Mart
Data Mart
Data Mart
Corporate
Data Warehouse
Data Mart
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
Option 2:
Build from
scratch
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
Corporate data
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
• Can we mine data warehouses?
University of Alberta
11
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
12
Construction of Data Warehouse
Based on Multi-dimensional Model
Describing the Organization
We sell products in various
markets, and we measure our
performance over time
• Think of it as a cube with labels
on each edge of the cube.
• The cube doesn’t just have 3
dimensions, but may have many
dimensions (N).
• Any point inside the cube is at
the intersection of the coordinates
defined by the edge of the cube.
• A point in the cube may store
values (measurements) relative to
the combination of the labeled
dimensions.
e
Products
Data Warehouse Designer
 Dr. Osmar R. Zaïane, 1999-2004
Ti
m
We sell Products in various
Markets, and we measure our
performance over Time
Markets
Business Manager
Principles of Knowledge Discovery in Data
University of Alberta
13
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
14
Data Warehouse and OLAP
Outline
Concept-Hierarchies
• What is a data warehouse and what is it for?
Dimensions are hierarchical by nature: total orders or partial orders
Example: Location(continent Î country Î province Î city)
Time(yearÎquarterÎ(month,week)Îday)
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
Industry
Dimensions: Product, Region, week
Hierarchical summarization paths
Country
Category Region
Product
City
Office
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
Year
• What is the general architecture of a data warehouse?
Quarter
• How can we implement a data warehouse?
Month Week
• Are there issues related to data cube technology?
Day
University of Alberta
• Can we mine data warehouses?
15
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
16
On-Line Transaction Processing
On-Line Analytical Processing
• On-line analytical processing (OLAP) is essential for
decision support.
• OLAP is supported by data warehouses.
• Data warehouse consolidation of operational databases.
• The key structure of the data warehouse always contains
some element of time.
•Owing to the hierarchical nature of the dimensions, OLAP
operations view the data flexibly from different perspectives
(different levels of abstractions).
• Database management systems are typically used for on-line
transaction processing (OLTP)
• OLTP applications normally automate clerical data
processing tasks of an organization, like data entry and
enquiry, transaction handling, etc. (access, read, update)
• Database is current, and consistency and recoverability are
critical. Records are accessed one at a time.
¾OLTP operations are structured and repetitive
¾OLTP operations require detailed and up-to-date data
¾OLTP operations are short, atomic and isolated transactions
•OLAP operations:
DW tend to be in the order of Tb
Databases tend to be hundreds of Mb to Gb.
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
17
 Dr. Osmar R. Zaïane, 1999-2004
Time (Quarters)
Minivan.
Location
Edmonton
Calgary
Jul Aug Sep
Compact
(city, AB)Lethbridge
Mid-size
Family
Red Deer
Category
Minivan.
Drill down on Q3
OLTP
OLAP
Clerk, IT professional
Knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
Roll-up on Location
Time (Quarters)
Q1 Q2 Q3 Q4
Maritimes
Compact
Quebec
(province, Ontario
Mid-size
Prairies
Canada)Western
Pr
Family
Time (Quarters)
access
Category
Minivan.
Category
Pivot
Time (Quarters)
er
De
Red ge
id
hbr
Le t g a r y n
l
Ca onto
m
Ed
Q1 Q2 Q3 Q4
Edmonton
Location Calgary
(city, AB)Lethbridge
Mid-size
Red Deer
 Dr. Osmar R. Zaïane, 1999-2004
usage
Location
Slice on Category=mid-size
Q1
Q2
Q3
Q4
Location
(city, AB)
Mid-size
Principles of Knowledge Discovery in Data
University of Alberta
users
Time (Months, Q3)
Category
Principles of Knowledge Discovery in Data
18
OLTP vs OLAP
Data Warehouse OLAP Example
Q1 Q2 Q3 Q4
Edmonton
Location Calgary
Compact
(city, AB)Lethbridge
Mid-size
Red Deer
Family
• roll-up (increase the level of abstraction)
• drill-down (decrease the level of abstraction)
• slice and dice (selection and projection)
• pivot (re-orient the multi-dimensional view)
• drill-through (links to the raw data)
complex query
Category
(Source: JH)
University of Alberta
19
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
20
Why Do We Separate DW From DB?
• Performance reasons:
• What is a data warehouse and what is it for?
– OLAP necessitates special data organization
that supports multidimensional views.
– OLAP queries would degrade operational DB.
– OLAP is read only.
– No concurrency control and recovery.
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Decision support requires historical data.
• Decision support requires consolidated data.
 Dr. Osmar R. Zaïane, 1999-2004
Data Warehouse and OLAP
Outline
Principles of Knowledge Discovery in Data
University of Alberta
• Are there issues related to data cube technology?
• Can we mine data warehouses?
21
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
22
University of Alberta
Three-tier Architecture
Three-tier Architecture
Monitor
&
Integrator
metadata
External
sources
Operational
DBs
Extract
Transform
Load
Refresh
Data Sources
 Dr. Osmar R. Zaïane, 1999-2004
Operational
DBs
Extract
Transform
Load
Refresh
 Dr. Osmar R. Zaïane, 1999
OLAP
Server
Analysis
Serve
Data
Warehouse
Query
Reports
Data
mining
Client Tools
Data Marts
Principles of Knowledge Discovery in Databases
University of Alberta
23
• Data sources are often the operational systems,
providing the lowest level of data.
• Data sources are designed for operational use, not for
decision support, and the data reflect this fact.
• Multiple data sources are often from different systems
run on a wide range of hardware and much of the
software is built in-house or highly customized.
• Multiple data sources introduce a large number of
issues -- semantic conflicts.
Analysis
Serve
Query
Reports
Data
mining
Client Tools
Data sources
External
sources
Data
sources
OLAP Server
Data
Warehouse
Monitor
&
Integrator
metadata
Data Marts
Principles of Knowledge Discovery in Data
University of Alberta
23
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
24
Three-tier Architecture
Data Cleaning
Monitor
&
Integrator
metadata
External
sources
Operational
DBs
OLAP
Server
Three-tier Architecture
Load and Refresh
Analysis
Serve
Extract
Transform
Load
Refresh
Query
Reports
Data
Warehouse
Data
mining
Client Tools
Data
sources
Principles of Knowledge Discovery in Databases
University of Alberta
Principles of Knowledge Discovery in Data
Operational
DBs
Monitor
External
sources
Operational
DBs
Extract
Transform
Load
Refresh
Data
sources
 Dr. Osmar R. Zaïane, 1999
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
23
26
University of Alberta
Three-tier Architecture
Monitor
&
Integrator
metadata
Analysis
External
sources
Integrator
Query
Reports
Data
mining
Client Tools
Data Marts
Principles of Knowledge Discovery in Databases
University of Alberta
• Data shipping: using triggers to update snapshot log table and propagate
the updated data to the warehouse.
• Transaction shipping: shipping the updates in the transaction log.
Serve
Data
Warehouse
Data Marts
Principles of Knowledge Discovery in Databases
– How to refresh.
25
OLAP
Server
Data
mining
Client Tools
• Determined by usage, types of data source, etc.
Three-tier Architecture
Monitor
&
Integrator
Analysis
Query
Reports
Data
Warehouse
• Loading the warehouse includes some other processing tasks:
– Checking integrity constraints, sorting, summarizing, build
indices, etc.
• Refreshing a warehouse means propagating updates on source data
to the data stored in the warehouse.
– When to refresh.
University of Alberta
metadata
OLAP
Server
Serve
Extract
Transform
Load
Refresh
 Dr. Osmar R. Zaïane, 1999
23
• Data cleaning is important to warehouse.
– Operational data from multiple sources are often
noisy (may contain data that is unnecessary for DS).
• Three classes of tools.
– Data migration: allows simple data transformation.
– Data scrubbing: uses domain-specific knowledge
to scrub data.
– Data auditing: discovers rules and relationships by
scanning data (detect outliers).
 Dr. Osmar R. Zaïane, 1999-2004
External
sources
Data
sources
Data Marts
 Dr. Osmar R. Zaïane, 1999
Monitor
&
Integrator
metadata
University of Alberta
Operational
DBs
Extract
Transform
Load
Refresh
Data
sources
 Dr. Osmar R. Zaïane, 1999
23
OLAP
Server
Analysis
Serve
Data
Warehouse
Query
Reports
Data
mining
Client Tools
Data Marts
Principles of Knowledge Discovery in Databases
University of Alberta
23
• Receive changes from the monitors
• Detect changes to an information source that are of interest
to the warehouse.
– Make the data conform to the conceptual
schema used by the warehouse
– Define triggers in a full-functionality DBMS.
– Examine the updates in the log file.
• Integrate the changes into the warehouse
– Write programs for legacy systems.
– Merge the data with existing data already
present
• Propagate the change in a generic form to the integrator.
– Resolve possible update anomalies
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
27
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
28
Three-tier Architecture
Monitor
&
Integrator
metadata
Metadata Repository
External
sources
Operational
DBs
Extract
Transform
Load
Refresh
Data
sources
 Dr. Osmar R. Zaïane, 1999
• Administrative metadata
OLAP
Server
Three-tier Architecture
Metadata Repository
Serve
Data
Warehouse
Query
Reports
Data
mining
Data Marts
University of Alberta
University of Alberta
 Dr. Osmar R. Zaïane, 1999
23
Data
Warehouse
Analysis
Query
Reports
Data
mining
Client Tools
Data Marts
Principles of Knowledge Discovery in Databases
University of Alberta
23
• Business data
Source database and their contents
Gateway descriptions
Warehouse schema, view and derived data definitions
Dimensions and hierarchies
Pre-defined queries and reports
Data mart locations and contents
Data partitions
Data extraction, cleansing, transformation rules,
defaults
– Data refresh and purge rules
– User profiles, user groups
– Security: user authorization, access control
Principles of Knowledge Discovery in Data
Operational
DBs
OLAP
Server
Serve
Extract
Transform
Load
Refresh
Data
sources
–
–
–
–
–
–
–
–
 Dr. Osmar R. Zaïane, 1999-2004
External
sources
Client Tools
Principles of Knowledge Discovery in Databases
Monitor
&
Integrator
metadata
Analysis
– business terms and definitions
– ownership of data
– charging policies
• Operational metadata
– data lineage: history of migrated data and sequence of
transformations applied
– currency of data: active, archived, purged
– monitoring information: warehouse usage statistics,
error reports, audit trails
29
Data Warehouse and OLAP
Outline
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
30
Data Warehouse Design
Most data warehouses use a star schema to represent the multidimensional model.
• What is a data warehouse and what is it for?
Each dimension is represented by a dimension-table that
describes it.
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
A fact-table connects to all dimension-tables with a multiple
join. Each tuple in the fact-table consists of a pointer to each of
the dimension-tables that provide its multi-dimensional
coordinates and stores measures for those coordinates.
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
The links between the fact-table in the centre and the dimensiontables in the extremities form a shape like a star. (Star Schema)
• Can we mine data warehouses?
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
31
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
32
Example of Star Schema
Product
Date
Date
Month
Year
Sales Fact Table
Date
Product
Store
StoreID
City
State
Country
Region
Data Warehouses Design (con’t)
• Modeling data warehouses: dimensions & measurements
ProductNo
ProdName
ProdDesc
Category
Star schema: A single object (fact table) in the middle connected
to a number of objects (dimension tables)
Store
Each dimension is represented by one table
ÎUn-normalized (introduces redundancy).
Cust
Customer
unit_sales
dollar_sales
CustId
CustName
CustCity
CustCountry
Ex:
(Edmonton, Alberta, Canada, North America)
(Calgary, Alberta, Canada, North America)
Normalize dimension tables Î Snowflake schema
(Source: JH)
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
33
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
34
University of Alberta
Example of Snowflake Schema
Data Warehouses Design (con’t)
Year
Year
• Snowflake schema: A refinement of star schema where the
dimensional hierarchy is represented explicitly by normalizing the
dimension tables.
Product
Month
Month
Year
Date
Sales Fact Table
Date
Month
Date
Product
Store
City
• Fact constellations: Multiple fact tables share dimension tables.
State
Country
City
State
State
Country
StoreID
City
ProductNo
ProdName
ProdDesc
Category
Store
Cust
Customer
unit_sales
dollar_sales
CustId
CustName
CustCity
CustCountry
Country
Region
(Source: JH)
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
35
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
36
What Is the Best Design?
Aggregation in Data Warehouses
Multidimensional view of data in the warehouse:
Stress on aggregation of measures by one or more
dimensions
Performance benchmarking can be used to determine what is the
best design.
Snowflake schema: Easier to maintain dimension tables when
dimension table are very large (reduces overall space).
The Data Cube and
The Sub-Space Aggregates
By City
Star schema: More effective for data cube browsing (less joins):
can affect performance.
Group By
Cross Tab
Q1Q2Q3Q4
Category
Aggregate
By Category
Drama
Comedy
Horror
Drama
Comedy
Horror
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
37
 Dr. Osmar R. Zaïane, 1999-2004
Construction of Multi-dimensional
Data Cube
By Time
By Time & City
Drama
Comedy
Horror
By Category & City
By Time
Sum
Sum
Q3Q Red Deer
4
Q1 Q2
Lethbridge
Calgary
Edmonton
By Time & Category
Sum
Sum
By Category
Principles of Knowledge Discovery in Data
University of Alberta
38
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
Amount
B.C.
Prairies
Ontario
sum
Province
0-20K20-40K 40-60K60K- sum
All Amount
Algorithms, B.C.
Contracts
Air-Express
Algorithms
Truck
Order
Database
… ...
Product Line
Discipline
Time
Product
Annually
Quarterly Daily
sum
Product Item
Product Group
City
Sales Person
Country
District
Region
Division
Geography
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
39
 Dr. Osmar R. Zaïane, 1999-2004
Promotion
Principles of Knowledge Discovery in Data
Organization
University of Alberta
40
Implementation of the OLAP Server
Implementation of the OLAP Server
ROLAP: Relational OLAP - data is stored in tables in relational
database or extended-relational databases. They use an RDBMS to
manage the warehouse data and aggregations using often a star
schema.
•They support extensions to SQL
•A cell in the multi-dimensional structure is represented by a tuple.
Advantage: Scalable (no empty cells for sparse cube).
Disadvantage: no direct access to cells.
MOLAP: Multidimensional OLAP – implements the
multidimensional view by storing data in special multidimensional
data structures (MDDS)
Advantage: Fast indexing to pre-computed aggregations. Only
values are stored.
Disadvantage: Not very scalable and sparse
Ex: Essbase of Arbor
Ex: Microstrategy
Metacube (Informix)
HOLAP: Hybrid OLAP - combines ROLAP and MOLAP
technology. (Scalability of ROLAP and faster computation of MOLAP)
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
41
View of Warehouses and Hierarchies
with DBMiner
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
42
DBMiner Cube Visualization
• Importing data
• Table Browsing
• Dimension creation
• Dimension browsing
• Cube building
• Cube browsing
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
43
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
44
Example DIVE-ON Project
Example DIVE-ON Project
Construction
1
Communication
Visualization
Federated
N-dimensional
data warehouses
Local 3-dimensional
data cube
Virtual Warehouse
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
45
 Dr. Osmar R. Zaïane, 1999-2004
3
Interaction
4
CORBA+XML
CORBA+XML
Distributed
databases
2
Principles of Knowledge Discovery in Data
CAVE
VCU
Immersed
user
UIM
University of Alberta
46
Data Warehouse and OLAP
Outline
Example DIVE-ON Project
• What is a data warehouse and what is it for?
Data Source 1
Schema
Schema
DCC Shell
ORB
ORBServer
Server
DBMS
DBMS
Interface
Interface(Java
(JavaAPI)
API)
Query
Query
Engine
Engine
DBMS
DBMS
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
Query
Query
Executer
Executer
• How can we implement a data warehouse?
Query
QueryDistributor
Distributor
ORB
ORBClient
Client
 Dr. Osmar R. Zaïane, 1999-2004
Schema
Schema
• What is the multi-dimensional data model?
DCC
DCC
Schema
Schema
ORB
ORBServer
Server
SOAP
SOAPServer
Server
Data Source 2
SOAP
SOAPServer
Server
• Are there issues related to data cube technology?
• Can we mine data warehouses?
SOAP
SOAPClient
Client
Principles of Knowledge Discovery in Data
University of Alberta
47
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
48
Data Warehouse and OLAP
Outline
Issues
• What is a data warehouse and what is it for?
•
•
•
•
Scalability
Sparseness
Curse of dimensionality
Materialization of the multidimensional
data cube (total, virtual, partial)
• Efficient computation of aggregations
• Indexing
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?
49
Data Mining
‰ Data mining requires integrated, consistent and
cleaned data which data warehouses can provide.
‰ Data mining tools can interface with the OLAP
engine to take advantage of the integrated and
aggregated data, as well as the navigation power.
‰ Interactive and exploratory mining.
‰ OLAP-based mining is referred to as OLAPmining or OLAM (on-line analytical mining).
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
51
 Dr. Osmar R. Zaïane, 1999-2004
Principles of Knowledge Discovery in Data
University of Alberta
50