Data Warehousing
What is a Data Warehouse, Multidimensional Data Model,
Data Warehouse Architecture, Data Warehouse Implementation,
From Data Warehouse to Data Mining.
May 7, 2017
Data Mining: Concepts and Techniques
What is Data Warehouse?
A decision support database that is maintained separately from the
organization’s operational database

Support information processing by providing a solid platform of
consolidated, historical data for analysis.
Data warehousing provides architectures and tools for business
executives to systematically organize, understand and use their
data to make strategic decisions

Data warehousing: the process of constructing and using data warehouses


“A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.” (William H. Inmon)
The four keywords subject-oriented, integrated, time-variant and
nonvolatile, distinguish data warehouses from other data repository
systems such as relational database systems, transaction processing
systems, and file systems.


Time-variant: data are stored to provide information from a historical
perspective.
Nonvolatile: a data warehouse is always a physically separate store of
data transformed from the application data found in the operational
environment.
Data Warehouse—Subject-Oriented

Subject-oriented: a data warehouse is organized around major subjects,
such as customer, product, and sales

Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing

Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
Data Warehouse—Integrated


Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding structures,
  attribute measures, etc. among different data sources
• E.g., hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant

The time horizon for the data warehouse is significantly
longer than that of operational systems



• Operational database: current value data
• Data warehouse data: provide information from a historical
  perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain a
  “time element”
Data Warehouse—Nonvolatile

A physically separate store of data transformed from the
operational environment

Operational update of data does not occur in the data
warehouse environment

Does not require transaction processing, recovery,
and concurrency control mechanisms

Requires only two operations in data accessing:
initial loading of data and access of data
Differences between Operational Database Systems and
Data Warehouses


The major task of on-line operational database systems is
to perform on-line transaction and query processing.
These systems are called on-line transaction
processing (OLTP) systems.
Data warehouse systems, on the other hand, serve users
or knowledge workers in the role of data analysis and
decision making. Such systems can organize and present
data in various formats in order to accommodate the
diverse needs of the different users. These systems are
known as on-line analytical processing (OLAP) systems.
Differences between Operational Database Systems and
Data Warehouses
1. Users and system orientation: An OLTP system is
customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology
professionals.
An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers,
executives, and analysts.
2. Data contents: An OLTP system manages current data that,
typically, are too detailed to be easily used for decision
making.
An OLAP system manages large amounts of historical
data.
Differences between Operational Database Systems and
Data Warehouses
3. Database design: An OLTP system usually
adopts an entity-relationship (ER) data model
and an application-oriented database design.
An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database
design.
Differences between Operational Database Systems and
Data Warehouses
4. View: An OLTP system focuses mainly on the
current data within an enterprise or department,
without referring to historical data or data in different
organizations.
An OLAP system often spans multiple versions
of a database schema, due to the evolutionary
process of an organization.
OLAP systems also deal with information that originates from
different organizations, integrating information
from many data stores. Because of their huge
volume, OLAP data are stored on multiple
storage media.
Differences between Operational Database Systems and
Data Warehouses
5. Access patterns:
OLTP: short, atomic transactions (the system
requires concurrency control and recovery
mechanisms).
OLAP: mostly read-only operations (historical rather
than up-to-date information), many of them complex queries.
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional, integrated,
                    isolated                      consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
Multidimensional Data Model

A data warehouse is based on a multidimensional data model which
views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions

Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
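As a rough illustration of the lattice described above, the following sketch enumerates every cuboid of a small cube by taking all subsets of its dimensions. The function name `cuboid_lattice` and the four dimension names are assumptions chosen for this example, not part of any OLAP product.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by its size (0-D apex up to n-D base)."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }

# Hypothetical dimensions for a sales cube.
dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)

# With no concept hierarchies, n dimensions give 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
```

The empty tuple stands for the apex cuboid (everything summarized into one cell), and the full tuple stands for the base cuboid.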
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier;
             item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier;
             time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a
set of dimension tables

Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake

Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, state_or_province, country
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, state_or_province, country
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax in DMQL


Cube Definition
define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three Categories

Distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
E.g., count(), sum(), min(), max()

Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
E.g., avg(), min_N(), standard_deviation()

Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
E.g., median(), mode()
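The three categories can be checked on a toy data set. The sketch below uses made-up numbers split into three partitions: sum() combines exactly from per-partition sums, avg() is recovered from two distributive values (sum and count), and median() cannot in general be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical measure values, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
sum_from_parts = sum(partial_sums)

# Algebraic: avg() is computed from two distributive aggregates, sum and count.
partial_counts = [len(p) for p in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the true median here.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(sum_from_parts, avg_from_parts, median_of_medians, true_median)
```

For this data the median of the partition medians differs from the true median, which is why a holistic measure needs access to the underlying values rather than a fixed-size summary.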
DMQL to SQL
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price),
       SUM(s.number_of_units_sold)
FROM time t, item i, branch b, location l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key AND
      s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
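To see the star-join query above run end to end, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table contents are invented, the dimension tables are cut down to one attribute each, and the ORDER BY clause and column aliases are additions made only so the output is deterministic and readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    number_of_units_sold INTEGER, price REAL
);
-- Hypothetical sample rows.
INSERT INTO time VALUES (1, 2004);
INSERT INTO item VALUES (1, 'TV'), (2, 'phone');
INSERT INTO branch VALUES (1, 'B1');
INSERT INTO location VALUES (1, 'Vancouver');
INSERT INTO sales VALUES
    (1, 1, 1, 1, 2, 100.0),
    (1, 1, 1, 1, 3, 100.0),
    (1, 2, 1, 1, 1, 50.0);
""")

query = """
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price) AS dollars_sold,
       SUM(s.number_of_units_sold) AS units_sold
FROM time t, item i, branch b, location l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key AND
      s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
ORDER BY s.item_key
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row is one cell of the base cuboid: the four dimension keys followed by the two measures.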
Concept Hierarchies

A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to
higher-level, more general concepts
A Concept Hierarchy: Dimension (location)
Levels, from most general to most specific: all, region, country, city, office
all:     all
region:  Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city:    Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office:  L. Chan, ..., M. Wind (Vancouver)
A concept hierarchy that is a total or partial
order among attributes in a database schema is
called a schema hierarchy.
Concept hierarchies may also be defined by grouping values
for a given dimension or attribute, resulting in a set-grouping
hierarchy, e.g., {1..10} < inexpensive
View of Warehouses and Hierarchies
OLAP operations in the Multidimensional data model.


Roll-up: The roll-up operation performs
aggregation on a data cube either by climbing up
a concept hierarchy for a dimension or by
dimension reduction.
Drill-down: It is the reverse of roll-up. It
navigates from less detailed data to more
detailed data. Drill-down can be realized by either
stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
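Both forms of roll-up can be sketched on a tiny cube held in a Python dictionary. The cell values, the `city_to_country` mapping, and the helper names below are assumptions made for illustration. Drill-down is the reverse direction and needs the more detailed cells, so it is not derivable from the rolled-up result alone.

```python
from collections import defaultdict

# Hypothetical base cells: (city, item) -> dollars_sold.
sales = {
    ("Vancouver", "TV"): 100,
    ("Toronto", "TV"): 150,
    ("Frankfurt", "TV"): 80,
    ("Vancouver", "phone"): 40,
}

# One step of the location concept hierarchy: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}

def roll_up_location(cells, mapping):
    """Roll up by climbing the hierarchy on the first dimension (city -> country)."""
    out = defaultdict(int)
    for (city, item), value in cells.items():
        out[(mapping[city], item)] += value
    return dict(out)

def drop_dimension(cells, axis):
    """Roll up by dimension reduction: aggregate away the dimension at `axis`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

by_country = roll_up_location(sales, city_to_country)
by_item = drop_dimension(sales, 0)
print(by_country)
print(by_item)
```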
Fig. 3.10 Typical OLAP Operations
OLAP operations in the Multidimensional data model.


Slice and dice: The slice operation performs a
selection on one dimension of the given cube,
resulting in a subcube. The dice operation
defines a subcube by performing a selection on
two or more dimensions.
Pivot (rotate): Pivot is a visualization operation
that rotates the data axes in view in order to
provide an alternative presentation of the data.
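A minimal sketch of these three operations, assuming the cube is stored as a list of cell records. The records and the function names `slice_cube`, `dice_cube`, and `pivot` are illustrative only.

```python
# Hypothetical cube cells with three dimensions and one measure.
cells = [
    {"time": "Q1", "item": "TV", "location": "Toronto", "dollars_sold": 100},
    {"time": "Q1", "item": "phone", "location": "Vancouver", "dollars_sold": 40},
    {"time": "Q2", "item": "TV", "location": "Vancouver", "dollars_sold": 120},
    {"time": "Q2", "item": "phone", "location": "Toronto", "dollars_sold": 60},
]

def slice_cube(cells, dim, value):
    """Slice: select on a single dimension."""
    return [c for c in cells if c[dim] == value]

def dice_cube(cells, criteria):
    """Dice: select on two or more dimensions; criteria maps dim -> allowed values."""
    return [c for c in cells if all(c[d] in allowed for d, allowed in criteria.items())]

def pivot(cells, row_dim, col_dim, measure):
    """Pivot: lay out the measure with row_dim on rows and col_dim on columns."""
    table = {}
    for c in cells:
        row = table.setdefault(c[row_dim], {})
        row[c[col_dim]] = row.get(c[col_dim], 0) + c[measure]
    return table

q1 = slice_cube(cells, "time", "Q1")
tv_both_quarters = dice_cube(cells, {"time": {"Q1", "Q2"}, "item": {"TV"}})
item_by_location = pivot(cells, "item", "location", "dollars_sold")
location_by_item = pivot(cells, "location", "item", "dollars_sold")  # axes rotated
```

The two pivot calls hold the same numbers; only the axes are swapped, which is all that rotation changes.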

Other operations
• Drill across: involving (across) more than one fact table
• Drill through: through the bottom level of the cube to
  its back-end relational tables (using SQL)
A Star-Net Query Model
(Figure: star-net query model.) Each circle is called a footprint.
Chapter 3: Data Warehousing and
OLAP Technology: An Overview

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining
Data Warehouse Architecture
Four different views regarding the design of a data
warehouse must be considered:
• The top-down view
• The data source view
• The data warehouse view
• The business query view
Data Warehouse Architecture
The top-down view allows the selection of the
relevant information necessary for the data
warehouse.
The data source view exposes the information
being captured, stored, and managed by
operational systems
The data warehouse view includes fact tables and
dimension tables. It represents the information
that is stored inside the data warehouse.
Data Warehouse Architecture
The business query view is the perspective of data
in the data warehouse from the viewpoint of the end
user.
The process of data warehouse design:
A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both. The top-down
approach starts with the overall design and planning.
It is useful in cases where the technology is mature and
well known, and where the business problems that must be solved
are clear and well understood.
The bottom-up approach starts with experiments and
prototypes. This is useful in the early stage of business
modeling and technology development.
It allows an organization to move forward at considerably
less expense and to evaluate the benefits of the technology.

From a software engineering point of view
• Waterfall: structured and systematic
  analysis at each step before proceeding
  to the next
• Spiral: rapid generation of increasingly
  functional systems, with short turnaround
  time
The warehouse design consists of the
following steps:
1. Choose a business process to model. If the business process is
organizational and involves multiple complex object collections, a data
warehouse model should be followed.
2. Choose the grain of the business process. The grain is the
fundamental, atomic level of data to be represented in the fact table for this
process.
3. Choose the dimensions that will apply to each
fact table record.
4. Choose the measures that will populate each fact
table record.
Three-Tier data warehouse architecture
(Figure: three-tier data warehouse architecture.)
Data Sources: operational DBs, other sources
  Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
Data Storage: Data Warehouse, Data Marts
OLAP Engine: OLAP Server (serves the front end)
Front-End Tools: Analysis, Query, Reports, Data mining

Three-Tier data warehouse architecture
Data warehouses often adopt a three-tier architecture.
1. The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed
data into the bottom tier from operational databases or other external
sources. These tools and utilities perform data extraction, cleaning, and
transformation.
2. The middle tier is an OLAP server that is typically implemented
using either a relational OLAP (ROLAP) model, that is, an extended
relational DBMS, or a multidimensional OLAP (MOLAP) model, that is, a
special-purpose server that directly implements multidimensional data and
operations.
3. The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and data mining tools.
Three Data Warehouse Models
Enterprise Warehouse:
An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It typically contains detailed data as well
as summarized data.
An enterprise data warehouse may be implemented on traditional
mainframes, computer superservers, or parallel architecture platforms.
Data Mart:
A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects.
The implementation cycle of a data mart is more likely to be measured in
weeks rather than months or years. Depending on the source of data, data
marts can be categorized as independent or dependent.
Virtual Warehouse:
A virtual warehouse is a set of views over operational databases. It is easy to
build but requires excess capacity on operational database servers.
Data Warehouse Back-end Tools and Utilities
• Data extraction, which typically gathers data from
  multiple, heterogeneous, and external sources
• Data cleaning, which detects errors in data and
  rectifies them when possible
• Data transformation, which converts data from
  legacy or host format to warehouse format
• Load, which sorts, summarizes, consolidates,
  computes views, checks integrity, and builds indices
  and partitions
• Refresh, which propagates the updates from the data
  sources to the warehouse
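The back-end steps listed above can be sketched in a few lines. Everything in this example is hypothetical: the two source record layouts, the field names, and the fixed EUR_TO_USD rate are assumptions chosen to echo the hotel price example (currency and breakfast encoded differently per source), not a description of any real tool.

```python
# Extraction: two hypothetical sources with inconsistent naming and encodings.
source_a = [{"hotel": "Alpha", "price": "100", "currency": "USD", "breakfast": "Y"}]
source_b = [{"HotelName": "beta ", "Price_EUR": 90.0, "incl_breakfast": False}]

EUR_TO_USD = 1.1  # assumed fixed rate, for illustration only

def transform_a(rec):
    """Clean and convert a source A record to the warehouse format."""
    return {
        "hotel_name": rec["hotel"].strip().title(),
        "price_usd": float(rec["price"]),
        "breakfast_included": rec["breakfast"] == "Y",
    }

def transform_b(rec):
    """Clean and convert a source B record to the warehouse format."""
    return {
        "hotel_name": rec["HotelName"].strip().title(),
        "price_usd": round(rec["Price_EUR"] * EUR_TO_USD, 2),
        "breakfast_included": bool(rec["incl_breakfast"]),
    }

def load(warehouse, rows):
    """Load: append the converted rows and keep the store sorted by name."""
    warehouse.extend(rows)
    warehouse.sort(key=lambda r: r["hotel_name"])

warehouse = []
load(warehouse, [transform_a(r) for r in source_a])
load(warehouse, [transform_b(r) for r in source_b])
print(warehouse)
```

After loading, both records share one naming convention, one currency, and one boolean encoding, which is the consistency that integration is meant to provide.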
Metadata Repository


Metadata are data about data. When used in a data
warehouse, metadata are the data that define the
warehouse objects.
A metadata repository should contain the following:
• A description of the structure of the data warehouse, which
  includes the warehouse schema, views, dimensions, hierarchies,
  and derived data definitions, as well as data mart locations and
  contents
• Operational metadata, which include data lineage (history of
  migrated data and the sequence of transformations applied to it),
  currency of data, and monitoring information
• The algorithms used for summarization, which include
  measure and dimension definition algorithms, data on granularity,
  partitions, subject areas, aggregation, summarization, and
  predefined queries and objects
• The mapping from the operational environment to the
  data warehouse, which includes source databases and their
  contents, gateway descriptions, data partitions, data extraction,
  cleaning, transformation rules and defaults, data refresh and
  purging rules, and security
• Data related to system performance, which include indices
  and profiles that improve data access and retrieval performance
• Business metadata, which include business terms and
  definitions, data ownership information, and charging policies
Types of OLAP Servers: ROLAP vs. MOLAP vs. HOLAP
OLAP servers present business users with
multidimensional data from data warehouses or data
marts, without concerns regarding how or where the
data are stored.
Implementations of a warehouse server for OLAP
processing include the following:
• Relational OLAP (ROLAP) servers: These are the intermediate
  servers that stand in between a relational back-end server and client
  front-end tools. They use a relational or extended-relational DBMS to
  store and manage warehouse data.
• Multidimensional OLAP (MOLAP) servers: These servers support
  multidimensional views of data through array-based
  multidimensional storage engines. They map multidimensional
  views directly to data cube array structures.
• Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP
  and MOLAP technology, benefiting from the greater scalability of
  ROLAP and the faster computation of MOLAP.
• Specialized SQL servers: To meet the growing demand for OLAP
  processing in relational databases, some database system vendors implement
  specialized SQL servers that provide advanced query language
  and query processing support for SQL queries over star and
  snowflake schemas in a read-only environment.
Data warehouse implementation


Data warehouses contain huge volumes of data.
OLAP servers demand that decision support
queries be answered in the order of seconds.
Multidimensional analysis therefore depends on the
efficient computation of aggregations across many
sets of dimensions. In SQL terms, these aggregations
are referred to as group-by's. Each group-by can be
represented by a cuboid, where the set of group-by's
forms a lattice of cuboids defining a data cube.
Cube Operation

Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales

Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
Need to compute the following group-bys:
(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year),
()
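What the cube by operator asks for can be emulated naively in a few lines: one aggregation per subset of the three dimensions. The rows and the function name `compute_cube` are assumptions for illustration. This sketch rescans the base data for every group-by, whereas practical cube computation methods share work between cuboids.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of the SALES table.
rows = [
    {"item": "TV", "city": "Toronto", "year": 2004, "amount": 100},
    {"item": "TV", "city": "Vancouver", "year": 2004, "amount": 150},
    {"item": "phone", "city": "Toronto", "year": 2005, "amount": 40},
]

def compute_cube(rows, dims, measure):
    """Return {group_by: {cell: SUM(measure)}} for every subset of dims."""
    cube = {}
    for k in range(len(dims), -1, -1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for r in rows:
                totals[tuple(r[d] for d in group_by)] += r[measure]
            cube[group_by] = dict(totals)
    return cube

cube = compute_cube(rows, ("item", "city", "year"), "amount")
for group_by, cells in cube.items():
    print(group_by, cells)
```

The empty group-by `()` is the apex cuboid and holds the single grand total.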
Lattice of cuboids making a 3-D data cube
0-D (apex) cuboid: ()
1-D cuboids: (city), (item), (year)
2-D cuboids: (city, item), (city, year), (item, year)
3-D (base) cuboid: (city, item, year)
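Data Cube = $2^n$ cuboids (n dimensions, no hierarchies)

How many cuboids are there in an n-dimensional cube where dimension i has $L_i$ levels?

$$T = \prod_{i=1}^{n} (L_i + 1)$$

One is added to $L_i$ to include the virtual top level, all.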
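A short worked check of the cuboid count, assuming the usual reading that each dimension contributes its number of hierarchy levels plus one for the virtual top level, all. The function name `total_cuboids` and the sample level counts are illustrative.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Product over dimensions of (levels + 1); the +1 is the virtual level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each: (1 + 1) ** 3 = 2 ** 3 cuboids.
no_hierarchy = total_cuboids([1, 1, 1])

# Ten dimensions with four levels each: 5 ** 10 cuboids.
ten_dimensions = total_cuboids([4] * 10)

print(no_hierarchy, ten_dimensions)
```

The second case shows how quickly the count grows once hierarchies are present, which motivates materializing only part of the cube.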
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base
cuboid:
1. No materialization: do not precompute any of the “nonbase” cuboids.
This leads to computing expensive multidimensional aggregates on the fly,
which can be extremely slow.
2. Full materialization: precompute all of the cuboids. The resulting
lattice of computed cuboids is referred to as the full cube. This
requires a huge amount of memory.
3. Partial materialization: selectively compute a proper subset of the
whole set of possible cuboids.
Iceberg Cube

Computing only the cuboid cells whose count or other
aggregate satisfies a condition like
HAVING COUNT(*) >= minsup

Motivation
• Only a small portion of cube cells may be “above the water” in a
  sparse cube
• Only calculate “interesting” cells: data above a certain threshold
• Avoid explosive growth of the cube
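The iceberg condition can be illustrated by counting every cell of a tiny cube and keeping only those that reach the minimum support. The rows, dimension names, and function name `iceberg_cube` are assumptions. This sketch counts everything first and filters afterwards, so it shows the result of an iceberg cube rather than an efficient way to compute one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical base rows over two dimensions.
rows = [("TV", "Toronto"), ("TV", "Toronto"), ("TV", "Vancouver"), ("phone", "Toronto")]
dims = ("item", "city")

def iceberg_cube(rows, dims, minsup):
    """Keep only cells with COUNT(*) >= minsup, for every group-by over dims."""
    result = {}
    for k in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            counts = Counter(tuple(r[i] for i in idx) for r in rows)
            kept = {cell: n for cell, n in counts.items() if n >= minsup}
            if kept:
                result[tuple(dims[i] for i in idx)] = kept
    return result

cells = iceberg_cube(rows, dims, minsup=2)
print(cells)
```

With minsup = 2, the cells for phone and Vancouver stay "below the water" and are never stored.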
Indexing OLAP Data: Bitmap Index





Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for
the indexed column
not suitable for high cardinality domains
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
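The bitmap index on the base table above can be sketched with Python integers used as bit vectors, where bit i corresponds to row i of the table (counting from zero). The helper names are assumptions for this example. A query that combines two indexed columns then reduces to a single bitwise AND.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_customers(rows, bits):
    """Decode a bit vector back into the customers of the selected rows."""
    return [row[0] for i, row in enumerate(rows) if (bits >> i) & 1]

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# Region = Europe AND Type = Dealer, answered with one bitwise operation.
europe_dealers = region_index["Europe"] & type_index["Dealer"]
print(matching_customers(base_table, europe_dealers))
```

Each distinct value needs its own bit vector as long as the table, which is why the approach suits low-cardinality columns.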
Indexing OLAP Data: Join Indices



Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
Traditional indices map the values to a list of record ids
• A join index materializes the relational join in the JI file and
  speeds up the relational join
In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table.
• E.g., fact table Sales and two dimensions, city and product
• A join index on city maintains for each distinct city a list
  of R-IDs of the tuples recording the sales in the city
• Join indices can span multiple dimensions
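A join index of the kind described above can be sketched as a mapping from each dimension value to the record ids of the matching fact rows. The fact rows and the function name `build_join_index` are assumptions, and the list position of each row stands in for its R-ID.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (city, product, dollars_sold).
sales = [
    ("Toronto", "TV", 100),
    ("Vancouver", "TV", 150),
    ("Toronto", "phone", 40),
]

def build_join_index(fact_rows, column):
    """Map each distinct dimension value to the list of fact row ids that use it."""
    index = defaultdict(list)
    for rid, row in enumerate(fact_rows):
        index[row[column]].append(rid)
    return dict(index)

city_index = build_join_index(sales, 0)
product_index = build_join_index(sales, 1)

# Sales in Toronto are found without scanning or joining the whole fact table.
toronto_total = sum(sales[rid][2] for rid in city_index["Toronto"])

# Spanning two dimensions: intersect the id lists of city and product.
toronto_tv = sorted(set(city_index["Toronto"]) & set(product_index["TV"]))
print(city_index, toronto_total, toronto_tv)
```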
Efficient Processing OLAP Queries

Determine which operations should be performed on the available cuboids

Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection

Determine which materialized cuboid(s) should be selected for OLAP op.

Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?

Explore indexing structures and compressed vs. dense array structs in MOLAP
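The question of which materialized cuboids could answer the example query can be sketched as a check against the concept hierarchies. The hierarchy orderings below (item_name below brand; city below province_or_state below country) and the function names are assumptions for this illustration. A cuboid is usable if each attribute the query needs is present at the same or a finer level, or if the cuboid was already restricted to the queried value.

```python
# Concept hierarchies, each ordered from finest to coarsest level (assumed).
HIERARCHIES = [
    ["item_name", "brand"],
    ["city", "province_or_state", "country"],
    ["year"],
]

def derivable(needed, available):
    """True if `needed` is in `available` or can be rolled up from a finer level."""
    for hierarchy in HIERARCHIES:
        if needed in hierarchy:
            level = hierarchy.index(needed)
            return any(a in hierarchy[: level + 1] for a in available)
    return False

def can_answer(cuboid, group_by, conditions):
    """cuboid = (attributes, preselected), where preselected maps attr -> fixed value."""
    attrs, preselected = cuboid
    if not all(derivable(a, attrs) for a in group_by):
        return False
    for attr, value in conditions.items():
        if preselected.get(attr) == value:
            continue  # the cuboid was already filtered on this condition
        if not derivable(attr, attrs):
            return False
    return True

cuboids = {
    1: ({"year", "item_name", "city"}, {}),
    2: ({"year", "brand", "country"}, {}),
    3: ({"year", "brand", "province_or_state"}, {}),
    4: ({"item_name", "province_or_state"}, {"year": 2004}),
}
usable = [n for n, c in cuboids.items()
          if can_answer(c, ["brand", "province_or_state"], {"year": 2004})]
print(usable)
```

Cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates depends on cost estimates such as cuboid size and available indices, which this sketch does not model.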
Data Warehouse Usage

Three kinds of data warehouse applications
• Information processing
  supports querying, basic statistical analysis, and reporting
  using crosstabs, tables, charts and graphs
• Analytical processing
  multidimensional analysis of data warehouse data
  supports basic OLAP operations, slice-dice, drilling, pivoting
• Data mining
  knowledge discovery from hidden patterns
  supports associations, constructing analytical models,
  performing classification and prediction, and presenting the
  mining results using visualization tools
From On-Line Analytical Processing to On-Line Analytical Mining
On-line analytical mining (OLAM) integrates
on-line analytical processing (OLAP) with data
mining and mining knowledge in multidimensional
databases. OLAM is particularly important for the
following reasons.
• High quality of data in data warehouses:
  Most data mining tools need to work on
  integrated, consistent, and cleaned data, which
  requires costly data cleaning, data integration,
  and data transformation as preprocessing steps.
  A data warehouse built through such preprocessing
  already provides this data.
• Available information processing
  infrastructure surrounding data warehouses:
  Comprehensive information processing and data
  analysis infrastructures have been or will be
  systematically constructed surrounding data
  warehouses. These include ODBC, OLEDB, Web accessing,
  service facilities, and reporting and OLAP tools.
• OLAP-based exploratory data analysis:
  Effective data mining needs exploratory data
  analysis, e.g., mining with drilling, dicing, pivoting, etc.
• On-line selection of data mining functions:
  Integration and swapping of multiple mining functions,
  algorithms, and tasks
An OLAM System Architecture
(Figure: OLAM system architecture, top layer first.)
Layer 4, User Interface: User GUI API (mining query in, mining result out)
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine; Data Cube API
Layer 2, MDDB: MDDB, Meta Data; Filtering & Integration, Database API
Layer 1, Data Repository: Databases, Data Warehouse
  (data cleaning, data integration, filtering)