Data Warehousing and OLAP Technology for Data Mining

- What is a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- Further development of data cube technology
- From data warehousing to data mining
What is a Data Warehouse?

Defined in many different ways:

- A decision-support database that is maintained separately from the organization's operational database.
- It supports information processing by providing a solid platform of consolidated, historical data for analysis.
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data Warehouse—Subject-Oriented

- Organized around major subjects, such as customer, product, and sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision-support process.
Data Warehouse—Integrated

- Constructed by integrating multiple, heterogeneous data sources:
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied:
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources.
  - When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant

- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current-value data.
  - Data warehouse: provides information from a historical perspective (e.g., the past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a "time element".
Data Warehouse—Non-Volatile

- A physically separate store of data transformed from the operational environment.
- Operational updates of data do not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, or concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS

Many Organizations may collect diverse kinds of Heterogeneous Dbases such
as Multiple, Autonomous and Distributed Information sources

Integration of these are must for easy and efficient access

Traditional heterogeneous DB integration:

Build wrappers/mediators on top of heterogeneous databases

Query driven approach



When a query is posed to a client site, a meta-dictionary is used to translate
the query into queries appropriate for individual heterogeneous sites involved,
and the results are integrated into a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance

Information from heterogeneous sources is integrated in advance and stored in
warehouses for direct query and analysis
7
Data Warehouse vs. Operational DBMS

- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMSs.
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of a data warehouse system.
  - Data analysis and decision making.
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100 MB-GB                              100 GB-TB
metric               transaction throughput                 query throughput, response time
Why Separate Data Warehouse?

- We know that operational databases store huge amounts of data.
- Why not perform OLAP directly on such databases instead of spending additional time and resources to construct a separate data warehouse?
- The major reasons are...
Why Separate Data Warehouse?

- High performance for both systems:
  - The DBMS is tuned for OLTP: access methods, indexing, concurrency control, recovery.
  - The warehouse is tuned for OLAP: complex OLAP queries, multidimensional views, consolidation.
- Different functions and different data:
  - Missing data: decision support requires historical data which operational DBs do not typically maintain.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
  - Data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
Data Warehousing and OLAP Technology for Data Mining

- What is a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- Further development of data cube technology
- From data warehousing to data mining
Data Warehousing and OLAP Technology for Data Mining

- A data warehouse is based on a multi-dimensional data model, which views data in the form of a data cube.
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  - A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
- In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids

(Figure: the lattice of cuboids for the dimensions time, item, location, and supplier)

- 0-D (apex) cuboid: all (highest level of summarization)
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time,item; time,location; item,location; time,supplier; location,supplier; item,supplier
- 3-D cuboids: time,item,location; time,location,supplier; time,item,supplier; item,location,supplier
- 4-D (base) cuboid: time,item,location,supplier
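To make the lattice concrete, the following sketch enumerates every cuboid of a tiny sales fact table by grouping on each subset of the dimensions. It is only an illustration: the table contents and the dollars_sold values are made-up sample data, not part of the original slides.

# Sketch: enumerate all cuboids of a small sales cube (illustrative data and column names).
from itertools import combinations
import pandas as pd

fact = pd.DataFrame({
    "time":         ["Q1", "Q1", "Q2", "Q2"],
    "item":         ["TV", "PC", "TV", "PC"],
    "location":     ["Chicago", "Chicago", "Toronto", "Toronto"],
    "supplier":     ["S1", "S2", "S1", "S2"],
    "dollars_sold": [100.0, 250.0, 120.0, 300.0],
})

dims = ["time", "item", "location", "supplier"]
for k in range(len(dims) + 1):                 # from the 0-D (apex) cuboid up to the 4-D (base) cuboid
    for group in combinations(dims, k):
        if group:                              # a k-D cuboid is a group-by on k dimensions
            cuboid = fact.groupby(list(group))["dollars_sold"].sum()
        else:                                  # the apex cuboid summarizes everything into one value
            cuboid = fact["dollars_sold"].sum()
        print(f"{k}-D cuboid {group}:")
        print(cuboid, "\n")

The 2^4 = 16 group-by combinations printed by this loop are exactly the 16 cuboids in the lattice above.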
Conceptual Modeling of Data Warehouses

- The most popular data model for a data warehouse is a multidimensional model.
- It comes in different forms:
  - Star schema
  - Snowflake schema
  - Fact constellation schema
Star Schema

- This is the most common modeling paradigm.
- It contains a large central table, called the fact table, which contains the bulk of the data with no redundancy.
- It also has a set of smaller attendant tables, called dimension tables, one for each dimension.
Example of Star Schema

(Figure: star schema of sales; attributes such as supplier and location information are stored redundantly in the dimension tables)

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
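As a rough illustration of what the sales_star definition describes, the pandas sketch below joins a fact table with two of its dimension tables and computes the same three measures. The tables, keys, and values are made-up sample data, not the schema of any real system.

# Sketch: a star join plus aggregation, mirroring the sales_star cube above (sample data).
import pandas as pd

sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "sales_in_dollars": [100.0, 250.0, 120.0, 300.0],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "brand": ["Acme", "Zeta"], "type": ["TV", "PC"]})

star = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
cube = star.groupby(["quarter", "type"]).agg(
    dollars_sold=("sales_in_dollars", "sum"),
    avg_sales=("sales_in_dollars", "mean"),
    units_sold=("sales_in_dollars", "count"),
)
print(cube)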
Snowflake Schema

- A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Example of Snowflake Schema

(Figure: snowflake schema of sales, with the item and location dimensions normalized)

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
  - city: city_key, city, province_or_state, country
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
21
Fact Constellation

- Sophisticated applications may require multiple fact tables that share dimension tables.
- This kind of schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.
- Example: two fact tables, sales and shipping.
Example of Fact Constellation

(Figure: fact constellation with two fact tables, sales and shipping, sharing the time, item, and location dimension tables)

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type
Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Data Warehouse vs. Data Mart

- A data warehouse collects information about the subjects of the entire organization.
  - A fact constellation schema should be used.
- A data mart focuses on selected subjects, e.g., it is department-wide.
  - A star or snowflake schema can be used.
Multidimensional Data

- Sales volume as a function of product, month, and region.
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month (or Week) > Day
A Sample Data Cube

(Figure: a 3-D data cube of sales with dimensions Date (1Qtr-4Qtr), Product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), plus "sum" aggregate cells; one such cell holds the total annual sales of TVs in the U.S.A.)
Data Preprocessing

- Data Cleaning
- Data Integration and Transformation
- Data Reduction
Why Data Preprocessing?

- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data.
  - A data warehouse needs consistent integration of quality data.
Multi-Dimensional Measure of Data Quality

- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing

- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: obtain a reduced representation that is smaller in volume but produces the same or similar analytical results.
Forms of Data Preprocessing

(Figure: the forms of data preprocessing: cleaning, integration, transformation, and reduction)
Data Cleaning

Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Missing Data

- Data is not always available.
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
- Missing data may be due to:
  - equipment malfunction
  - data deleted because it was inconsistent with other recorded data
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred.
How to Handle Missing Data?

Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification—not effective when the percentage of
missing values per attribute varies considerably.

Fill in the missing value manually: tedious + infeasible?

Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter

Use the most probable value to fill in the missing value: inferencebased such as Bayesian formula or decision tree
36
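A minimal pandas sketch of two of these strategies, the global attribute mean and the per-class attribute mean; the income values and class labels are illustrative assumptions.

# Sketch: filling missing values with the attribute mean and with the per-class attribute mean (sample data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000.0, np.nan, 55_000.0, np.nan, 80_000.0],
    "class":  ["budget", "budget", "premium", "premium", "premium"],
})

# Strategy: global attribute mean
df["income_mean_fill"] = df["income"].fillna(df["income"].mean())

# Strategy: attribute mean within the same class (usually smarter)
df["income_class_fill"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(df)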
Noisy Data

- Noise: random error or variance in a measured variable.
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data? / Smoothing

- Binning method (also used for data transformation):
  - first sort the data and partition it into (equi-depth) bins;
  - then smooth by bin means, bin medians, bin boundaries, etc. (a sketch follows this list)
- Clustering: detect and remove outliers.
- Combined computer and human inspection: detect suspicious values and have a human check them.
- Regression: smooth by fitting the data to regression functions.
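To make the binning method concrete, here is a small NumPy sketch that partitions sorted values into equi-depth bins and smooths them by bin means. The price values are just sample data.

# Sketch: equi-depth binning and smoothing by bin means (illustrative data).
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))
n_bins = 3
bins = np.array_split(prices, n_bins)      # equi-depth: (roughly) the same number of values per bin

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])  # replace each value by its bin mean
for b in bins:
    print("bin:", b, "-> bin mean:", b.mean())
print("smoothed values:", smoothed)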
Cluster Analysis

(Figure: values grouped into clusters; points falling outside every cluster are treated as outliers)
Regression

(Figure: noisy points such as (X1, Y1) are smoothed onto a fitted regression line y = x + 1, mapping Y1 to Y1')
Data Integration

- Data integration: combines data from multiple sources into a coherent store.
- Schema integration:
  - integrate metadata from different sources
  - entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources may differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

- Redundant data occur often when integrating multiple databases:
  - The same attribute may have different names in different databases.
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue.
- Redundant attributes may be detected by correlation analysis (see the sketch below).
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
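As a rough illustration of correlation analysis, the NumPy sketch below computes the correlation coefficient between attribute pairs; a coefficient close to 1 suggests that one attribute may be derivable from the other and is therefore redundant. The revenue figures are made up.

# Sketch: detecting a redundant (derived) attribute via correlation analysis (sample data).
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
annual_revenue = 12 * monthly_revenue                # a "derived" attribute from another table
unrelated = np.array([3.0, 7.0, 1.0, 9.0, 4.0])

print(np.corrcoef(monthly_revenue, annual_revenue)[0, 1])   # ~1.0 -> strongly redundant
print(np.corrcoef(monthly_revenue, unrelated)[0, 1])        # much lower -> not obviously redundant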
Data Transformation

- Smoothing: remove noise from the data (binning, clustering, regression).
- Aggregation: e.g., daily sales data may be aggregated to compute monthly/yearly figures, which are then used for analysis (i.e., summarization or data cube construction).
- Generalization: low-level data replaced by higher-level data, e.g.:
  - street generalized to city/country
  - age generalized to categories such as young, middle-aged, or senior
Data Transformation

- Normalization: values are scaled to fall within a small, specified range, e.g., -1.0 to 1.0 or 0.0 to 1.0. There are three common types:
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes are constructed from the given ones to help the mining process.
Data Transformation: Normalization

- Min-max normalization:
    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- Z-score normalization:
    v' = (v - mean_A) / stand_dev_A
- Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
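A small NumPy sketch of the three normalizations above, applied to one attribute with made-up values (and an assumed new range of [0.0, 1.0] for min-max normalization).

# Sketch: min-max, z-score, and decimal-scaling normalization (illustrative values).
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization
z_score = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)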
Data Reduction Strategies

- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.
- Data reduction: obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.
- Data reduction strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Data compression
  - Numerosity reduction
Data Cube Aggregation

- Reduce the data used in a multidimensional analysis of sales by summarizing (aggregating) it to a coarser level of the cube, e.g., rolling quarterly sales up to annual sales.
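A minimal pandas sketch of this idea: quarterly sales rolled up to annual totals so that later analysis touches far fewer cells. The sales figures and column names are assumptions for illustration.

# Sketch: data cube aggregation -- roll quarterly sales up to annual sales (sample data).
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 300, 420, 380, 600],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()   # 8 rows reduced to 2
print(annual)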
Dimensionality Reduction

- Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
- Dimensionality reduction shrinks the data set by removing such attributes.
- Reduction techniques (a sketch of the first follows this list):
  - step-wise forward selection: the procedure starts with an empty set of attributes, and the best of the remaining attributes is added to the set at each step
  - step-wise backward elimination
  - combined forward selection and backward elimination
  - decision-tree induction: a tree is constructed from the given data; attributes that do not appear in the tree are declared irrelevant
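As one concrete (and deliberately simplified) way to perform step-wise forward selection, the sketch below greedily adds the attribute that most improves a least-squares R^2 score. The toy data, the scoring criterion, and the 0.01 stopping threshold are all illustrative assumptions rather than a standard recipe.

# Sketch: step-wise forward selection with a simple least-squares R^2 criterion (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                            # candidate attributes A1..A5 (columns 0..4)
y = 3 * X[:, 1] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

def r2(cols):
    # R^2 of a least-squares fit of y on the chosen columns (plus an intercept).
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = max(remaining, key=lambda c: r2(selected + [c]))
    if r2(selected + [best]) - r2(selected) < 0.01:      # stop when the best attribute adds little
        break
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)           # expected: columns 1 and 3 for this toy data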
Example of Decision Tree Induction

(Figure: a decision tree whose internal nodes test A4, A1, and A6 and whose leaves are Class 1 / Class 2)

Initial attribute set: {A1, A2, A3, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}
Data Compression

- String compression:
  - There are extensive theories and well-tuned algorithms.
  - Typically lossless.
  - But only limited manipulation is possible without expansion.
- Audio/video compression:
  - Typically lossy compression, with progressive refinement.
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
Data Compression

(Figure: lossless compression maps the original data to compressed data and back exactly, while lossy compression recovers only an approximation of the original data)
Numerosity Reduction

- Parametric methods:
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
  - Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces.
- Non-parametric methods:
  - Do not assume models.
  - Major families: histograms, clustering, sampling.
Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line.
  - Often uses the least-squares method to fit the line.
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
- Log-linear model: approximates discrete multidimensional probability distributions.
Regression Analysis and Log-Linear Models

- Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are estimated from the data at hand,
    using the least-squares criterion on the known values Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
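The least-squares estimates of α and β can be computed directly, as in the short NumPy sketch below; the (x, y) pairs are toy data. In numerosity reduction, only the two fitted parameters need to be stored in place of the raw pairs.

# Sketch: least-squares estimation of alpha and beta for Y = alpha + beta * X (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()

print("alpha =", alpha, "beta =", beta)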
Histograms

- A popular data reduction technique.
- Divide the data into buckets and store the average (or sum) for each bucket.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.

(Figure: an equal-width histogram of prices, with buckets spanning 10,000 to 90,000 and bucket counts between 0 and 40)
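A small NumPy sketch of histogram-based reduction: a price attribute is replaced by equal-width buckets, and only the per-bucket counts and means are kept. The prices are randomly generated sample data.

# Sketch: reduce a price attribute to equal-width buckets, storing only counts and bucket means (sample data).
import numpy as np

rng = np.random.default_rng(1)
prices = rng.integers(10_000, 90_000, size=1_000).astype(float)

counts, edges = np.histogram(prices, bins=8, range=(10_000, 90_000))
bucket_ids = np.digitize(prices, edges[1:-1])        # which bucket each price falls into (0..7)
bucket_means = np.array([prices[bucket_ids == b].mean() for b in range(len(counts))])

for lo, hi, c, m in zip(edges[:-1], edges[1:], counts, bucket_means):
    print(f"[{lo:.0f}, {hi:.0f}): count = {c}, mean = {m:.0f}")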
Clustering

- Partition the data set into clusters, and store only the cluster representations.
- Can be very effective if the data is clustered, but not if the data is "smeared".
- Can use hierarchical clustering, stored in multi-dimensional index tree structures.
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8.
Sampling

- Allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data.
- Choose a representative subset of the data:
  - Simple random sampling may have very poor performance in the presence of skew.
- Develop adaptive sampling methods:
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data (see the sketch below).
- Sampling may not reduce database I/Os (data is read a page at a time).
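A minimal pandas sketch contrasting simple random sampling with stratified sampling on a skewed class distribution; the customer table and the 5% "premium" class are made-up assumptions.

# Sketch: simple random sampling vs. stratified sampling on skewed data (sample data).
import pandas as pd

df = pd.DataFrame({
    "cust_id": range(1_000),
    "class":   ["budget"] * 950 + ["premium"] * 50,    # skewed: only 5% premium customers
})

srs = df.sample(n=100, random_state=0)                 # simple random sample of 100 tuples
strat = df.groupby("class", group_keys=False).sample(frac=0.1, random_state=0)  # 10% of each class

print(srs["class"].value_counts())      # the premium count fluctuates around 5
print(strat["class"].value_counts())    # exactly 95 budget and 5 premium: class percentages preserved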
Sampling

(Figures: the raw data reduced by a simple random sample and by a cluster/stratified sample)
OLAP

What is On-Line Analytical Processing (OLAP)?
- OLAP is a technology that uses a multidimensional view of aggregate data to provide quick access to strategic information for further analysis.
- It allows fast and efficient analysis of data by turning raw data into information that can be understood by users and manipulated in various ways.
- OLAP applications require multidimensional views of data and, usually, calculation-intensive capabilities and time intelligence.
OLAP

Need for OLAP
- Solving modern business problems such as market analysis and financial forecasting
- These problems rest on a multidimensional data model

OLAP Guidelines (i.e., an OLAP system should support):
- Transparency
- Accessibility
- Consistent reporting performance
- Client/server architecture
- Multi-user support
- Flexible reporting
- SQL interface, incremental database refresh
OLAP

Multidimensional versus Multirelational OLAP
- Multidimensional database systems built on relational databases are referred to as multirelational systems.
- That is, to increase the speed of knowledge extraction, star or snowflake schemas are used.
OLAP

Categorization of OLAP Tools
- MOLAP: uses a multidimensional database (MDDB); MDDB + OLAP = MOLAP
- ROLAP: RDBMS + OLAP = ROLAP
- MQE (Managed Query Environment), or hybrid, architecture