Download Introduction to Database Systems

Document related concepts

Pattern recognition wikipedia , lookup

The Measure of a Man (Star Trek: The Next Generation) wikipedia , lookup

Data (Star Trek) wikipedia , lookup

Time series wikipedia , lookup

Transcript
Data Warehousing/Mining
Comp 150
Additional Information
Instructor: Dan Hebert
Data Warehousing/Mining
1
The Need for Data Analysis

Constant pressure from external and internal forces
requires prompt tactical and strategic decisions.

The decision-making cycle time is reduced, while
problems are increasingly complex with a growing
number of internal and external variables.

Managers need support systems for facilitating quick
decision making in a complex environment.

Decision support systems (DSS).
Data Warehousing/Mining
2
Data Warehousing/Mining
3
Decision Support Systems

Decision Support is a methodology (or a series of
methodologies) designed to extract information from
data and to use such information as a basis for
decision making.

A decision support system (DSS) is an arrangement
of computerized tools used to assist managerial
decision making within a business.
– A DSS usually requires extensive data “massaging” to
produce information.
– The DSS is used at all levels within an organization and is
often tailored to focus on specific business areas or
problems.
– The DSS is interactive and provides ad hoc query tools to
retrieve data and to display data in different formats.
Data Warehousing/Mining
4
Decision Support Systems

Four Components of a DSS
– The data store component is basically a DSS
database.
– The data extraction and filtering component is
used to extract and validate the data taken from
the operational database and the external data
sources.
– The end user query tool is used by the data
analyst to create the queries that access the
database.
– The end user presentation tool is used by the
data analyst to organize and present the data.
Data Warehousing/Mining
5
Main Components Of A Decision Support System (DSS)
Figure 13.1
Data Warehousing/Mining
6
Decision Support Systems

Operational Data vs. Decision Support Data
– Most operational data are stored in a relational
database in which the structures tend to be
highly normalized.
– The operational data storage is optimized to
support transactions that represent daily
operations.
– Whereas operational data capture daily business
transactions, DSS data give tactical and strategic
business meaning to the operational data.
Data Warehousing/Mining
7
Decision Support Systems

Three Main Areas in Which DSS Data Differ
from Operational Data
– Time span


Operational data represent current (atomic)
transactions.
DSS data tend to cover a longer time frame.
– Granularity


Operational data represent specific transactions that
occur at a given time.
DSS data must be presented at different levels of
aggregation.
– Dimensionality

Data Warehousing/Mining

Operational data focus on representing atomic
transactions.
DSS data can be analyzed from multiple dimensions.
8
Data Warehousing/Mining
9
Table 13.2 Contrasting Operational And DSS Data Characteristics
Data Warehousing/Mining
10
Decision Support Systems

The DSS Database Requirements
– Database Schema


The DSS database schema must support complex (nonnormalized) data representations.
The queries must be able to extract multidimensional
time slices.
Data Warehousing/Mining
11
Ten Year Sales History For A Single Department,
Millions Of Dollars
Table 13.3
Data Warehousing/Mining
12
Yearly Sales Summaries, Two Stores and Two Departments
Per Store, Millions Of Dollars
Table 13.4
Data Warehousing/Mining
13
Decision Support Systems
– Data Extraction and Loading
 The
DBMS must support advanced data
extracting and filtering tools.
 The data extraction capabilities should
support different data sources and multiple
vendors.
 Data filtering capabilities must include the
ability to check for inconsistent data or data
validation rules.
 The DBMS must support advanced data
integration, aggregation, and classification
capabilities.
Data Warehousing/Mining
14
Yearly Sales Summaries, 20 Stores, 10 Departments Per Store,
Millions Of Dollars
Table 13.5
Data Warehousing/Mining
15
Decision Support Systems
– End-User Analytical Interface


The DSS DBMS must support advanced data
modeling and data presentation tools, data
analysis tools, and query generation and
optimization components.
The end user analytical interface is one of the
most critical components.
– Database Size Requirements



Data Warehousing/Mining
DSS databases tend to be very large.
The DBMS must be capable of supporting very
large databases (VLDB).
The DBMS may be required to use advanced
hardware, such as multiple disk arrays and
multiple-processor technologies.
16
The Data Warehouse

The Data Warehouse is an integrated, subjectoriented, time-variant, non-volatile database
that provides support for decision making.
– Integrated

The Data Warehouse is a centralized, consolidated
database that integrates data retrieved from the entire
organization.
– Subject-Oriented

The Data Warehouse data is arranged and optimized to
provide answers to questions coming from diverse
functional areas within a company.
Data Warehousing/Mining
17
The Data Warehouse
– Time Variant
 The
Warehouse data represent the flow of
data through time. It can even contain
projected data.
– Non-Volatile
 Once
data enter the Data Warehouse, they are
never removed.
 The
Data Warehousing/Mining
Data Warehouse is always growing.
18
Table 13.6A Comparison Of Data Warehouse And Operational
Database Characteristics
Data Warehousing/Mining
19
Creating A Data Warehouse
Figure 13.3
Data Warehousing/Mining
20
The Data Warehouse

Data Mart
– A data mart is a small, single-subject data
warehouse subset that provides decision support
to a small group of people.
– Data Marts can serve as a test vehicle for
companies exploring the potential benefits of
Data Warehouses.
– Data Marts address local or departmental
problems, while a Data Warehouse involves a
company-wide effort to support decision making
at all levels in the organization.
Data Warehousing/Mining
21
DSS Architectural Styles
Table 13.7
Data Warehousing/Mining
22
The Data Warehouse

Twelve Rules That Define a Data Warehouse
1. The Data Warehouse and operational
environments are separated.
2. The Data Warehouse data are integrated.
3. The Data Warehouse contains historical data over a
long time horizon.
4. The Data Warehouse data are snapshot data
captured at a given point in time.
5. The Data Warehouse data are subject-oriented.
6. The Data Warehouse data are mainly read-only
with periodic batch updates from operational data.
No online updates are allowed.
7. The Data Warehouse development life cycle differs
from classical systems development. The Data
Warehouse development is data driven; the
classical approach is process driven.
Data Warehousing/Mining
23
The Data Warehouse
8. The Data Warehouse contains data with several
levels of detail; current detail data, old detail data,
lightly summarized, and highly summarized data.
9. The Data Warehouse environment is characterized
by read-only transactions to very large data sets.
The operational environment is characterized by
numerous update transactions to a few data entities
at the time.
10. The Data Warehouse environment has a system
that traces data resources, transformation, and
storage.
11. The Data Warehouse’s metadata are a critical
component of this environment. The metadata
identify and define all data elements. The metadata
provide the source, transformation, integration,
storage, usage, relationships, and history of each
data element.
12. The Data Warehouse contains a charge-back
mechanism for resource usage that enforces
optimal use of the data by end users.
Data Warehousing/Mining
24
On-Line Analytical Processing

On-Line Analytical Processing (OLAP) is an
advanced data analysis environment that
supports decision making, business
modeling, and operations research activities.

Four Main Characteristics of OLAP
– Use multidimensional data analysis techniques
– Provide advanced database support
– Provide easy-to-use end user interfaces
– Support client/server architecture
Data Warehousing/Mining
25
On-Line Analytical Processing

Multidimensional Data Analysis Techniques
– The processing of data in which data are viewed
as part of a multidimensional structure.
– Multidimensional view allows end users to
consolidate or aggregate data at different levels.
– Multidimensional view allows a business
analyst to easily switch business perspectives.
– Figure 13.4
Data Warehousing/Mining
26
Figure 13.4 Operational Vs. Multidimensional View Of Sales
Data Warehousing/Mining
27
On-Line Analytical Processing

Additional Functions of Multidimensional
Data Analysis Techniques
– Advanced data presentation functions
– Advanced data aggregation, consolidation, and
classification functions
– Advanced computational functions
– Advanced data modeling functions
Data Warehousing/Mining
28
Figure 13.5 Integration Of OLAP With A Spreadsheet Program
Data Warehousing/Mining
29
On-Line Analytical Processing

Advanced Database Support
– Access to many different kinds of DBMSs, flat files, and
internal and external data sources.
– Access to aggregated Data Warehouse data as well as to the
detail data found in operational databases.
– Advanced data navigation features such as drill-down and
roll-up.
– Rapid and consistent query response times.
– The ability to map end user requests, expressed in either
business or model terms, to the appropriate data source
and then to the proper data access language.
– Support for very large databases.
Data Warehousing/Mining
30
On-Line Analytical Processing

Easy-to-Use End User Interface
– Easy-to-use graphical user interfaces make
sophisticated data extraction and analysis tools
easily accepted and readily used.

Client/Server Architecture
– The client/server environment enables us to
divide an OLAP system into several components
that define its architecture.
Data Warehousing/Mining
31
On-Line Analytical Processing
 OLAP
Architecture
– Three Main Modules
 OLAP
Graphical User Interface (GUI)
 OLAP
Analytical Processing Logic
 OLAP
Data Processing Logic
– OLAP systems are designed to use both
operational and Data Warehouse data.
Data Warehousing/Mining
32
Figure 13.7 OLAP Server Arrangement
Data Warehousing/Mining
33
Figure 13.8 OLAP Server With Multidimensional Data Store Arrangement
Data Warehousing/Mining
34
Figure 13.9 OLAP Server With Local Mini Data-Marts
Data Warehousing/Mining
35
On-Line Analytical Processing

Relational OLAP
– Relational On-Line Analytical Processing
(ROLAP) provides OLAP functionality by using
relational database and familiar relational query
tools.
– Extensions to RDBMS



Multidimensional data schema support within the
RDBMS
Data access language and query performance
optimized for multidimensional data
Support for very large databases
Data Warehousing/Mining
36
On-Line Analytical Processing
– Multidimensional Data Schema Support within
the RDBMS




Normalization of tables in relational technology is seen
as a stumbling block to its use in OLAP systems.
DSS data tend to be non-normalized, duplicated, and
pre- aggregated.
ROLAP uses a special design technique to enable
RDBMS technology to support multidimensional data
representations, known as star schema.
Star schema creates the near equivalent of a
multidimensional database schema from the existing
relational database.
Data Warehousing/Mining
37
On-Line Analytical Processing
– Data Access Language and Query Performance Optimized
for Multidimensional Data


Most decision support data requests require the use of
multiple-pass SQL queries or multiple nested SQL
statements.
ROLAP extends SQL so that it can differentiate between
access requirements for data warehouse data and operational
data.
– Support for Very Large Databases



Decision support data are normally loaded in bulk (batch)
mode from the operational data.
RDBMS must have the proper tools to import, integrate, and
populate the data warehouse with operational data.
The speed of the data-loading operations is important.
Data Warehousing/Mining
38
Figure 13.10 A Typical ROLAP Client/Server Architecture
Data Warehousing/Mining
39
On-Line Analytical Processing

Multidimensional OLAP (MOLAP)
– MOLAP extends OLAP functionality to
multidimensional databases (MDBMS).
– MDBMS end users visualize the stored data as a
multidimensional cube known as a data cube.
– Data cubes are created by extracting data from the
operational databases or from the data warehouse.
– Data cubes are static and require front-end design
work.
– To speed data access, data cubes are normally held
in memory, called cube cache.
– MOLAP is generally faster than their ROLAP
counterparts. It is also more resource-intensive.
– MDBMS is best suited for small and medium data
sets.
– Multidimensional data analysis is also affected by
how the database system handles sparsity.
Data Warehousing/Mining
40
MOLAP Client/Server Architecture
Figure 13.11
Data Warehousing/Mining
41
Relational Vs. Multidimensional OLAP
Table 13.8
Data Warehousing/Mining
42
Star Schema

The star schema is a data-modeling technique used to
map multidimensional decision support into a
relational database.

Star schemas yield an easily implemented model for
multidimensional data analysis while still preserving
the relational structure of the operational database.

Four Components:
–
–
–
–
Facts
Dimensions
Attributes
Attribute hierarchies
Data Warehousing/Mining
43
A Simple Star Schema
Figure 13.12
Data Warehousing/Mining
44
Star Schema

Facts
– Facts are numeric measurements (values) that represent a
specific business aspect or activity.
– The fact table contains facts that are linked through their
dimensions.
– Facts can be computed or derived at run-time (metrics).

Dimensions
– Dimensions are qualifying characteristics that provide
additional perspectives to a given fact.
– Dimensions are stored in dimension tables.
Data Warehousing/Mining
45
Star Schema

Attributes
– Each dimension table contains attributes. Attributes
are often used to search, filter, or classify facts.
– Dimensions provide descriptive characteristics
about the facts through their attributes.
Table 13.9 Possible Attributes For Sales Dimensions
Data Warehousing/Mining
46
Three Dimensional View Of Sales
Figure 13.13
Data Warehousing/Mining
47
Slice And Dice View Of Sales
Figure 13.14
Data Warehousing/Mining
48
Star Schema

Attribute Hierarchies
– Attributes within dimensions can be ordered in a
well-defined attribute hierarchy.
– The attribute hierarchy provides a top-down data
organization that is used for two main purposes:

Aggregation

Drill-down/roll-up data analysis
Data Warehousing/Mining
49
A Location Attribute Hierarchy
Figure 13.15
Data Warehousing/Mining
50
Attribute Hierarchies In Multidimensional Analysis
Figure 13.16
Data Warehousing/Mining
51
Star Schema

Star Schema Representation
– Facts and dimensions are normally represented
by physical tables in the data warehouse
database.
– The fact table is related to each dimension table
in a many-to-one (M:1) relationship.
– Fact and dimension tables are related by foreign
keys and are subject to the primary/foreign key
constraints.
Data Warehousing/Mining
52
Figure 13.17 Star Schema For Sales
Data Warehousing/Mining
53
Figure 13.18 Orders Star Schema
Data Warehousing/Mining
54
Star Schema
 Performance-Improving
Techniques
– Normalization of dimensional tables
– Multiple fact tables representing
different aggregation levels
– Denormalization of fact tables
– Table partitioning and replication
Data Warehousing/Mining
55
Figure 13.19 Normalized Dimension Tables
Data Warehousing/Mining
56
Figure 13.20
Multiple Fact Tables
Data Warehousing/Mining
57
Data Warehouse Implementation

The Data Warehouse as an Active Decision
Support Network
– A Data Warehouse is a dynamic support
framework.
– Implementation of a Data Warehouse is part of a
complete database-system-development
infrastructure for company-wide decision
support.
– Its design and implementation must be
examined in the light of the entire infrastructure.
Data Warehousing/Mining
58
Data Warehouse Implementation

A Company-Wide Effort that Requires User
Involvement and Commitment at All Levels
– For a successful design and implementation, the
designer must:

Involve end users in the process.

Secure end users’ commitment from the beginning.

Create continuous end user feedback.

Manage end user expectations.

Establish procedures for conflict resolution.
Data Warehousing/Mining
59
Data Warehouse Implementation

Satisfy the Trilogy: Data, Analysis, and Users
– For a successful design and implementation, the
designer must satisfy:



Data integration and loading criteria.
Data analysis capabilities with acceptable query
performance.
End user data analysis needs.
Data Warehousing/Mining
60
Data Warehouse Implementation

Apply Database Design Procedures
– Data Warehouse development is a companywide effort and requires many resources: people,
financial, and technical.



The sheer and often mind-boggling quantity of
decision support data is likely to require the latest
hardware and software.
It is also imperative to have very detailed procedures to
orchestrate the flow of data from the operational
databases to the Dare Warehouse.
To implement and support the Data Warehouse
architecture, we also need people with advanced
database design, software integration, and
management skills.
Data Warehousing/Mining
61
Data Warehouse Implementation Road Map
Figure 13.21
Data Warehousing/Mining
62
Data Mining

In contrast to the traditional (reactive) DSS tools, the
data mining premise is proactive.

Data mining tools automatically search the data for
anomalies and possible relationships, thereby
identifying problems that have not yet been
identified by the end user.

Data mining tools -- based on algorithms that form
the building blocks for artificial intelligence, neural
networks, inductive rules, and predicate logic -initiate analysis to create knowledge.
Data Warehousing/Mining
63
Extraction Of Knowledge From Data
Figure 13.22
Data Warehousing/Mining
64
Data Mining

Four Phases of Data Mining
1. Data Preparation


Identify and cleanse data sets.
Data Warehouse is usually used for data mining
operations.
2. Data Analysis and Classification

Identify common data characteristics or patterns using
– Data groupings, classifications, clusters, or sequences.
– Data dependencies, links, or relationships.
– Data patterns, trends, and deviations.
Data Warehousing/Mining
65
Data Mining
3. Knowledge Acquisition


Select the appropriate modeling or knowledge
acquisition algorithms.
Examples: neural networks, decision trees, rules
induction, genetic algorithms, classification and
regression tree, memory-based reasoning, or nearest
neighbor and data visualization.
4. Prognosis

Predict future behavior and forecast business outcomes
using the data mining findings.
Data Warehousing/Mining
66
Data-Mining Phases
Figure 13.23
Data Warehousing/Mining
67
A Sample Of Current Data Warehousing And
Data Mining Vendors
Table 13.10
Data Warehousing/Mining
68