Chapter 3
Database Support in Data Mining
Types of database systems
How they relate to data mining
Contents
Describes data warehousing and related database systems
Discusses features of data found in data warehouses
Describes how data warehouses are typically implemented
and operated
Defines metadata in the context of data warehouses
Shows how different data systems are typically used in data
mining
Provides real examples of database systems used in data
mining
Discusses the concept of data quality
Reviews the database software market
Data management
Retail organizations generate masses of data that require
very advanced data storage systems.
Wal-Mart relied on modern data management to
engage in supply chain management (SCM).
The manipulation of data is a key element in the data
mining process.
Data mining and other analysis can draw upon data
collected in internal systems and external sources.
Data access
Data warehouses are not required for data mining, but
they store massive amounts of data that can be used
for data mining.
Data mining analyses also use smaller sets of data that
can be organized in online analytic processing (OLAP)
systems or in data marts.
OLAP provides access to report generators and
graphical support.
Contemporary Database
Gain competitive advantage
customer information systems
data mining
Develop and market new products
micromarketing
Systems
Database
Personal, small business level
On-Line Analytic Processing (OLAP)
Ability to use many dimensions, reports & graphics
Data Mart
Usually temporary analysis
Data Warehouse
Usually permanent repository
Data Warehousing
Price Waterhouse definition:
A data warehouse is an orderly and accessible
repository of known facts and related data that
is used as a basis for making better
management decisions. The data warehouse
provides a unified repository of consistent data
for decision making that is subject oriented,
integrated, time variant, and nonvolatile.
Data Warehousing
Data warehouses are used to store massive quantities of data
that can be updated and allow quick retrieval of specific
types of data.
Not just a technology; an architecture and process designed
to support decision making
Special-purpose database systems that improve query
performance significantly
Three general data warehouse processes:
1. Warehouse generation is the process of designing the
warehouse and loading the data.
2. Data management is the process of storing the data.
3. Information analysis is the process of using the data to support
organizational decision making.
Benefits from Data Warehousing
Provide business users views of data appropriate
to mission
Consolidate & reconcile (consistent) data
Give macro views of critical aspects
Timely & detailed access to information
Provide specific information to particular
groups
Ability to identify trends
Data warehousing
Within data warehouses, data is classified and
organized around subjects meaningful to the
company.
The data is gathered from operational systems:
Barcode readers at cash registers,
Information from e-commerce,
Daily reports…
Industry volumes
Economic data…
Data from different sources (shipping, marketing,
billing) are integrated into a common format.
Data Transformation
Consolidate data from multiple sources
Filter to eliminate unnecessary details
Clean data
eliminate incorrect entries
eliminate duplications
Convert & translate data into proper format
Aggregate data as designed
Data warehousing
A data warehouse is a central aggregation of data,
intended as a permanent storage facility holding
normalized, formatted data.
Normalized implies the use of small, stable data
structures within the database. Normalized data would
group data elements by category, making it possible to
apply relational principles in data updating.
Key Concepts
Scalability
Ability to accurately cope with changing
conditions (especially magnitude of computing)
Granularity
Level of detail
Data warehouse – tends to be fine granularity
OLAP – tends to aggregate to coarse granularity
Data Warehousing
OLAP
summary data
few users
data driven
effectiveness
On-Line Transactional Processing (OLTP)
detailed operational data
many concurrent users
transaction driven
efficiency
use spreadsheets to access
Data Marts
Intermediate-level database system
Originally, many data marts were marketed as preliminary
data warehouses. Currently, many data marts are used in
conjunction with data warehouses rather than as
competitive products.
Data marts are usually used as repositories of data
gathered to serve a particular set of users, providing data
extracted from data warehouses and/or other sources.
Often used as temporary storage
• Gather data for study from data warehouse, other sources
(including external)
• Clean & transform for data mining
OLAP
Multidimensional spreadsheet approach to shared data
storage, designed to allow users to extract data and
generate reports on the dimensions important to them.
Data is segregated into different dimensions and
organized in a hierarchical manner.
Hypercube – term reflecting the ability to sort on many
dimensions
Many forms
• MOLAP – multidimensional
• ROLAP – relational (uses SQL)
• DOLAP – desktop
• WOLAP – web enabled
• HOLAP – hybrid
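The idea of reporting on whichever dimensions matter to the user can be sketched in a few lines of Python. This is only an illustration, not how any OLAP product is implemented: the `SALES` records, the dimension names, and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions and one measure (revenue).
SALES = [
    {"region": "East", "product": "A", "quarter": "Q1", "revenue": 100.0},
    {"region": "East", "product": "B", "quarter": "Q1", "revenue": 50.0},
    {"region": "West", "product": "A", "quarter": "Q1", "revenue": 70.0},
    {"region": "West", "product": "A", "quarter": "Q2", "revenue": 30.0},
]

def roll_up(records, dimensions, measure):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key] += rec[measure]
    return dict(totals)

# The same data viewed along different dimensions.
by_region = roll_up(SALES, ["region"], "revenue")
by_product_quarter = roll_up(SALES, ["product", "quarter"], "revenue")
```

Choosing fewer dimensions gives a coarser summary; choosing more gives a finer one, which is the granularity trade-off described earlier.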
OLAP
One function of OLAP is standard report generation,
including financial performance analysis on selected
dimensions (such as by department, geographical
region, product, salesperson, time…).
OLAP also supports planning and forecasting projects using
spreadsheet analytic tools.
An OLAP product includes a data warehouse, an
OLAP server, and a client server on a local area
network (LAN).
OLAP functions – see page 37
Relationships of database and DM
Data warehouses are not
required for data mining,
nor are OLAP systems.
However, the existence
of either presents many
opportunities for data
mining.
Data Warehouse Implementation
Data warehouses create the opportunity to provide
much better information than what was available in
the past. DW can produce consistent views of events
and reports.
A DW provides a reliable, comprehensive source of clean
data
• Accurate, complete, in correct format
Processes
• System development
• Data acquisition
• Data extraction for use
Data Warehouse Implementation
Implementation processes involve a degree of
continuity, since data warehousing is a dynamic
environment.
A suite of software tools is needed to extract data
from sources, move it to the data warehouse
itself, and provide user access to this information.
Data acquisition is supported by data warehouse
generation.
Data Warehouse Generation
Extract data from sources
Transform
Clean
Load into data warehouse
60-80% of the effort in operating a data warehouse
Data Extraction Routines
Extraction programs are executed periodically
to obtain records, and copy the information to
an intermediate file.
Data extraction routines:
Interpret data formats
Identify changed records
Copy information to intermediate file
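The three extraction steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the source rows, the `updated_at` change-tracking column, and the `extract_changed` function are hypothetical, and an in-memory buffer stands in for the intermediate file.

```python
import csv
import io
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": "1", "item": "hose", "updated_at": "2024-01-05T09:00:00"},
    {"id": "2", "item": "gasket", "updated_at": "2024-01-12T14:30:00"},
    {"id": "3", "item": "belt", "updated_at": "2024-01-15T08:15:00"},
]

def extract_changed(rows, last_run, out_file):
    """Copy rows modified after `last_run` to an intermediate CSV file."""
    writer = csv.DictWriter(out_file, fieldnames=["id", "item", "updated_at"])
    writer.writeheader()
    copied = 0
    for row in rows:
        # Interpret the source's timestamp format before comparing.
        changed_at = datetime.fromisoformat(row["updated_at"])
        if changed_at > last_run:  # identify changed records
            writer.writerow(row)   # copy to the intermediate file
            copied += 1
    return copied

intermediate = io.StringIO()  # stands in for the intermediate file
n_copied = extract_changed(SOURCE_ROWS, datetime(2024, 1, 10), intermediate)
```

Running the routine periodically with the time of the previous run as `last_run` picks up only the records that changed in between.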
Data Transformation
Transformation programs accomplish final data
preparation, including:
The consolidation of data from multiple sources
Filtering data to eliminate unnecessary details
Cleaning data to eliminate incorrect entries or duplications
Converting and translating data into the format
established for the data warehouse
The aggregation of data
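A small Python sketch can make these transformation steps concrete. Everything in it is an assumption made for illustration: the two source extracts, the field names, the rule that a non-positive amount is an incorrect entry, and the choice to aggregate by customer and month.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems with different conventions.
BILLING = [
    {"cust": "C01", "amount": "120.50", "date": "2024-01-03", "clerk": "ab"},
    {"cust": "C02", "amount": "-5.00", "date": "2024-01-04", "clerk": "cd"},
]
ECOMMERCE = [
    {"cust": "c01", "amount": "79.50", "date": "2024-01-09", "clerk": ""},
    {"cust": "c01", "amount": "79.50", "date": "2024-01-09", "clerk": ""},
]

def transform(*sources):
    """Consolidate, filter, clean, convert, and aggregate source rows."""
    seen = set()
    monthly = defaultdict(float)
    for row in (r for src in sources for r in src):   # consolidate sources
        cust = row["cust"].upper()                     # convert to one format
        amount = float(row["amount"])
        key = (cust, amount, row["date"])              # "clerk" is filtered out
        if amount <= 0 or key in seen:                 # clean: bad or duplicate
            continue
        seen.add(key)
        monthly[(cust, row["date"][:7])] += amount     # aggregate by month
    return dict(monthly)

warehouse_rows = transform(BILLING, ECOMMERCE)
```

Here the negative billing entry and the repeated e-commerce row are dropped, and the two remaining rows for customer C01 are combined into one monthly total.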
Data Management
Data management involves:
Retrieving information from the data warehouse
Running extraction programs to generate
repetitive reports and serve specific needs
Implementation problems:
Required data not available
Initial data warehouse scope too broad
Not enough time for prototyping or needs
analysis
Insufficient senior direction
Meta Data
Data warehouse management vs. data
management:
• Data management concerns the management of all of the
enterprise’s data.
• Data warehouse management refers to the design and
operation of the data warehouse through all phases of its
life cycle.
• Manage meta data
• Design data warehouse
• Ensure data quality
• Manage system during operations
Meta Data
Metadata is the set of reference data used to keep track of
data, and it describes the organization of the
warehouse.
A data catalog provides users with the ability to see
specifically what the data warehouse contains.
The content of the data warehouse is defined by
metadata, which provides business views of data
(information access tools) and technical views
(warehouse generation tools).
Business Metadata
What data are available
Source of each data element
Frequency of data updates
Location of specific data
Predefined reports & queries
Methods of data access
Technical Meta Data
Data source
(internal or external)
Data preparation features
(transformation & aggregation rules)
Logical structure of data
Physical structure & content
Data ownership
Security aspects
(access rights, restrictions)
System information
(date of last update, retention policy, data usage)
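One way to picture the business and technical views listed above is as two records attached to each data element in a catalog. The sketch below is purely illustrative: the class names, fields, and the `weekly_sales` entry are hypothetical and are not taken from any particular warehouse product.

```python
from dataclasses import dataclass, field

@dataclass
class BusinessMetadata:
    description: str          # what data are available
    source: str               # source of the data element
    update_frequency: str     # frequency of data updates
    location: str             # location of specific data
    access_methods: list = field(default_factory=list)

@dataclass
class TechnicalMetadata:
    source_type: str          # "internal" or "external"
    transformation_rule: str  # data preparation feature
    logical_type: str         # logical structure of data
    owner: str                # data ownership
    access_roles: list = field(default_factory=list)  # security aspects
    last_update: str = ""     # system information

@dataclass
class CatalogEntry:
    element: str
    business: BusinessMetadata
    technical: TechnicalMetadata

# A hypothetical catalog entry for one data element.
entry = CatalogEntry(
    element="weekly_sales",
    business=BusinessMetadata(
        description="Sales by item, store, and week",
        source="point-of-sale system",
        update_frequency="weekly",
        location="sales subject area",
        access_methods=["predefined report", "ad hoc query"],
    ),
    technical=TechnicalMetadata(
        source_type="internal",
        transformation_rule="sum daily transactions by week",
        logical_type="decimal",
        owner="merchandising",
        access_roles=["buyer", "forecaster"],
        last_update="2024-01-15",
    ),
)

def visible_to(catalog_entry, role):
    """Check the security aspect: may this role access the element?"""
    return role in catalog_entry.technical.access_roles
```

A data catalog would then be a collection of such entries that users can browse to see what the warehouse contains.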
Wal-Mart’s Data Warehouse
Heavy user of IT
Core competency – supply chain distribution
2900 outlets
Data warehouse of 101 terabytes ($4 billion)
65 million transactions per week
Subject-oriented, integrated, time-variant, nonvolatile
data
65 weeks of data by item, store, day
Wal-Mart
Use data warehouse to:
Support decision making
Buyers, merchandisers, logistics, forecasters
3,500 vendor partners can query
Can handle 35,000 queries per week
Benefit of $12,000 per query
Some users run about 1,000 queries per day
Summers Rubber Company
Distribution firm
7 operating locations
10,000 items
3,000 customers
Old system:
OLAP
Databases transactional & summarized,
distributed
Summers Data Storage System
Built in-house, PCs, Access database
Visual Basic & Excel
Distributed system
Data warehouse server controlled queries, managed
resources
Security
Passwords gave some protection
To protect against departing employees, used data marts
with small versions of the central database
Summers – Negative features
Too much disk space on user local drives
Often difficult to understand & use
Updating multiple data sites slow, limited access
Summary data often wrong
Couldn’t use data mining tools
The problem was that the stored data was aggregated
Comparison
Product    Use                Duration    Granularity
Warehouse  Repository         Permanent   Finest
Mart       Specific study     Temporary   Aggregate
OLAP       Report & analysis  Repetitive  Summary
Examples of Data Uses
Customer information systems
Fingerhut
Customer Information Systems
Massive databases
Detailed information about individuals and
households
Use automated analysis
identify focused market target
Micromarketing
Target small groups of highly responsive
customers
Own niches like smaller competitors
EXAMPLES:
Great Atlantic & Pacific Tea Company (A&P)
target customers, centralize buying
Fingerhut
sells on credit to households with income under $25,000
System demonstrations
The example is a dealer wholesaler.
Table 3.1 shows a small portion of the data, for the first 10 shipments.
Data warehouses are normalized into relational form. The data is
organized into a series of tables connected by keys.
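A normalized layout of tables connected by keys can be sketched with Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical and are not the contents of Table 3.1; they only show how shipment records can reference customers and products by key and be joined back together for a revenue query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (product_id TEXT PRIMARY KEY, unit_price REAL);
    -- Each shipment refers to a customer and a product by key only.
    CREATE TABLE shipment (
        shipment_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        product_id  TEXT REFERENCES product(product_id),
        cases       INTEGER
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
    INSERT INTO product  VALUES ('P1', 10.0), ('P2', 4.0);
    INSERT INTO shipment VALUES (1, 1, 'P1', 3), (2, 2, 'P2', 5), (3, 1, 'P2', 2);
""")

# Join the tables through their keys to compute revenue per customer.
revenue_by_customer = conn.execute("""
    SELECT c.name, SUM(s.cases * p.unit_price)
    FROM shipment s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Because prices and customer names are stored once, an update touches a single row rather than every shipment that mentions them.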
Data mart
Examining the characteristics of customers who buy the
products. (Advertising by mail, internet, …)
Data marts could extract the data and aggregate it in a form
useful for data mining.
Table 3.2 shows entries that might be found in a data mart (for
product D428 over a two-year interval).
OLAP
An OLAP application focuses more on analyzing trends or other
aspects of organizational operations. It may obtain much of its
information from the data warehouse, but extracts granular
information.
This information could be accessed to make a report by product
category (Table 3.3).
OLAP
Evaluating the value of each client to the firm.
Data can be aggregated within data mart, or on an
OLAP system.
OLAP
Organizing volume according to the shipper.
Table 3.5 displays the cases shipped by each shipper.
Data Quality
Data warehouse projects can fail. One of the most common
reasons is the refusal of users to accept the validity of
data obtained from a data warehouse. Causes include:
• The corruption of data or missing data from the original sources.
• Failure of the software transferring data into or out of the data
warehouse.
• Failure of the data-cleansing process to resolve data inconsistencies.
The responsible staff must verify the integrity of the data and of
the data loading and storing processes.
Data Integrity: Do not allow any meaningless, corrupt, or
redundant data into the data warehouse.
Controls can be implemented prior to loading data, in the data
migration, cleansing, transforming, and loading processes.
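A control applied prior to loading might look like the sketch below. The record fields and the three rules (a missing customer is meaningless, a non-integer or negative case count is corrupt, a repeated shipment ID is redundant) are assumptions chosen for illustration; real integrity rules depend on the warehouse design.

```python
def validate(record, loaded_keys):
    """Return a list of integrity problems; an empty list means loadable."""
    problems = []
    if not record.get("customer"):                       # meaningless
        problems.append("missing customer")
    cases = record.get("cases")
    if not isinstance(cases, int) or cases < 0:          # corrupt
        problems.append("corrupt cases value")
    if record.get("shipment_id") in loaded_keys:         # redundant
        problems.append("duplicate shipment_id")
    return problems

def load(records):
    """Load only records that pass validation; keep the rest for review."""
    warehouse, rejected, keys = [], [], set()
    for rec in records:
        problems = validate(rec, keys)
        if problems:
            rejected.append((rec, problems))
        else:
            warehouse.append(rec)
            keys.add(rec["shipment_id"])
    return warehouse, rejected

# Hypothetical incoming batch with one good record and three bad ones.
INCOMING = [
    {"shipment_id": 1, "customer": "Acme", "cases": 3},
    {"shipment_id": 1, "customer": "Acme", "cases": 3},
    {"shipment_id": 2, "customer": "", "cases": 5},
    {"shipment_id": 3, "customer": "Bolt", "cases": "x"},
]
warehouse, rejected = load(INCOMING)
```

Keeping the rejected records together with the reasons gives the responsible staff something concrete to verify against the original sources.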
Data Quality
Table 3.6 illustrates an example of multiple variations.
What are the variations?
1. Variations of the same customer
2. Misspellings
3. Correct spelling but with a more complete definition
Data Quality
Matching involves associating variables.
Software used to introduce new data into the data
warehouse needs to check that the appropriate spelling
and entry values are used. This also covers matching companies
with addresses… and some maintenance.
Software tools to ensure data quality support:
• The analysis of data for type
• The construction of standardization schemes
• The identification of redundant data
• The adjustment of matching criteria to achieve selected
levels of discrimination
• The transformation of data into designed format
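Standardization and adjustable matching criteria can be sketched with Python's standard `difflib` module. This is one possible approach, not the method of any specific data quality tool: the `standardize` scheme, the `is_match` helper, the default threshold of 0.8, and the company names in the usage note are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def standardize(name):
    """Standardization scheme: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())

def is_match(a, b, threshold=0.8):
    """Treat two names as the same entity if their similarity >= threshold."""
    ratio = SequenceMatcher(None, standardize(a), standardize(b)).ratio()
    return ratio >= threshold
```

With the default threshold, `is_match("Acme Rubber Co", "Acme Ruber Co")` treats the misspelling as the same customer, and `is_match("Acme Rubber Co", "Acme Rubber Company")` accepts the more complete form. Raising the threshold to 0.9 rejects the second pair, which shows how the matching criteria trade missed matches against false ones.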
Software products