Download data warehouses!

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Data model wikipedia , lookup

Data center wikipedia , lookup

Database model wikipedia , lookup

Forecasting wikipedia , lookup

Data analysis wikipedia , lookup

3D optical data storage wikipedia , lookup

Information privacy law wikipedia , lookup

Data vault modeling wikipedia , lookup

Business intelligence wikipedia , lookup

Transcript
DATA
WAREHOUSING
Introduction


Modern organizations have huge amounts of data
but are starving for information – facing information
gap!
Reasons for information gap:
Fragmented information systems, uncoordinated
(sometimes inconsistent) databases /islands of
information due to time and resources constraints
 Most systems support operational /transaction processing
rather than informational processing.


Bridging the information gap are data warehouses!
Introduction (Contd.)

A data warehouse bridges this information gap
as
 it
consolidates and integrates information from
internal and external sources and
 arranges it in a meaningful format for making
accurate business decisions

The two noticeable pioneers in the DWH field
are Ralph Kimball and Bill Inmon.

The term Data Warehouse was coined by Bill
Inmon in 1990
What is a Data Warehouse?

Inmon defines a data warehouse as follows:
 "a
subject-oriented, integrated, time-variant and
non-volatile collection of data in support of
management's decision making process".

Description of the above terms according to
Inmon
 Subject

Oriented:
Data that gives information about a particular subject instead
of about a company's ongoing operations.
 Integrated:

Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
 Time-variant:

All data in the data warehouse is identified with a
particular time dimension. The time factor can be used to
study trends and changes.
 Non-volatile

Data is stable in a data warehouse. More data is added
but data is never removed. This enables management to
gain a consistent picture of the business.
What is a Data Warehouse?

This definition by Inmon remains reasonably
accurate almost ten years later. However,
a
single-subject data warehouse is typically referred
to as a data mart, while data warehouses are
generally enterprise in scope.
 data warehouses can be volatile.


Due to the large amount of storage required for a data
warehouse, (multi-terabyte data warehouses are not
uncommon), only a certain number of periods of history are
kept in the warehouse.
For instance, if three years of data are decided on and
loaded into the warehouse, every month the oldest month will
be "rolled off" the database, and the newest month added.
What is a Data Warehouse?

Ralph Kimball provided a much simpler
definition of a data warehouse.
a
data warehouse is "a copy of transaction data
specifically structured for query and analysis".

This definition provides less insight and depth
than Mr. Inmon's, but is no less accurate.
The Need for Data warehousing

Two factors drive the need for data warehousing
 Need
of company-wide view of data
 Need
to separate operational* and informational*
systems
* See slide notes
Data warehousing?


It is the process of creating, populating, and then
querying a data warehouse
Components of data warehousing are







Source Systems Identification
Data warehouse Design and Creation
Data Acquisition
Changed data capture
Data Cleansing
Data Aggregation
Business Intelligence





Multidimensional Analysis Tools
Query Tools
Data Mining Tools
Data Visualization Tools
Metadata Management
Components of Data
warehousing

Source System Identification


Appropriate data must be located to build a data warehouse.
This data comes from



Current OLTP systems (providing day-to-day information)
`Legacy systems (providing historical data for prior periods)
Data Warehouse Design and Creation


The design is often an iterative process
Requires effort to understand database schema and a great deal
of user interaction
Components of Data warehousing

Data Acquisition
 Involves
moving company data from source systems
into warehouse
 Time consuming activity
 Performed using ETL (Extract/Transform/ Load) Tools
 Data acquisition is an ongoing, scheduled process.
Warehouse is refreshed monthly

Changed data capture
 Periodic
update of warehouse from transactional
systems is complicated as it is difficult to identify
which records in source have changed since last
update.
 Some technologies used in this area are Replication
servers, Publish/Subscribe, Triggers and Stored
procedures, and Database Log analysis.
Components of Data warehousing

Data Cleansing

Typically performed in conjunction with data acquisition
(part of “T” in “ETL”)
 A complicated process that validates and if required
corrects data before its loaded into warehouse
 Example




The entries for "Customer Name" may appear differently in
various source systems for the same customer.
one entered as "IBM", one as "I.B.M.", and one as
"International Business Machines".
A decision must be made as to which is correct, and then the
data cleansing tool will change the others to match the rule.
This process is also referred to as "data scrubbing" or "data
quality assurance". It can be an extremely complex process,
especially if some of the warehouse inputs are from older
mainframe file systems (commonly referred to as "flat files" or
"sequential files").
Components of Data warehousing

Data Aggregation





If required, it is performed during “T” phase of “ETL”
Data warehouse can be designed to store data at detail level
(each individual transaction), at some aggregate level (summary
data) or a combination of both.
The advantage of summarized data is that typical queries
against the warehouse run faster.
The disadvantage is that information, which may be needed to
answer a query, is lost during aggregation
Business Intelligence (BI)

Contains technologies such as Decision Support Systems
(DSS), Executive Information Systems (EIS), On-Line Analytical
Processing (OLAP), Relational OLAP (ROLAP), MultiDimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a
combination of MOLAP and ROLAP), and more.
Components of Data warehousing
 BI

can be broken down into four broad fields
Multidimensional Analysis Tools
Allow user to look at data from various different angles
 Often use a multidimensional database called cube


Query Tools


Allow user to issue SQL queries against warehouse and
get results
Data Mining Tools
Automatically search for patterns in data
 Driven by complex statistical formulas


Data Visualization Tools

Graphically represent data including 3D data pictures
Components of Data warehousing

Metadata management
 Throughout
the entire process of identifying,
acquiring, and querying the data, metadata
management takes place
 The datatype (e.g., string or integer) of the column is
metadata. The name of the column is another. The
actual value in the column for a particular row is not
metadata - it is data
 Metadata is stored in metadata repository
 Metadata is useful in almost all components of data
warehousing discussed earlier
Data Mining

Definition:


Knowledge discovery using a sophisticated blend of techniques
from traditional statistics, artificial intelligence and computer
graphics
Goals



To explain observed events or conditions, such as why sales of a
product have increased in a particular area
To confirm hypothesis, such as, whether two-income families are
more likely to buy family medical cover than single-income
families
To analyze data for new or unexpected relationships, such as
what spending patterns are likely to accompany credit card
fraud.
Data Mining Techniques

Case-based reasoning


Rule discovery


Identifies clusters of observations with similar
characteristics
Neural nets


Searches for patterns and correlations in large data sets
Signal Processing


Derives rules from real world case examples
Develops predictive models based on principles
modeled after the human brain
Fractals

Compresses large databases without losing information
Some Data Mining Applications

Analysis of business trends


Target marketing


Identifying usage patterns of products and services
Product Affinity


Identifying customers for promotional activity
Usage Analysis


Identifying markets with above average or below average growth
Identifying products that are purchased concurrently or
characteristics of shoppers
For further application types - refer to book page 437
table 11.5
Reading Assignment

Read these concepts from Book
 The
ETL process
 Definitions


@ctive data warehouse
Data mart








Independent data mart
dependent data mart
EDW
Star Schema
Snowflake Schema
OLAP
MOLAP
ROLAP
Resources
http://www.intranetjournal.com/features/dat
awarehousing.html
 Modern Database Management, 6/e
