Data Warehousing: Introduction
Data Warehouse: Architecture
What is a Data Warehouse?
• A data warehouse is a relational database that is
designed for query and analysis rather than for
transaction processing.
• It usually contains historical data derived from
transaction data, but it can include data from
other sources.
• It separates analysis workload from transaction
workload and enables an organization to
consolidate data from several sources.
Data Warehouse Environment
• In addition to a relational database, a data
warehouse environment includes:
• An extraction, transportation, transformation,
and loading (ETL) solution
• An online analytical processing (OLAP) engine
• Client analysis tools
• Other applications that manage the process of
gathering data and delivering it to business users
Characteristics of a Data Warehouse
• A common way of introducing data
warehousing is to refer to the characteristics
of a data warehouse as set forth by William
Inmon:
• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant
Subject Oriented
• Data warehouses are designed to help you
analyze data.
• For example, to learn more about your
company's sales data, you can build a warehouse
that concentrates on sales. Using this warehouse,
you can answer questions like "Who was our best
customer for this item last year?"
• This ability to define a data warehouse by subject
matter, sales in this case, makes the data
warehouse subject oriented.
Integrated
• Integration is closely related to subject
orientation. Data warehouses must put data
from disparate sources into a consistent
format.
• They must resolve such problems as naming
conflicts and inconsistencies among units of
measure. When they achieve this, they are
said to be integrated.
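As a minimal sketch of what integration involves, the Python snippet below maps records from two hypothetical source systems onto one consistent format. The field names (cust_nm, customer_name, weight_lb, weight_kg) and the sample rows are assumptions made up for illustration; they do not come from any real system.

```python
# Hypothetical records from two source systems that name the same
# fields differently and use different units of measure.
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

source_a = [{"cust_nm": "Acme", "weight_lb": 10.0}]
source_b = [{"customer_name": "Globex", "weight_kg": 2.5}]


def from_source_a(row):
    """Resolve the naming conflict and convert pounds to kilograms."""
    return {
        "customer_name": row["cust_nm"],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3),
    }


def from_source_b(row):
    """Source B already uses the warehouse names and units; copy it."""
    return {"customer_name": row["customer_name"], "weight_kg": row["weight_kg"]}


# Every integrated row now has the same field names and the same unit.
integrated = [from_source_a(r) for r in source_a] + [
    from_source_b(r) for r in source_b
]
print(integrated)
```

After this step, every row can be queried by the same names and compared in the same unit, which is the property the slide calls integrated.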
Nonvolatile
• Nonvolatile means that, once entered into the
warehouse, data should not change. This is
logical because the purpose of a warehouse is
to enable you to analyze what has occurred.
Time Variant
• In order to discover trends in business,
analysts need large amounts of data. This is
very much in contrast to online transaction
processing (OLTP) systems, where
performance requirements demand that
historical data be moved to an archive.
• A data warehouse's focus on change over time
is what is meant by the term time variant.
Contrasting OLTP and Data
Warehousing Environments
OLTP and Data Warehousing
Environments
• Data warehouses and OLTP systems have very
different requirements. Here are some examples
of differences between typical data warehouses
and OLTP systems:
• Workload
• Data modifications
• Schema design
• Typical operations
• Historical data
Workload
• Data warehouses are designed to accommodate
ad hoc queries. You might not know the workload
of your data warehouse in advance, so a data
warehouse should be optimized to perform well
for a wide variety of possible query operations.
• OLTP systems support only predefined
operations. Your applications might be
specifically tuned or designed to support only
these operations.
Data Modifications
• A data warehouse is updated on a regular basis
by the ETL process (run nightly or weekly) using
bulk data modification techniques. The end users
of a data warehouse do not directly update the
data warehouse.
• In OLTP systems, end users routinely issue
individual data modification statements to the
database. The OLTP database is always up to
date, and reflects the current state of each
business transaction.
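The contrast between the two modification styles can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions: the table names (orders, sales_history), the columns, and the load date are hypothetical, and both tables live in one in-memory database only to keep the example short. In practice the OLTP system and the warehouse are separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE sales_history (order_id INTEGER, status TEXT, load_date TEXT)"
)

# OLTP style: end users issue individual statements, so the table
# always reflects the current state of each business transaction.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("INSERT INTO orders VALUES (2, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 1")
conn.commit()

# Warehouse style: a scheduled job copies a whole batch in one bulk
# operation; end users never modify sales_history directly.
batch = [
    (order_id, status, "2024-01-01")  # hypothetical load date
    for order_id, status in conn.execute(
        "SELECT order_id, status FROM orders ORDER BY order_id"
    ).fetchall()
]
with conn:  # commits the bulk insert as a single transaction
    conn.executemany("INSERT INTO sales_history VALUES (?, ?, ?)", batch)

loaded = conn.execute("SELECT COUNT(*) FROM sales_history").fetchone()[0]
print(loaded)
```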
Schema design
• Data warehouses often use denormalized or
partially denormalized schemas (such as a
star schema) to optimize query performance.
• OLTP systems often use fully normalized
schemas to optimize update/insert/delete
performance, and to guarantee data
consistency.
Typical operations
• A typical data warehouse query scans
thousands or millions of rows. For example,
"Find the total sales for all customers last
month."
• A typical OLTP operation accesses only a
handful of records. For example, "Retrieve the
current order for this customer."
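The two example questions can be written as SQL and run with Python's built-in sqlite3 module. This is a small sketch only: the sales table, its columns, and the four sample rows are assumptions chosen for illustration, and a real warehouse query would scan far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2024-01-05", 50.0),
        (2, 102, "2024-01-17", 75.0),
        (3, 101, "2024-01-29", 25.0),
        (4, 103, "2024-02-02", 40.0),
    ],
)

# Warehouse-style query: aggregate over every row in a date range
# ("find the total sales for all customers last month").
total_january = conn.execute(
    "SELECT SUM(amount) FROM sales "
    "WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'"
).fetchone()[0]

# OLTP-style operation: fetch a single record for one customer
# ("retrieve the current order for this customer").
latest_order = conn.execute(
    "SELECT order_id, amount FROM sales "
    "WHERE customer_id = ? ORDER BY order_date DESC LIMIT 1",
    (101,),
).fetchone()

print(total_january, latest_order)
```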
Historical data
• Data warehouses usually store many months
or years of data. This is to support historical
analysis.
• OLTP systems usually store data from only a
few weeks or months. The OLTP system stores
historical data only as needed to successfully
meet the requirements of the current
transaction.
Data Warehouse Applications
• As discussed before, a data warehouse helps business
executives to organize, analyze, and use their data for
decision making. A data warehouse serves as an integral
part of a plan-execute-assess "closed-loop" feedback
system for enterprise management. Data
warehouses are widely used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
Strategic uses of data warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and insurance | Product development; operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering
Functions of Data Warehouse Tools
and Utilities
• Data Extraction - Involves gathering data from
multiple heterogeneous sources.
• Data Cleaning - Involves finding and correcting
the errors in data.
• Data Transformation - Involves converting the
data from legacy format to warehouse format.
• Data Loading - Involves sorting, summarizing,
consolidating, checking integrity, and building
indices and partitions.
• Refreshing - Involves updating from data sources
to warehouse.
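The functions listed above can be sketched as a small pipeline in Python, using only the standard library. Everything specific here is an assumption for illustration: the raw rows, the field names, the rule that a non-numeric amount counts as an error, and the sales table are all hypothetical.

```python
import sqlite3

# Extraction: rows as they might arrive from heterogeneous sources,
# with everything still stored as text.
raw_rows = [
    {"id": "1", "city": " vancouver ", "amount": "10.50"},
    {"id": "2", "city": "Victoria", "amount": "n/a"},  # not a number
    {"id": "3", "city": "VICTORIA", "amount": "4.25"},
]


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert text fields to the warehouse format."""
    return [
        (int(r["id"]), r["city"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Loading: sort the rows, insert them, and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", sorted(rows))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_city ON sales (city)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(raw_rows)))
loaded = conn.execute("SELECT id, city, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Refreshing would amount to running the same pipeline again on newly extracted rows; in this sketch, INSERT OR REPLACE lets a repeated id overwrite the earlier row.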
Disadvantages of data warehouses
• Data warehouses are not the optimal environment for
unstructured data.
• Because data must be extracted, transformed and loaded into the
warehouse, there is an element of latency in data warehouse
data.
• Over their life, data warehouses can have high costs.
Maintenance costs are high.
• Data warehouses can get outdated relatively quickly. There is a
cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality may be
developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in the
operational systems and vice versa.
Data Warehousing Typology
• The virtual data warehouse – the end users have
direct access to the data stores, using tools enabled
at the data access layer
• The central data warehouse – a single physical
database contains all of the data for a specific
functional area
• The distributed data warehouse – the components
are distributed across several physical databases
The architecture
[Figure: Typical architecture of a data warehouse. Labeled components:
operational data sources 1 to n; operational data store (ODS); load
manager; warehouse manager; query manager; DBMS holding meta-data,
detailed data, lightly summarized data, and highly summarized data;
archive/backup data; end-user access tools: reporting, query,
application development, and EIS (executive information system) tools,
OLAP (online analytical processing) tools, and data mining.]
The main components
• Operational data sourcesThe data in
DW is supplied from
mainframe operational data sources like hierarchical and network
databases, proprietary file systems, private serves and external systems
such as the Internet, commercially available DB, or DB assoicated with and
organization’s suppliers or customers
• Operational datastore(ODS)It is a repository of current
and integrated operational data used for analysis. It is often structured
and supplied with data in the same way as the data warehouse, but may in
fact simply act as a staging area for data to be moved into the warehouse
The main components
• load manager Is also called the frontend component, it performance
all the operations associated with the extraction and loading of data into
the warehouse. These operations include simple transformations of the
data to prepare the data for entry into the warehouse
• Warehouse managerperforms all the operations associated with
the management of the data in the warehouse. The operations performed
by this component include analysis of data to ensure consistency,
transformation and merging of source data, creation of indexes and views,
generation of denormalizations and aggregations, and archiving and
backing-up data
The main components
• Query manageralso called backend component, it performs all the
operations associated with the management of user queries. The
operations performed by this component include directing queries to the
appropriate tables and scheduling the execution of queries
• Detailed, lightly and lightly summarized data,archive/backup
data
• Meta-data
• end-user access toolscan be categorized into five main groups:
data reporting and query tools, application development tools, executive
information system (EIS) tools, online analytical processing (OLAP) tools,
and data mining tools
Data Warehousing - Schemas
• A schema is a logical description of the entire
database. It includes the name and
description of records of all record types,
including all associated data items and
aggregates. Much like a database, a data
warehouse also needs to maintain a
schema.
Fact table
• In data warehousing, a Fact table consists of
the measurements, metrics or facts of
a business process. It is located at the center
of a star schema or a snowflake
schema surrounded by dimension tables.
Where multiple fact tables are used, these are
arranged as a fact constellation schema.
Database schema for a data warehouse
• Star schema
• Snowflake schema
Star Schema
• Each dimension in a star schema is
represented with only one dimension table.
• This dimension table contains the set of
attributes.
• The following diagram shows the sales data of
a company with respect to the four
dimensions, namely time, item, branch, and
location.
[Diagram: star schema of the sales data with the time, item, branch, and location dimensions]
Star Schema
• Each dimension has only one dimension table
and each table holds a set of attributes. For
example, the location dimension table contains
the attribute set {location_key, street, city,
province_or_state, country}. This constraint may
cause data redundancy. For example, the cities
"Vancouver" and "Victoria" are both in
the Canadian province of British Columbia, so the
entries for these cities repeat the same values
in the attributes province_or_state and
country.
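A star schema along these lines can be sketched with Python's built-in sqlite3 module. The location attributes come from the slide above; everything else is an assumption for illustration: the table names, the columns of the time, item, and branch dimensions, the units_sold measure, the street addresses, and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per dimension (columns other than location are hypothetical).
    CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE item_dim   (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE location_dim (
        location_key INTEGER PRIMARY KEY,
        street TEXT, city TEXT, province_or_state TEXT, country TEXT
    );
    -- The fact table sits at the center and references every dimension.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim (time_key),
        item_key     INTEGER REFERENCES item_dim (item_key),
        branch_key   INTEGER REFERENCES branch_dim (branch_key),
        location_key INTEGER REFERENCES location_dim (location_key),
        units_sold   INTEGER
    );
    INSERT INTO time_dim   VALUES (1, 2024);
    INSERT INTO item_dim   VALUES (1, 'Widget');
    INSERT INTO branch_dim VALUES (1, 'B1');
    INSERT INTO location_dim VALUES
        (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
        (2, '2 Bay St',  'Victoria',  'British Columbia', 'Canada');
    INSERT INTO sales_fact VALUES (1, 1, 1, 1, 5), (1, 1, 1, 2, 3);
    """
)

# One join from the fact table reaches every location attribute.
units_by_city = conn.execute(
    """
    SELECT l.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
    """
).fetchall()

# The redundancy: the province and country values are stored once per row.
repeated = conn.execute(
    "SELECT COUNT(*) FROM location_dim "
    "WHERE province_or_state = 'British Columbia' AND country = 'Canada'"
).fetchone()[0]

print(units_by_city, repeated)
```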
Snowflake Schema
• Note: Due to normalization in the snowflake
schema, redundancy is reduced. The schema
therefore becomes easier to maintain and
saves storage space.
Criterion | Snowflake Schema | Star Schema
Ease of maintenance / change | No redundancy, so snowflake schemas are easier to maintain and change. | Has redundant data and is hence less easy to maintain/change.
Ease of use | More complex queries, hence less easy to understand. | Lower query complexity, hence easy to understand.
Query performance | More foreign keys and hence longer query execution time (slower). | Fewer foreign keys and hence shorter query execution time (faster).
Type of data warehouse | Good to use for the data warehouse core to simplify complex relationships (many:many). | Good for data marts with simple relationships (1:1 or 1:many).
Joins | Higher number of joins. | Fewer joins.
Dimension table | A snowflake schema may have more than one dimension table for each dimension. | A star schema contains only a single dimension table for each dimension.
When to use | When the dimension table is relatively big in size, snowflaking is better as it reduces space. | When the dimension table contains fewer rows, we can choose a star schema.
Normalization / de-normalization | Dimension tables are in normalized form but the fact table is in de-normalized form. | Both dimension and fact tables are in de-normalized form.
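To make the trade-off in joins and redundancy concrete, here is a minimal snowflake-style sketch using Python's built-in sqlite3 module. It assumes the location dimension is split into a location table and a separate city table; that split, the table and column names, the units_sold measure, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The location dimension is normalized into two tables.
    CREATE TABLE city_dim (
        city_key INTEGER PRIMARY KEY,
        city TEXT, province_or_state TEXT, country TEXT
    );
    CREATE TABLE location_dim (
        location_key INTEGER PRIMARY KEY,
        street TEXT,
        city_key INTEGER REFERENCES city_dim (city_key)
    );
    CREATE TABLE sales_fact (
        location_key INTEGER REFERENCES location_dim (location_key),
        units_sold INTEGER
    );
    -- Two streets in the same city share one city_dim row.
    INSERT INTO city_dim VALUES (1, 'Vancouver', 'British Columbia', 'Canada');
    INSERT INTO location_dim VALUES (1, '1 Main St', 1), (2, '9 Oak Ave', 1);
    INSERT INTO sales_fact VALUES (1, 5), (2, 3);
    """
)

# Reaching the city name now takes two joins instead of one.
units_by_city = conn.execute(
    """
    SELECT c.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    JOIN city_dim AS c ON l.city_key = c.city_key
    GROUP BY c.city
    """
).fetchall()

# The city, province, and country values are stored only once.
city_rows = conn.execute("SELECT COUNT(*) FROM city_dim").fetchone()[0]

print(units_by_city, city_rows)
```

The extra join is the cost of the normalization; the single city_dim row is the saving in redundancy.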
Queries?