UNIT II DATA WAREHOUSING
Data warehouse – characteristics and view - OLTP and OLAP - Design and development of data warehouse, Metadata models, Extract/Transform/Load (ETL) design
Data Warehousing – Overview
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection
of data. This data helps analysts to take informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on some data, such as a product, a supplier, or any consumer data; the executive will have no data available to analyze, because the previous data has been updated by those transactions.
A data warehouse provides us generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides us Online Analytical Processing (OLAP) tools. These tools help us in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
 A data warehouse is a database, which is kept separate from the organization's operational database.
 There is no frequent updating done in a data warehouse.
 It possesses consolidated historical data, which helps the organization to analyze its business.
 A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
 Data warehouse systems help in the integration of a diversity of application systems.
 A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons:
 An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
 Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
 An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
 An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
Data Warehouse Features
The key features of a data warehouse are discussed below:

Subject Oriented - A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it focuses on
modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.

Time Variant - The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from
the historical point of view.

Non-volatile - Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note: A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below:

Information Processing - A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical Processing - A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP
operations, including slice-and-dice, drill down, drill up, and pivoting.

Data Mining - Data mining supports knowledge discovery by finding hidden
patterns and associations, constructing analytical models, performing classification
and prediction. These mining results can be presented using the visualization tools.
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data
warehouse. These technologies help executives to use the warehouse quickly and
effectively. They can gather data, analyze it, and take decisions based on the information
present in the warehouse. The information gathered in a warehouse can be used in any of
the following domains:

Tuning Production Strategies - The product strategies can be well tuned by
repositioning the products and managing the product portfolios by comparing the
sales quarterly or yearly.

Customer Analysis - Customer analysis is done by analyzing the customer's
buying preferences, buying time, budget cycles, etc.

Operations Analysis - Data warehousing also helps in customer relationship
management, and making environmental corrections. The information also allows
us to analyze business operations.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches:

Query-driven Approach

Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was
used to build wrappers and integrators on top of multiple heterogeneous databases.
These integrators are also known as mediators.
Process of Query-Driven Approach

When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.

Now these queries are mapped and sent to the local query processor.

The results from heterogeneous sites are integrated into a global answer set, as in the sketch below.
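The mediator idea can be pictured in a few lines of Python. This is only an illustration; the site wrappers, the toy translation rules, and the returned rows are all hypothetical.

def wrapper_site_a(global_query):
    # Translate the global query into site A's local vocabulary and "run" it.
    local_query = global_query.replace("customer", "cust_master")
    print("site A runs:", local_query)
    return [{"customer": "Alice", "sales": 120}]   # stand-in local result

def wrapper_site_b(global_query):
    local_query = global_query.upper()             # a different toy translation rule
    print("site B runs:", local_query)
    return [{"customer": "Bob", "sales": 90}]

def mediator(global_query):
    """Send the translated query to each local query processor and merge the answers."""
    results = []
    for wrapper in (wrapper_site_a, wrapper_site_b):
        results.extend(wrapper(global_query))
    return results                                  # the global answer set

print(mediator("select customer, sales"))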
Disadvantages

Query-driven approach needs complex integration and filtering processes.

This approach is very inefficient.

It is very expensive for frequent queries.

This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages:
 This approach provides high performance.
 The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.
 Query processing does not require an interface to process data at local sources.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities:

Data Extraction - Involves gathering data from multiple heterogeneous sources.

Data Cleaning - Involves finding and correcting the errors in data.

Data Transformation - Involves converting the data from legacy format to
warehouse format.

Data Loading - Involves sorting, summarizing, consolidating, checking integrity,
and building indices and partitions.

Refreshing - Involves updating from data sources to warehouse.
Note: Data cleaning and data transformation are important steps in improving the quality
of data and data mining results.
Data Warehousing - Terminologies
Metadata
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.
In terms of a data warehouse, we can define metadata as follows:

Metadata is a road-map to data warehouse.

Metadata in data warehouse defines the warehouse objects.

Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the
following metadata:

Business metadata - It contains the data ownership information, business
definition, and changing policies.

Operational metadata - It includes currency of data and data lineage. Currency of
data refers to the data being active, archived, or purged. Lineage of data means
history of data migrated and transformation applied on it.

Data for mapping from the operational environment to the data warehouse - This metadata includes source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.

The algorithms for summarization - It includes dimension algorithms, data on granularity, aggregation, summarizing, etc. (A small sketch of such repository entries follows.)
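As an illustration, the kinds of entries listed above could be represented as simple record types. This is a minimal sketch, not a standard layout; all field names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class BusinessMetadata:
    owner: str                    # data ownership information
    business_definition: str
    change_policy: str

@dataclass
class OperationalMetadata:
    currency: str                 # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)   # history of migrations and transformations

@dataclass
class MappingMetadata:
    source_database: str
    extraction_rule: str
    transformation_rule: str
    refresh_and_purge_rule: str

repository = [
    BusinessMetadata("Sales dept.", "Monthly revenue per branch", "review yearly"),
    OperationalMetadata("active", ["extracted from orders_db", "aggregated by month"]),
    MappingMetadata("orders_db", "full extract nightly", "map gender codes to M/F", "purge after 5 years"),
]
for entry in repository:
    print(entry)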
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions
and facts. The dimensions are the entities with respect to which an enterprise preserves
the records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of monthly sales and of the branches at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.
A 2-D view of the sales data for the company shows records with respect to the time and item dimensions only; for example, the sales for New Delhi are shown with respect to time and item, according to the type of items sold. If we want to view the sales data with one more dimension, say, the location dimension, then a 3-D view is useful. The 3-D view of the sales data with respect to time, item, and location can be represented as a 3-D data cube.
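A minimal sketch of these 2-D and 3-D views, assuming pandas is available; the dimension values and sales figures below are made up for illustration.

import pandas as pd

# Toy sales facts over the dimensions described above (time, item, location).
sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["mobile", "modem", "mobile", "modem", "mobile", "modem"],
    "location":   ["New Delhi", "New Delhi", "New Delhi", "New Delhi", "Gurgaon", "Gurgaon"],
    "units_sold": [605, 825, 680, 952, 300, 150],
})

# 2-D view: time x item for a single location (New Delhi).
view_2d = sales[sales.location == "New Delhi"].pivot_table(
    index="time", columns="item", values="units_sold", aggfunc="sum")
print(view_2d)

# 3-D view: add the location dimension as a further column level.
cube_3d = sales.pivot_table(index="time", columns=["location", "item"],
                            values="units_sold", aggfunc="sum")
print(cube_3d)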
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
 Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
 The life cycle of data marts may be complex in the long run, if their planning and design are not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally structured data warehouse.
 Data marts are flexible.
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but doing so requires excess capacity on operational database servers.
OLTP vs OLAP
One of the most important questions regarding information systems is the difference
between OLAP and OLTP. Based on that, we built this article to explain further on these
ideas and to solidify your knowledge of them. To fully understand and compare these
two types of systems you have to know what they are, and how they work individually.
So, first we prepared a lot of information about OLAP and OLTP, concluding the resource with a comparative analysis between them.
Let’s jump right away to the learning process!
In the next chapters, we'll be describing each topic in a complete, yet simple way.
What is OLAP?
Online analytical processing is a computer technology term referring to systems focused on analysing data in a specific database. These kinds of systems are characterized by their analytical capabilities, addressing multidimensional or one-dimensional data and processing all the information. The standard applications of OLAP are business intelligence, data analysis and reporting, through data mining processes.
OLAP operations and databases
On the database level, these systems' operation is defined by a low volume of transactions, dealing with archived and historical information. This data is seldom updated, identifying the SELECT database operation as the key feature of the system. Therefore, this kind of database is based on READ operations, aggregating all available information.
Databases that work as data warehouses apply this methodology, optimizing the reading and aggregation operations of their multidimensional data model, thus providing great support for the data analysis and reporting operations that are critical in these kinds of databases.
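A minimal sketch of this read-heavy, aggregation-oriented access pattern, using SQLite from Python; the fact table and its columns are hypothetical.

import sqlite3

# Build a toy fact table and run a read-only, aggregating (OLAP-style) query over it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2022, "North", 120.0), (2022, "South", 80.0),
    (2023, "North", 150.0), (2023, "South", 95.0),
])

# SELECT with aggregation over historical data: the typical OLAP access pattern.
for row in conn.execute(
        "SELECT year, region, SUM(amount) FROM sales GROUP BY year, region"):
    print(row)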
Data cube
The main component of these systems is the OLAP cube. A cube consists of a combination of the data warehouse's structures, namely facts and dimensions. Those are organized as schemas: star schema, snowflake schema, and fact constellation. The merging of all the cubes creates a multidimensional data warehouse.
System types
There are many types of OLAP systems, depending on their structural characteristics. The most common ones are MOLAP, ROLAP, and HOLAP.
The most important real-world applications of these systems are business management and reporting, financial reporting, marketing, research, and other data-related areas. These processes are growing fast these days, making them absolutely critical in a world that is becoming dependent on data. The next paragraph provides a real-world example of what we described before.
Real World Example: In a hospital, there are 20 years of very complete patient information stored. Someone in the administration wants a detailed report of the most common diseases, the success rate of treatments, hospitalization days, and a lot of other relevant data. For this, we apply OLAP operations to our data warehouse with historical information, and through complex queries we get these results. They can then be reported to the administration for further analysis.
What is OLTP?
Online Transaction Processing is an information system type that prioritizes transaction processing, dealing with operational data. These computer systems are identified by the large number of transactions they support, making them the best fit for online applications. The main applications of this method are all kinds of transactional systems, such as commercial and hospital applications.
In a simple way, these systems gather input information and store it in a database at large scale. Most of today's applications are based on this interaction methodology, with implementations of centralized or decentralized systems.
OLTP database and operations
On the database level, these transactional systems base their operation on multi-access, fast, and effective queries to the database. The most used operations are INSERT, UPDATE, and DELETE, since they directly modify the data, providing new information on new transactions. So, in these systems, data is frequently updated, requiring effective support for write operations.
One special characteristic of these databases is the normalization of their data. This happens because data normalization provides a faster and more effective way to perform database writes. The main concern is the atomicity of the transactions, ensuring that concurrent accesses do not damage data and do not degrade the system's performance.
Other systems
OLTP is not only about databases, but also other types of interaction mechanisms. All client-server architectures are based on these processes, taking advantage of the fast transaction and concurrency models. Decentralized systems are also online transaction processing, as broker programs and web services are transaction oriented.
Real World Example: A banking transaction system is a classic example. There are many users executing operations on their accounts, and the system must guarantee the completeness of the actions. In this case there are several concurrent transactions at the same time, with data coherence and efficient operations being the main goals.
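A minimal sketch of such an OLTP-style transfer, using SQLite from Python; the account data is made up, and the point is only that each transfer commits or rolls back atomically.

import sqlite3

# OLTP-style write workload: many small UPDATEs, each of which must commit as a unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts; commit both updates together or roll both back."""
    try:
        with conn:  # the connection context manager commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    except sqlite3.Error:
        print("transfer rolled back")

transfer(1, 2, 50.0)
print(conn.execute("SELECT * FROM accounts").fetchall())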
Comparing OLTP vs OLAP
OLTP, also known as Online Transaction Processing, and OLAP which stands for Online
Analytical Processing, are two distinct kinds of information systems technologies.
Both are related to information databases, which provide the means and support for
these two types of functioning.
Each one of the methods creates a different branch of data management systems, with its own ideas and processes, but they complement each other. To analyse and compare them we've built this resource!
Basically, OLAP and OLTP are very different approaches to the use of databases, but not only that. On one hand, online analytical processing is more focused on data analysis and reporting; on the other hand, online transaction processing targets a transaction-optimized system, with a lot of data changes.
For someone learning about data science and related IT methods, it is important to know the difference between these two approaches to information. This is the base idea behind systems like business intelligence, data mining, data warehousing, data modelling, ETL processes, and big data.
Considering the previous descriptions of the systems, we can compare them across many distinct categories, as summarized in the OLAP vs OLTP comparison table given earlier in this unit.
Design methods
Bottom-up design
In the bottom-up approach, data marts are first created to provide reporting and analytical
capabilities for specific business processes. These data marts can then be integrated to
create a comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.[15]
Top-down design
The top-down approach is designed using a normalized enterprise data model. "Atomic"
data, that is, data at the greatest level of detail, are stored in the data warehouse.
Dimensional data marts containing data needed for specific business processes or specific
departments are created from the data warehouse.[16]
Hybrid design
Data warehouses (DW) often resemble the hub and spokes architecture. Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning, generating large amounts of data. To consolidate these various data models, and facilitate the extract transform load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual DW. To reduce data redundancy, larger systems often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW.
The DW database in a hybrid solution is kept in third normal form to eliminate data
redundancy. A normal relational database, however, is not efficient for business
intelligence reports where dimensional modelling is prevalent. Small data marts can shop
for data from the consolidated warehouse and use the filtered, specific data for the fact
tables and dimensions required. The DW provides a single source of information from
which the data marts can read, providing a wide range of business information. The hybrid
architecture allows a DW to be replaced with a master data management solution where
operational, not static information could reside.
The Data Vault Modeling components follow hub and spokes architecture. This modeling
style is a hybrid design, consisting of the best practices from both third normal form
and star schema. The Data Vault model is not a true third normal form, and breaks some of
its rules, but it is a top-down architecture with a bottom up design. The Data Vault model is
geared to be strictly a data warehouse. It is not geared to be end-user accessible, which
when built, still requires the use of a data mart or star schema based release area for
business purposes.
What Is A Metadata Model?
This article is about creating metadata models for digital asset management. In simple
terms, a metadata model is how you will represent the metadata stored about your digital
assets. It is like the blueprint or DNA that will be used each time a DAM user catalogues an
asset.
Why Do You Need A Metadata Model?
Metadata models define the essential characteristics of your assets in a way that is unique
to you and your organisation. They describe a series of key entities or classifications. As
well as cataloguing, metadata models can get populated by other activity on a DAM system,
for example, workflow to request approval to use an asset. Any activity on a DAM system
where users or processes interact with assets takes place within the framework of the
metadata model. You will find it touches nearly every element of a DAM implementation –
which is why it is important you give it sufficient consideration when planning for digital
asset management.
What Goes Into A Metadata Model?
There are many different ways to describe metadata models, but provided all the required information is captured, the simpler they are the better. A list of the key items of data you need to store, such as the one shown in the previous section, is the starting point, but you
will probably want to expand that to define how users will enter metadata. For example,
will it be from a fixed list (e.g. a controlled vocabulary) or perhaps free text, maybe
numbers or dates. If allowing users to choose from pre-determined selections, will you
allow them one option or many? You can record these decisions in a spreadsheet or build
simple prototypes using the built-in capabilities of the system. Screen mock-ups of what
the interface will look like are another technique.
An issue you can get into when discussing metadata models with colleagues is they may
tend to concentrate too much on the content or ‘ingredients’ that might go into the fields.
For example, if your DAM system will hold marketing materials about your firm’s products,
they might reel off lists of product brand names or model numbers. These are important
and keeping records of them is a good idea, but when devising metadata models, you are
more interested in the range of potential classifications – the breadth rather than the depth
if you want to think about it in spatial terms. Information architects and other DAM experts might refer to this as the 'schema', and that description should give you a clue that this is about overall design decisions and metadata strategy rather than specific values.
One area where it is important to analyze the range of data that might need to be held in a metadata model is in assessing the number of different values that might need to be stored in a given field. This will help determine what kind of interface controls are best suited for it. For example, if every entry is totally different, a free text field would be a good idea. For a small number of mutually exclusive options, radio buttons are more suitable. On other occasions, you might use a hierarchical taxonomy which links to a faceted search. The number of items used can make some interface choices more or less appropriate than others.
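One possible way to record these decisions, sketched in Python; the field names, controls, and vocabularies below are hypothetical examples, not a prescribed schema.

from dataclasses import dataclass

@dataclass
class MetadataField:
    name: str
    control: str                 # "free text", "controlled vocabulary", "radio buttons", "date", ...
    allowed_values: tuple = ()   # populated only for pre-determined selections
    multi_select: bool = False   # may the cataloguer pick one option or many?

model = [
    MetadataField("Title", "free text"),
    MetadataField("Asset type", "radio buttons", ("photo", "video", "document")),
    MetadataField("Product line", "controlled vocabulary", ("Brand A", "Brand B"), multi_select=True),
    MetadataField("Shoot date", "date"),
]
for f in model:
    print(f)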
WHAT ARE METADATA?
This Book (Meta-metadata) | Elements of metadata (a metadata model) | Objects: "Entity Class", "Attribute" | Objects: "Entity Class", "Attribute", "Role" | Objects: "Table", "Column" | Objects: "Program module", "Language"
Data Management (Metadata) | Data about a database (a data model) | Entity class: "Employee", "Customer"; Attributes: "Name", "Birthdate" | Entity class: "Branch"; Attributes: "Employee.Address", "Employee.Name"; Role: "Each branch must be managed by exactly one Employee" | Table: "CHECKING_ACCOUNT"; Columns: "Account_number", "Monthly_charge" | Program module: ATM Controller; Language: Java
IT Operations (Instance Data) | Data about real-world things (a database) | Customer Name: "Julia Roberts"; Customer Birthdate: "10/28/67" | Branch Address: "111 Wall Street"; Branch Manager: "Sam Sneed" | CHECKING_ACCOUNT.Account_number = "09743569"; CHECKING_ACCOUNT.Monthly_charge: "$4.50" | ATM Controller: Java code
Real-world things | | Julia Roberts | Wall Street branch | Checking account #09743569 | ATM Withdrawal
Extract/ Transform / Load (ETL) design
 Data extraction – extracts data from homogeneous or heterogeneous data sources
 Data transformation – transforms the data for storing it in the proper format or
structure for the purposes of querying and analysis
 Data loading – loads it into the final target (database, more specifically, operational
data store, data mart, or data warehouse)
Since the data extraction takes time, it is common to execute the three phases in parallel.
While the data is being extracted, another transformation process executes. It processes
the already received data and prepares it for loading. As soon as there is some data ready
to be loaded into the target, the data loading kicks off without waiting for the completion of
the previous phases.
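A minimal sketch of this pipelining idea using Python generators; the source rows and the transformation rule are made up, and a real system would read from and write to actual stores.

# Records flow into the transform and load steps as soon as they are extracted,
# rather than waiting for each phase to finish completely.

def extract():
    source_rows = [{"id": 1, "salary": "1000"}, {"id": 2, "salary": "2000"}]
    for row in source_rows:
        yield row                              # hand each row on as soon as it is read

def transform(rows):
    for row in rows:
        row["salary"] = int(row["salary"])     # convert to the warehouse format
        yield row

def load(rows, target):
    for row in rows:
        target.append(row)                     # write to the (toy) target store

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)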
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer hardware.
The disparate systems containing the original data are frequently managed and operated
by different employees. For example, a cost accounting system may combine data from
payroll, sales, and purchasing.
Extract
The first part of an ETL process involves extracting the data from the source system(s). In
many cases this represents the most important aspect of ETL, since extracting data
correctly sets the stage for the success of subsequent processes.
ETL Architecture Pattern
Most data-warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data-source formats include relational databases, XML, and flat files, but may also include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen-scraping. The streaming of the extracted data source and loading on-the-fly to the destination database is another way of performing ETL when no intermediate data storage is required.
In general, the extraction phase aims to convert the data into a single format appropriate
for transformation processing.
An intrinsic part of the extraction involves data validation to confirm whether the data
pulled from the sources has the correct/expected values in a given domain (such as a
pattern/default or list of values). If the data fails the validation rules it is rejected entirely
or in part. The rejected data is ideally reported back to the source system for further
analysis to identify and to rectify the incorrect records. In some cases, the extraction process itself may have to apply a data-validation rule in order to accept the data and pass it on to the next phase.
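A small sketch of extraction-time validation in Python; the domain rules (an allowed list of gender codes, numeric account numbers) are hypothetical.

# Each extracted record is checked against expected domains; rejected rows are
# collected so they can be reported back to the source system for analysis.

VALID_GENDERS = {"M", "F"}

def validate(record):
    errors = []
    if record.get("gender") not in VALID_GENDERS:
        errors.append("gender outside expected list of values")
    if not str(record.get("account_no", "")).isdigit():
        errors.append("account_no does not match the expected pattern")
    return errors

extracted = [{"account_no": "123", "gender": "M"}, {"account_no": "abc", "gender": "X"}]
accepted, rejected = [], []
for rec in extracted:
    problems = validate(rec)
    (rejected if problems else accepted).append((rec, problems))

print("accepted:", accepted)
print("rejected for analysis:", rejected)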
Transform
In the data transformation stage, a series of rules or functions are applied to the extracted
data in order to prepare it for loading into the end target. Some data does not require any
transformation at all; such data is known as "direct move" or "pass through" data.
An important function of transformation is the cleaning of data, which aims to pass only
"proper" data to the target. The challenge when different systems interact is in the relevant
systems' interfacing and communicating. Character sets that may be available in one
system may not be so in others.
In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the server or data warehouse (a small sketch follows the list):
 Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (aka "attributes"), roll_no, age, and salary, then the selection may take only roll_no and salary. Or, the selection mechanism may ignore all those records where salary is not present (salary = null).
 Translating coded values (e.g., if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
 Encoding free-form values (e.g., mapping "Male" to "M")
 Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
 Sorting or ordering the data based on a list of columns to improve search performance
 Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
 Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
 Generating surrogate-key values
 Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
 Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
 Disaggregating repeating columns
 Looking up and validating the relevant data from tables or referential files
 Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step depending on the rule design and exception handling; many of the above transformations may result in exceptions, e.g., when a code translation parses an unknown code in the extracted data
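A few of the transformation types above, applied to made-up source rows in a short Python sketch: column selection, code translation, a derived value, and deduplication.

source = [
    {"roll_no": 1, "age": 30, "gender": "1", "qty": 3, "unit_price": 10.0},
    {"roll_no": 2, "age": 25, "gender": "2", "qty": 2, "unit_price": 20.0},
    {"roll_no": 2, "age": 25, "gender": "2", "qty": 2, "unit_price": 20.0},  # duplicate row
]

GENDER_CODES = {"1": "M", "2": "F"}   # translating coded values

transformed, seen = [], set()
for row in source:
    key = row["roll_no"]
    if key in seen:                    # deduplicating the data
        continue
    seen.add(key)
    transformed.append({
        "roll_no": row["roll_no"],                       # selecting only certain columns
        "gender": GENDER_CODES.get(row["gender"], "?"),
        "sale_amount": row["qty"] * row["unit_price"],   # deriving a new calculated value
    })

print(transformed)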
Load
The load phase loads the data into the end target that may be a simple delimited flat file or
a data warehouse. Depending on the requirements of the organization, this process varies
widely. Some data warehouses may overwrite existing information with cumulative
information; updating extracted data is frequently done on a daily, weekly, or monthly
basis. Other data warehouses (or even other parts of the same data warehouse) may add
new data in a historical form at regular intervals—for example, hourly. To understand this,
consider a data warehouse that is required to maintain sales records of the last year. This
data warehouse overwrites any data older than a year with newer data. However, the entry
of data for any one year window is made in a historical manner. The timing and scope to
replace or append are strategic design choices dependent on the time available and
the business needs. More complex systems can maintain a history and audit trail of all
changes to the data loaded in the data warehouse.
As the load phase interacts with a database, the constraints defined in the database schema
— as well as in triggers activated upon data load — apply (for example,
uniqueness, referential integrity, mandatory fields), which also contribute to the overall
data quality performance of the ETL process.
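A small sketch of a load step in Python with SQLite, appending rows in historical form with a load timestamp and letting the schema's constraints reject bad rows; the table layout is hypothetical.

import sqlite3
from datetime import datetime, timezone

# Target table with a uniqueness constraint and a NOT NULL column, so the schema
# itself rejects rows that violate it during the load.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_history (
                   sale_id INTEGER,
                   loaded_at TEXT,
                   amount REAL NOT NULL,
                   PRIMARY KEY (sale_id, loaded_at))""")

def load(rows):
    stamp = datetime.now(timezone.utc).isoformat()   # rows are appended in historical form
    for sale_id, amount in rows:
        try:
            conn.execute("INSERT INTO sales_history VALUES (?, ?, ?)",
                         (sale_id, stamp, amount))
        except sqlite3.IntegrityError:
            print(f"row {sale_id} rejected by schema constraints")
    conn.commit()

load([(1, 99.5), (2, 10.0)])
print(conn.execute("SELECT * FROM sales_history").fetchall())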
For example, a financial institution might have information on a customer in several
departments and each department might have that customer's information listed in a
different way. The membership department might list the customer by name, whereas the
accounting department might list the customer by number. ETL can bundle all of these
data elements and consolidate them into a uniform presentation, such as for storing in a
database or data warehouse.
Another way that companies use ETL is to move information to another application
permanently. For instance, the new application might use another database vendor and
most likely a very different database schema. ETL can be used to transform the data into a
format suitable for the new application to use.
An example would be an Expense and Cost Recovery System (ECRS) such as used
by accountancies, consultancies, and legal firms. The data usually ends up in the time and
billing system, although some businesses may also utilize the raw data for employee
productivity reports to Human Resources (personnel dept.) or equipment usage reports to
Facilities Management.
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps (a small sketch follows the list):
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
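The cycle can be pictured as a simple driver that runs these steps in order; in this Python sketch every step is a placeholder to be replaced by real extraction, transformation, and load logic.

def build_reference_data(): print("reference data built")
def extract():              print("extracted from sources"); return ["raw rows"]
def validate(rows):         print("validated"); return rows
def transform(rows):        print("cleaned, business rules applied, aggregates created"); return rows
def stage(rows):            print("loaded into staging tables"); return rows
def audit(rows):            print("audit report produced")
def publish(rows):          print("published to target tables")
def archive(rows):          print("archived")

def run_cycle():
    # Cycle initiation, then the remaining steps in order.
    build_reference_data()
    rows = extract()
    rows = validate(rows)
    rows = transform(rows)
    rows = stage(rows)
    audit(rows)
    publish(rows)
    archive(rows)

run_cycle()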