Download Training Plan for Stream II

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Entity–attribute–value model wikipedia , lookup

Clusterpoint wikipedia , lookup

Big data wikipedia , lookup

Database model wikipedia , lookup

Functional Database Model wikipedia , lookup

Transcript
Data warehouse Concepts and Basics
Session objectives and scope
• Definition
• Fundamentals of data warehousing, how it works and
what it serves
• Basic Elements of Data warehouse and generic
architecture
• Models – Star Schema, Snowflake Schema
• OLTP v/s OLAP
• DW v/s Data Mart
• What is ETL
• ETL Processes
• OLAP – different types
Ground rules
• Interactive
• Participation amongst all
• Please set your cell phones to vibration mode
Data warehouse : the definition
•
A warehouse is place where goods are physically stocked, to facilitate
smooth flow of business without any production downtime or crisis.
In layman’s word:
• A data warehouse is read only database which copies/stores the data from
the transactional database.
Ralph Kimball:
• A data warehouse is a structured extensible environment designed for the
analysis of non-volatile data, logically and physically transformed from
multiple source applications to align with business structure, updated and
maintained for a long time period, expressed in simple business terms, and
summarized for quick analysis.
What is a data warehouse?
•
•
“A warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management’s decision making process.”
English translation:
– Subject-oriented data refers to data that provides information about a particular
subject (e.g. customers, transactions, products, vendors, etc.) rather than a
company’s ongoing operations.
– Integrated data means that data is gathered from a variety of different sources,
linked together, and then aggregated in one location (i.e. the warehouse) to
establish a comprehensive dataset.
– Time-variant signifies that the data elements vary over time. For example,
customer data in a warehouse should be updated as changes occur to customer
accounts.
– Non-volatile implies that the data in the warehouse is “stable.” That is, new data
is added to the warehouse but old data isn’t removed. Maintaining old data
provides managers with a consistent picture of the business as well as a
historical perspective for analyzing trends. (Note: Some data warehouses can
be volatile. Based on storage constraints, only a certain number of periods of
history are kept in the warehouse.)
•
Data warehouse versus data mart?
– Data mart refers to a single-subject data warehouse.
– Data warehouses are typically enterprise in scope.
Need of data warehouse
•
Analysis of data
•
Flexibility and broad scope
•
Faster retrieval of data
•
Subject-oriented information capture
•
Future projections and top-level control
Some other terms used in the Data Warehousing world
•
ODS (Operational Data Stores)
•
An operational data store (ODS) is a type of database often used as an interim area for a data
warehouse. Unlike a data warehouse, which contains static data, the contents of the ODS are updated
through the course of business operations. An ODS is designed to quickly perform relatively simple queries
on small amounts of data (such as finding the status of a customer order), rather than the complex queries
on large amounts of data typical of the data warehouse. An ODS is similar to your short term memory in
that it stores only very recent information; in comparison, the data warehouse is more like long term
memory in that it stores relatively permanent information.
Data Warehouse
•
A data warehouse is a central repository for all or significant parts of the data that an enterprise's various
business systems collect. Typically, a data warehouse is housed on an enterprise mainframe server. Data
from various online transaction processing (OLTP) applications and other sources is selectively extracted
and organized on the data warehouse database for use by analytical applications and user queries. Data
warehousing emphasizes the capture of data from diverse sources for useful analysis and access.
Data Marts
A data mart is a repository of data gathered from operational data and other sources that is designed to
serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide
database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the
specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and
ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.
Where does the “data” come from? Where does it go?
•
•
Data is selectively extracted from online transaction processing (OLTP)
applications and various other sources and consolidated, cleansed,
organized in a separate database which then becomes the warehouse.
A data warehouse is typically located on an server different from the one
that houses most enterprise-wide applications (e.g. payroll, email server,
customer data portals, etc.)
Data Warehousing Lifecycle
Fragmented
data sources
(Metadata)
Staging
Extract,
transform,
& load
Cleansing
Consolidate
(Warehouse)
Export
(Marts)
Assign
Semantics
Access &
Analyze
Hence..
•
Data warehouse is a concept/architecture, not a product
•
Set of hardware and software components used to analyze huge amount of
data
•
Intelligent way of managing data
•
Data -> Information -> Knowledge -> Decision -> Control
Step-wise break-up of warehousing activity
Source systems
Data warehouse storage
OLAP Server
End user views
Generic Architecture
Presentation
Layer
Warehouse
administration
Source 1
Semantic
Layer
Universes
Static
Report
Data Mart
Source 1
ODS
Catalogues
Source 3
Extract,
transform &
load
Data
Warehouse
• Consolidated to
one data
warehouse
Dashboards
Data Mart
MDB
Source 4
• Operational
and expanded
data
Data Mart
Variable
Report
MDB
Source N
Data extraction,
transformation & load
Portal
Metadata
Sources (Mainframe
/ Non-Mainframe
Metadata Repository
Operational Data Store
Data
warehouse
Data marts
Data access
and analysis
Source Systems
•
•
•
•
An operational system of record whose function is basically to capture
the transactions of the business.
A source system can be a “legacy system” in a mainframe environment
or can be “flat files” or can also be from “multiple RDBMS systems”.
The main priorities of the source system are uptime and availability.
Source systems are not queried in the broad and unexpected ways as
compared to that of the data warehouses which are typically queried.
Data staging area
•
•
•
•
•
A storage area and set of processes that clean, transform, combine,
duplicate, archive, and prepare source data for use in the data
warehouse.
The data staging area is everything in between the source system and
the presentation server.
The staging area is dominated by the simple activities of sorting and
sequential processing.
The data staging area does not need to be based on relational
technology.
Data staging area does not provide query and presentation services.
Semantic Layer illustration
•
•
•
The semantic layer defines a set of “logical” relationships which translate complex data
terminology into business terms that will be familiar to end-users (e.g. product, customer, revenue,
etc.)
For example, if a user wants to create a report showing product revenue generated from all
customers for the past three years, the user simply chooses four objects: customer, product,
revenue and time.
Thanks to the semantic layer, the user only has to be familiar with the business terminology
relevant for his/her query. (There is no need to understand the back-end database structure.)
End-user
User Interface
Semantic Layer
Semantic Layer acting as
primary interface to DB
tables. Hiding the
underlying data
complexity
Data Schema
Source data
Presentation layer illustration
• The presentation layer enables users to customize the way data (and data analysis) is presented to
them.
• Through the presentation layer, users can create custom reports, access standard reports and
access “dashboards” with drill-down capabilities as well as alerting functions.
Summary of different layers
•
•
The target physical machine on which the data warehouse data is organized
and stored for direct querying by end users, report writers, and other
applications.
To Summarize, three very different systems are required for a data warehouse
to function:
– the source system
– the data staging area
– and the presentation server.
•
•
•
The source system should be outside the DW.
The data staging area is described as the initial storage and cleaning system
for DW.
The presentation server presents and stores in a dimensional framework.
What is the process for developing a data warehouse?
Formulate data
management strategy
What are the goals for the data warehouse?
(e.g. improved customer insights)
Clean data and
populate warehouse
Data should be cleaned to the extent possible.
Design the schemas, make pre-built total
tables, create semantic layer design,
establish meta data design, etc.
Architect data
warehouse
Include a business intelligence solution
to enable business user interaction.
Build prototype for
end-user testing
This is a critical step because business users
often do not understand their requirements
until they interact with the system.
Test prototype with
end-users
Incorporating end-user feedback
ensures a robust design that satisfies
business requirements.
Finalize warehouse
design based on
feedback
What are the business benefits associated with data warehousing?
Qualitative Benefits
•
•
•
Data analysis
– Integrated data provides complete picture
of enterprise
– Time-variant data enables easier trend
analysis
– Subject-oriented data allows decision
makers who can refine their queries
– Custom views of data at different levels of
aggregation and along different
dimensions
– Reduced complexity of user queries
through “semantics” layer.
Data quality
– Data integration provides “single version
of the truth”
– Proactive data cleansing improves data
quality
– Creation of a corporate “data dictionary”
resolves differences in terminology
Redirected resources
– Frees up IT resources from report creation
– Reduces traffic on On-Line Transaction
Processing (OLTP) systems
Quantitative Benefits
•
•
•
Process-specific cost reduction: reduces
costs incurred due to lack of information or
wrong information. For example:
– Lack of integrated data available for
forecasting can lead to carrying too
much/little inventory
– Lack of timely information on defaults in
credit cards can cause increased bad
debts on other loans/debt
Process-specific growth opportunity:
reduces missed growth opportunities due to lack
of information. For example:
– Lack of integrated view of customer
buying behavior impeded cross selling
Infrastructure cost reduction
– Automation of existing manual process
reduces IT involvement and monitoring.
– Reduction in data marts and associated
hardware when implementing an
enterprise data warehouse
– Rationalization of disparate sets of tools
used for reporting
Data warehousing notions and reality…
Notion
Reality Check
Data warehouse is a database
that stores data
•Although the data for the data warehouse resides in a
database (Oracle, DB2, etc.), it should not be confused with an
application/transactional database. Any changes to the data
should take place in the applications.
•The data in the data warehouse is specifically used for
querying. It improves the speed of access to information
upon which decisions are made.
The data warehouse is going to
contain clean data
•Data Warehouse data is as clean as the data it receives from
the Source Systems.
•A perfectly clean data environment is likely to happen only
over a period of time as we make process and systems
changes within marketing and across other functional areas
within the organization.
The data warehouse is going to
contain data from all sources so
that they can be used any time if
needed in the future
This is also likely to happen over a period of time. The
recommendation is to phase in the implementation of the
data warehouse to get to this end goal.
Data Warehouse End-to-End Development Process
Discover
Architect
• Study Business Drivers/validate
vision phase
• Understand the vision for analytics
• Drill down to business capabilities, with
focus on a particular analytics (could be
sales analytics, marketing analytics and so
on)
• Prioritize overall enterprise analytics needs
and understand at high level
• Study As Is Analytics landscape
phase
• Current BI capabilities, tools, technologies
• Determine root cause for existing problems
• Study environment complexity: Data
sources, diversity, data quality issues,
technical infrastructure
• Current BI governance set up
• Define robust BI Solution
Framework phase
• Scaleable data architecture to provide data
backbone for all analytics needs
• Optimal approach/methodology for iterative
build
• Reporting/BI strategy that meets canned,
ad hoc & power user needs for analytics
• High level roadmap with key milestones for
realizing the enterprise information
architecture.
Construct
•Data Model Design
- Relational
- Multi-Dimensional
•Capacity Planning
•Performance profiling
•Detailed Architecture
definition
•Prototype
•Database creation
•Unit testing and UAT
plans
•Create Manuals
-Operation Manual
-System Manual
-Installation Notes
-User Manual
Design
•Dimensions and
Measures
•Aggregation/Analysis
•Distribution/Security
•Operational Batch
Configure
•BI tool Meta Data
•Users/access Rights
•Report Distribution
•Operational Batch
•Program Aggregation/
Analysis
•Configure ETL tool
Metadata
•Program ETL
•Configure Operational
Batch
Design
•Extract/ Data Quality
•Data Transformation
•Operational Batch
Design of Metadata
Metadata
•Integration of Metadata •Integrate Metadata
•Publish Data Dictionary •Publish Metadata
Validate
Prepare for System
Testing
•Functional Test Plan
•Performance Test
Plan
•Data Quality Test
Plan
Deploy
Prepare
deployment/ Roll
out plan
Migrate to
production
Execute System Test environment
•Functional Test Plan
•Performance Test
Plan
•Data Quality Test
Plan
Set up end user
desktop
environment
Performance Tuning
•Extracts, Loads
Set up ‘Help
•Quality Check
Desk’
•Transformation
•Aggregation
•Caching of Reports
Roll Out solution
•Design usage profiling
Design
•Partitions, Indexes
•Execution plans ….
User Training
•BI Tools Training
Optimize the warehouse •System Training
database, ETL scripts
UAT /Enhancements
custom components
•Extracts, data Quality
•Transformations
•Aggregation, Analysis
Construct Custom Components
•Extracts, Transformations
•Quality Checks
•Aggregation, Analysis
-Discover phase and Architect phase can be treated as Requirement Analysis and Design
phase respectively
-Construct phase refers to the Build phase
Production
Support
Batch
operations
•Periodic Run of
operational
Batch
•Error logging,
reporting and
rectification
•Minor
Enhancements
Architecture Components & Providers
Component
Providers
Source Systems - Operational or legacy systems of record whose function it is to capture the
transactions of the business. The data in these systems can be in many formats from flat files to
hierarchical, and relational databases.
• Infinite
possibilities
Data Warehouse Database - The data warehouse database collects and stores integrated sets of
historical, non-volatile data from multiple operational systems and feeds them to one or more data
marts. It becomes the one source of the truth for all shared data.
•
•
•
•
•
Data Marts - All data marts extract data from the data warehouse. The data marts contain subject
specific information supporting the requirements of the end users in individual business units. Data
marts can provide rapid response to end-user requests using aggregates stored in the data mart.
• SQL Server
• Oracle
Extract Transform Load (ETL) - ETL tools are used to extract data from data sources, cleanse
the data, perform data transformations, and load the target data warehouse. The ETL tool is also
used to load the data marts. The ETL tool maintains a metadata repository of all source to target
transformations, this information is available to end users for research and problem resolution.
•
•
•
•
Metadata Repository - Offers a way to understand what information is available, where it comes
from, where it is stored, the transformation performed on the data. Technical metadata can be
generated and maintained by an ETL tool as part of the specification of the extraction,
transformation and load process. The repository can also capture the operational statistics during
the operation of the ETL process.
• Computer
Associates
• Allen Systems
Group
Data Modeling Tool - Used to specify logical and physical models for data warehouse tables, and
to create meta data for these tables, which can then can be imported to the central meta data.
• Erwin
• Embarcadero
Teradata
SQL Server
Netezza
Oracle
etc
Informatica
SAS
Hummingbird
etc.
Architecture Components & Providers (cont’d)
Component
Providers
Data Cleansing Tool - Used to remove errors and inconsistencies from data being imported to a
data warehouse.
• First Logic
Semantic Layer - A logical layer of defined 'business objects', which are objects with business
terminology that are mapped to the respective tables and columns in the data warehouse. This
layer enables end-users to “ask” questions to the data warehouse in relevant business terminology.
(See following slides for illustration)
• Business Objects
• Cognos
• Hyperion
Presentation Layer - Provides functionality for users to specify queries, create arbitrary reports,
and to analyze their own data using drill-down and On-Line Analytical Processing (OLAP) functions
(See following slides for illustrations)
• Business Objects
• Cognos
• Microstrategy
Warehouse Administration - Tools used to monitor end user activity inside the warehouse and/or
the data marts, monitor loads times to the warehouse and ensure that the database is optimized
for performance.
• Ambeo
• Bradmark
Technology, Inc.
• Embarcadero
Technologies, Inc.
Dimensional Model
•
•
•
•
A specific discipline for modeling data that is an alternative to entity
relationship (E/R) modeling.
Required to make database simple and understandable
Gives user the ability to visualize data
The main components of a dimensional model are fact tables and
dimension tables.
– A fact table is the primary table in each dimensional model that is
meant to contain measurements of the business.
– A dimension table is one of a set of companion tables to a fact
table. Each dimension is defined by its primary key that serves as
the basis for referential integrity with any given fact table to which it
is joined.
DW Schema Models
•
•
•
We have two schema models which are suitable for Data Warehousing
in most of the cases.
Star Schema A star schema is a set of tables comprised of a single,
central fact table surrounded by de-normalized dimensions. Each
dimension is represented in a single table.
Snowflake Schema If you normalize the star schema dimensions to
separate tables and link them together, you will have a snowflake
schema
Star Schema
D
D
Time
Product
F
Sales
Amount_sold
Product_ID
Customer_ID
Channel_ID
Date
D
Customer
D
Channel
Snowflake Schema
D
•
•
•
Defines hierarchies by using multiple dimension tables
Is more normalized than a single table dimension
Required if hierarchical/ level-wise details are to be
displayed on reports
Product
Difference in OLTP and OLAP
OLTP
1.
2.
3.
4.
5.
6.
Captures transactional information
necessary to run business
operations
Need performance
More DML operations
OLTP systems often use fully
normalized schemas to optimize
update/insert/delete performance
and data consistency
A typical OLTP operation accesses
only a handful of records
OLTP systems usually store data
for only a few weeks or months if
needed
OLAP
1.
2.
3.
4.
5.
6.
Analyze transaction information at
an aggregate level to improve the
decision-making process
Need flexibility and broad scope
Very rare DML operations
Denormalized or partially
denormalized (star schema) to
optimize query performance
A typical DW query scans
thousands/millions of rows.
Considerable historical data is
maintained for analysis
Data warehouse v/s Data Mart
•
•
•
•
•
Data mart - A logical subset of the complete data warehouse.
A data mart is a complete “pie-wedge” of the overall data warehouse
pie.
A data warehouse is made up of the union of all its data marts.
A data warehouse is fed from the staging area.
Every data mart must be represented by a dimensional model and,
within a single data warehouse, all such data mart must be built from
dimensions and facts.
What is ETL ?
•
•
•
•
•
Can also be called as ETVL
Extraction
Transformation
Validation
Loading
Major ETL Tools
•
•
•
•
•
•
•
Informatica
IBM Datastage
Ab-initio
Microsoft DTS package
Oracle Express
SAS
Hummingbird
Few ETL process considerations
•
•
•
•
•
Tracking dimensions – SCDs can be difficult to maintain during
incremental/delta updates of the dimension tables.
Change Data Capture – This is the ability to identify records in source
systems which have been modified since last extraction. The source system
should capture the change and ETL should use the change captured to
extract relevant records
Artificial keys – generation of keys when inserting records into tables having
artificial keys need to be taken care of
Aggregation of data – This is required for population of aggregation-level
fact tables and when more hierarchical representation of data is required at
report-level
Scheduling – Time based and conditional scheduling ease the
administration of ETL process
SCDs (Slowly Changing Dimensions)
•
•
•
•
•
It is a concept of extracting recently modified data.
Generally 3 types
Type1 : Keep most recent values in the target
Type2 : Keep full history of changes in the target
2.1 Keep version number in separate column
2.2 Mark the current dimension record with a flag
2.3 Mark the dimension records with their effective date range
Type3 : Keep the current and previous values in the target
OLAP (OnLine Analytical Processing)
•
•
•
•
ROLAP (Relational OLAP)
MOLAP (Multidimensional OLAP)
DOLAP (Desktop OLAP)
HOLAP (Hybrid OLAP)
ROLAP (Relational OLAP)
With ROLAP data
remains in the
original relational
tables, a separate set
of relational tables is
used to store and
reference
aggregation data.
ROLAP is ideal for
large databases or
legacy data that is
infrequently queried.
MOLAP (Multidimensional OLAP)
1.
High performance, multidimensional data storage format
2.
Data is stored on the OLAP server
3.
Gives the best query performance
4.
For small to medium-sized data sets
HOLAP (Hybrid OLAP)
HOLAP combines elements from MOLAP and
ROLAP. HOLAP keeps the original data in
relational tables but stores aggregations in a
multidimensional format.
DOLAP (Desktop OLAP)
•
•
Stores in the form of cubes/micro-cubes on the desktop/client machine
BusinessObjects typically stores data in this format
Major OLAP/Reporting Tools
•
•
•
•
BusinessObjects
Microstrategy
Brio
Cognos
OLAP Dimensions v/s Relational Dimensions
OLAP
Region
West
CA
OR
East
MA
NY
Relational
Region
West
East
State
CA
OR
MA
NY
Region
West
West
East
East
Defining a Cube
Atlanta
Chicago
Denver
Grapes
Cherries
Melons
Apples
Detroit
Q1
Q2
Q3
Time Dimension
Q4
Querying a Cube
Sales
Fact
Atlanta
Chicago
Denver
Grapes
Cherries
Melons
Apples
Dallas
Q1
Q2
Q3
Time Dimension
Q4
References
•
•
•
•
•
www.datawarehouse.com
Building the data warehouse – W.H.Inmon
White papers published by Informatica and BusinessObjects
www.dmreview.com
Infosys Knowledge repository - KShop
Thanks