Build a Metadata-Driven ETL Platform by
Extending Microsoft SQL Server Integration
Services
SQL Server Technical Article
Writers: Tianying He, Mike Gudyka
Technical Reviewer: Darvey Lavender
Contributors: Eric Sullivan, John Barrows, Erik Swanson, Bob Rohde, Pankaj Sharma,
Veronica D'Souza, Shawn Archer, Adrian Hill, Paul Zangaglia, Lucas Hryniewicki,
Amaranath Dabbara
Published: March 2008
Applies To: SQL Server 2008
Summary: SQL Server 2008 Integration Services (SSIS) provides a flexible and
scalable architecture that enables high-performance data extract, transform, and load
(ETL). The Microsoft Business Intelligence Center of Excellence has extended SSIS to a
metadata-driven platform to more effectively build, deploy, and manage ETL processes
in large data warehousing environments.
Copyright
This is a preliminary document and may be changed substantially prior to final
commercial release of the software described herein.
The information contained in this document represents the current view of Microsoft
Corporation on the issues discussed as of the date of publication. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of
any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Microsoft, the furnishing of
this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
© 2008 Microsoft Corporation. All rights reserved.
Microsoft, Excel, and SQL Server are either registered trademarks or trademarks of
Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks
of their respective owners.
Table of Contents
Introduction
Challenges and Product Limitations
  Enterprise Standard
  Developer Productivity
  Data Lineage and Audit Trail
  Scale Out with Commodity Hardware
Usage Scenario of Metadata-Driven ETL
Platform Architecture
Metadata Repository
Builder
  Defining SSIS Control Flow
  Dynamically Generating SSIS Packages
Controller and Worker
  Distributed Execution
  Unified Logging
Monitor
Metadata Editor
ETL Pattern Library
Further Enhancements
  Metadata Repository Manager
  Business Rule Engine
  Data Quality
  Putting It Together
Conclusion
Introduction
Microsoft SQL Server™ 2008 Integration Services (SSIS) enables enterprise customers
to create, deploy, and manage high-performance data integration solutions. Some of
the most common scenarios of using SSIS are building data warehouses (DW) and
developing business intelligence (BI) solutions. A data warehouse is defined by Bill
Inmon as “a subject oriented, integrated, time variant, nonvolatile collection of data in
support of management decision.” Extract, transform, and load (ETL) is a crucial
process in data warehousing that involves extracting data from outside sources,
transforming it to fit business needs, and ultimately loading it into the end target,
usually the data warehouse. ETL is an important part of the process of bringing
heterogeneous and asynchronous source extracts into a homogeneous environment.
SQL Server Integration Services can pull data from a wide variety of sources including
relational databases, Microsoft Excel® files, XML files, and flat files. It also includes a
rich set of tools and components for developing and executing ETL packages. You can
create SSIS packages manually by using SQL Server Business Intelligence Development
Studio or programmatically by using SSIS APIs.
Although SSIS offers powerful capabilities for building robust ETL solutions, customers
still face many challenges when implementing large-scale data warehouses. This paper
describes how to extend SSIS to a metadata-driven platform to better address those
challenges.
Challenges and Product Limitations
The following sections discuss the challenges of developing large enterprise data
warehouse systems and the limitations of SQL Server 2008 Integration Services.
Enterprise Standard
For a large data warehousing system in an enterprise environment, it is important to
standardize ETL processes, including unified logging, checkpoint, and exception
handling. A standard describes precise behaviors or actions. It helps prevent the
proliferation of different flavors of SSIS packages, which are difficult for other
developers to understand. Standards also help to improve the
productivity of the team, the quality of the application, and the maintainability and
understandability of the system. Although creating SSIS packages based on predefined
templates is a good practice, this paper introduces a comprehensive metadata-driven
approach to enforce enterprise standards.
Developer Productivity
Developers can create and deploy SSIS packages by using SQL Server Business
Intelligence Development Studio, which offers a flexible way to define and execute ETL
tasks. SQL Server Business Intelligence Development Studio is very effective for
designing individual, simple ETL packages. For large data warehousing systems with
hundreds of packages, however, developing, testing, deploying, and maintaining the
SSIS packages for data acquisition, integration, and distribution is labor intensive
and error prone. A cost-effective alternative is to enable developers to define ETL
processes using metadata definitions without the need to worry about common tasks
such as logging and exception handling. ETL patterns are custom implementations of
data movement, transformation, integration, and quality management. Reusable ETL
patterns improve developer productivity. ETL patterns can be defined using metadata
and implemented by using custom stored procedures or managed code.
Data Lineage and Audit Trail
Enterprise customers need advanced BI capabilities, such as using data lineage to track
data integration from operational systems to BI reports. Data lineage helps to show
where the data came from and what rules were applied to data along the way. It
provides a complete view of the data lifecycle. This type of advanced BI capability
requires an integrated metadata system. While the SQL Server BI toolset is mostly
metadata-driven internally, every major component of SQL Server, from database
tables to XML files, keeps metadata in its own independent structure with its own
access methods. Metadata are kept in different formats across Microsoft SQL Server
Integration Services (SSIS), Analysis Services (SSAS), and Reporting Services (SSRS).
While you can use the Microsoft® SQL Server™ 2005 BI Metadata Sample toolkit to
analyze and view dependencies and lineage of objects, it is mostly used for impact
analysis.
Scale Out with Commodity Hardware
By design, the SQL Server Integration Services pipeline is almost exclusively in
memory. The potential disadvantage of this is that for large amounts of data and
complicated transformations you need a large amount of memory. SSIS does not
support Address Windowing Extensions (AWE). Scaling out to multiple packages across
processes is the only way to take advantage of larger amounts of memory. For very
large memory requirements, consider using 64-bit systems. High-end computers can be
expensive and customers are looking for alternatives with commodity hardware.
Instead of scaling up, scaling out by using distributed processing on commodity
hardware can be a cost-effective option. Distributed processing not only improves the
scalability of the system, but it also increases the reliability of the system by removing
single points of failure.
To address these challenges, this paper presents the design of a metadata-driven
ETL platform that focuses on optimizing the acquisition, integration, and
distribution needs of enterprise data warehouses. The platform is
complementary to SQL Server Integration Services. It helps reduce the total
cost of ownership of large enterprise data warehouse systems and BI solutions.
Usage Scenario of Metadata-Driven ETL
To discuss metadata-driven ETL, we must first understand what metadata is. In short,
metadata is data about data. Metadata is used to add context for the data or hide
complexity from users who do not need to know or understand the details of the data.
Metadata is classified by usage as technical metadata and business metadata. In the
ETL process, developers primarily deal with technical metadata.
Following is a usage scenario describing a metadata-driven ETL development process.
In the scenario, developers extract data from a relational data source to a relational
database. Data transformation during the data movement is not included. The basic
flow of the extraction process is as follows:
1. The ETL developer defines the source and destination of the extraction process,
including servers, relational databases, and tables.
2. For each table, the schema (columns, index, and constraints) can either be
automatically retrieved by the system or manually specified by the ETL developer.
3. The ETL developer specifies the mapping between source and destination at the
database, table, and column level.
4. The ETL developer selects either a full or delta load.
5. The ETL developer configures an orchestration process. For each table, ETL
developers define one or multiple steps for performing the extraction.
6. The ETL developer specifies how ETL packages should be executed—at a scheduled
time or on demand.
7. The system dynamically generates one or more SSIS packages.
8. The system deploys the SSIS packages to a distributed execution environment.
9. The system executes ETL jobs and captures the job status.
The outcomes of the usage scenario are:

Metadata defining an end-to-end extraction process is captured and stored in a
metadata repository.

SSIS packages are executed. Data is moved from source systems to destinations.

The ETL job status is captured and logged.
Typical extraction-related metadata includes:

System environmental parameters, such as server name, data source name, folder
location, and connection parameters

List of tables to be extracted

Columns to be extracted for each table

Delta detection of each table

Source-to-destination mapping for data movement

ETL processes and job schedules

Number of retries in case of failure

Logging parameters
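The metadata items listed above can be pictured as a single definition record. The following is a minimal sketch in Python, not the platform's actual schema: every class name, field name, and default value is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableExtract:
    """One table to extract. All field names here are illustrative assumptions."""
    source_table: str
    destination_table: str
    columns: dict                        # source column -> destination column
    load_type: str = "full"              # "full" or "delta"
    delta_column: Optional[str] = None   # column used to detect changes


@dataclass
class ExtractionDefinition:
    """Metadata describing one end-to-end extraction process."""
    source_server: str
    source_database: str
    destination_server: str
    destination_database: str
    tables: list = field(default_factory=list)
    schedule: str = "on demand"
    max_retries: int = 3
    log_level: str = "INFO"

    def validate(self):
        """Return a list of problems; an empty list means the definition is usable."""
        problems = []
        for t in self.tables:
            if t.load_type not in ("full", "delta"):
                problems.append(f"{t.source_table}: unknown load type {t.load_type!r}")
            if t.load_type == "delta" and not t.delta_column:
                problems.append(f"{t.source_table}: delta load needs a delta column")
            if not t.columns:
                problems.append(f"{t.source_table}: no columns mapped")
        return problems


# Hypothetical server, database, and table names.
definition = ExtractionDefinition(
    "src01", "Sales", "dw01", "SalesDW",
    tables=[TableExtract("dbo.Orders", "stg.Orders",
                         {"OrderID": "OrderID"}, load_type="delta")],
)
```

Validating a definition before any package is generated is one way to catch incomplete metadata early; here the sample definition is flagged because a delta load was requested without a delta column.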
This use case provides a high-level description of a data extraction process to
illustrate the design of the platform. Delta load is the process whereby only changes in
the source table(s) are loaded to the warehouse database. There are many other use
cases for data extraction, transformation, and load. The following sections demonstrate
how to extend SSIS to support a metadata-driven ETL approach.
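The paper does not state how delta detection is implemented. One common approach, shown here purely as an assumption, is a high-water mark: remember the largest change timestamp already loaded and select only rows beyond it. The column name ModifiedDate and the sample rows are hypothetical.

```python
from datetime import datetime


def select_delta(rows, delta_column, last_loaded):
    """Return the rows changed after last_loaded and the new high-water mark."""
    changed = [r for r in rows if r[delta_column] > last_loaded]
    # If nothing changed, the mark stays where it was.
    new_mark = max((r[delta_column] for r in changed), default=last_loaded)
    return changed, new_mark


source_rows = [
    {"OrderID": 1, "ModifiedDate": datetime(2008, 3, 1)},
    {"OrderID": 2, "ModifiedDate": datetime(2008, 3, 5)},
    {"OrderID": 3, "ModifiedDate": datetime(2008, 3, 9)},
]

# Only rows modified after March 2 are picked up by this load.
changed, mark = select_delta(source_rows, "ModifiedDate", datetime(2008, 3, 2))
```

Storing the returned mark in the metadata repository would let the next run start where this one stopped.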
Platform Architecture
The design goals of the platform include improving the productivity of developers,
enforcing ETL standards, supporting a cost-effective way to deploy large data
warehouses on commodity hardware, and providing a centralized metadata repository for
lineage tracking. The solution architecture of the platform is shown in Figure 1. The
intent of this paper is not to document all the implementation details. Rather, it
describes the concepts and key components of the platform and how they are
connected with SQL Server Integration Services. For a more detailed architecture
diagram, see Figure 6 at the end of this paper.
Figure 1: Platform architecture
The key components of the platform include:

Metadata repository. Stores the ETL definitions, which describe the data sources,
destinations, data mappings, data transformations, and orchestration processes

Metadata editor. Used for managing metadata via a graphical user interface

Builder. Generates SSIS packages based on metadata stored in the metadata
repository

Controller/worker runtime environment. Consists of a controller and a number
of workers to manage the distributed execution of SSIS packages and unified
logging

Logging repository. Stores status data for building and executing SSIS packages

Monitor. Used for monitoring and reporting system status
The following sections describe each component in detail. With an open architecture
centered on the metadata repository, the platform can be further extended.
Metadata Repository
The metadata repository is used to store technical and business metadata, including but
not limited to data source and destination definitions, data movement and pattern
definitions, and orchestration process definitions. It can be further extended to include
business rule definitions, data quality metric definitions, and so on. Figure 2 shows a
sample metadata model, which is based on the book Universal Meta Data Models by David
Marco. Note that this model is for illustration purposes only and does
not necessarily reflect the actual data model we implemented. In the example:
Data package defines the data source and destination entities to be used in the ETL
process. Data packages can be implemented as a hierarchy and include data groups
and data elements. For relational data sources, a database is a type of data package, a
table is a type of data group, and a column is a type of data element.
Data movement defines the mapping between the source and destination. The mapping
can be done at multiple levels. For relational databases, it can be at the database,
table, and column level.
Data transformation defines the transformation rules that will be applied to the data
movement. Transformation rules can be implemented using stored procedures and
managed code. Reusable code can be saved as ETL patterns.
In addition, other metadata is required by the platform. For example, data
orchestration defines how the ETL jobs will be executed and the precedence of tasks.
Figure 2: Sample metadata model
The focus of this paper is not the details of the metadata model. What is important is
that the metadata repository is the hub of the platform.
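To make the package, group, and element hierarchy and the data movement mapping concrete, the following Python sketch models them as plain classes. It is an illustration under assumptions, not the implemented repository schema: the class layout, the find and unresolved helpers, and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:      # for a relational source: a column
    name: str
    data_type: str


@dataclass
class DataGroup:        # for a relational source: a table
    name: str
    elements: list = field(default_factory=list)


@dataclass
class DataPackage:      # for a relational source: a database
    name: str
    groups: list = field(default_factory=list)

    def find(self, group_name, element_name):
        """Return the matching element, or None if it does not exist."""
        for group in self.groups:
            if group.name == group_name:
                for element in group.elements:
                    if element.name == element_name:
                        return element
        return None


@dataclass
class DataMovement:
    source: DataPackage
    destination: DataPackage
    # (source group, source element) -> (destination group, destination element)
    column_map: dict = field(default_factory=dict)

    def unresolved(self):
        """List mappings whose source or destination column is not defined."""
        return [
            (src, dst) for src, dst in self.column_map.items()
            if self.source.find(*src) is None or self.destination.find(*dst) is None
        ]


sales = DataPackage("Sales", [DataGroup("Orders", [DataElement("OrderID", "int")])])
warehouse = DataPackage("SalesDW", [DataGroup("FactOrders", [DataElement("OrderKey", "int")])])
movement = DataMovement(sales, warehouse, {
    ("Orders", "OrderID"): ("FactOrders", "OrderKey"),
    ("Orders", "Amount"): ("FactOrders", "Amount"),   # neither column is defined
})
```

A check such as unresolved() shows why a central repository is useful: a mapping that refers to a missing column can be reported before any package is built.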
Builder
The builder is designed to automatically generate SSIS packages and instances of
custom code based on metadata definitions. Before the release of SSIS, many
organizations developed their own custom code to perform ETL. The platform makes it
possible for organizations to leverage their existing investments in data integration and
take full advantage of features such as unified logging and distributed execution.
Standards and best practices are enforced by the builder.
Defining SSIS Control Flow
Control flow is a key component of SQL Server Integration Services. This section
explains how to define control flows by using a predefined template. SSIS provides
powerful and flexible tools out of the box. While flexibility is important for some
customers, other customers may place more emphasis on standardized processes. For
enterprise customers, enforcing standards not only improves the productivity of
developers, but also helps reduce maintenance and operational costs. Instead of using
SQL Server Business Intelligence Development Studio to design SSIS control flows, we
provided an orchestration template as an abstraction layer. With this template, we can
enforce standards and apply best practices behind the scenes. This is especially useful for
large data warehousing development.
In the orchestration template, ETL processes are defined within a simple hierarchy,
consisting of systems, processes, and tasks. A task is represented as a unit of running
code, such as the execution of a stored procedure or an SSIS data flow task. Processes
are simply a means of defining groups of tasks. The hierarchy can be described in
outline form:
System
  SystemVariable
  SystemConnection
  Process
    ProcessPrecedent
    Task
      TaskPrecedent
A system is analogous to an SSIS package, a process is analogous to an SSIS sequence
container, and a task is analogous to an SSIS task object. Figure 3 shows an example
of how an ETL package can be defined by using the orchestration template.
Figure 3: Defining SSIS control flow by using a template
By default, all tasks and processes run in parallel. Precedents can be set either at the
task or process level. In Figure 3, the Load Dimension task must complete before the
Load Sales Facts task can start, and all Extract Sales Data tasks must complete before
any Load Sales Data tasks can start. Precedents can be qualified with an expression, by
using SSIS expression syntax. ETL processes that are defined by using the
orchestration template are stored in the metadata repository and used by the builder to
generate SSIS packages.
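The precedence rules described above amount to a dependency graph: everything runs in parallel unless a precedent says otherwise. The Python sketch below, which is an illustration and not the builder's code, expands process-level precedents to their tasks and groups the tasks into waves that can run in parallel. It uses the standard library graphlib module (Python 3.9 or later). The two extract task names are hypothetical; the other names follow the Figure 3 description.

```python
from graphlib import TopologicalSorter

# Process -> tasks. "Extract Orders" and "Extract Customers" are made-up names.
processes = {
    "Extract Sales Data": ["Extract Orders", "Extract Customers"],
    "Load Sales Data": ["Load Dimension", "Load Sales Facts"],
}
process_precedents = {"Load Sales Data": ["Extract Sales Data"]}
task_precedents = {"Load Sales Facts": ["Load Dimension"]}


def execution_waves(processes, process_precedents, task_precedents):
    """Group tasks into waves; tasks in the same wave have no unmet precedents."""
    graph = {task: set() for tasks in processes.values() for task in tasks}
    for task, preds in task_precedents.items():
        graph[task].update(preds)
    # A process-level precedent means every task of the earlier process
    # must finish before any task of the later process starts.
    for proc, earlier_procs in process_precedents.items():
        for earlier in earlier_procs:
            for task in processes[proc]:
                graph[task].update(processes[earlier])
    sorter = TopologicalSorter(graph)
    sorter.prepare()                 # raises CycleError on circular precedents
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(processes, process_precedents, task_precedents)
```

With this input, both extract tasks form the first wave, Load Dimension the second, and Load Sales Facts the third. Expression-qualified precedents are not modeled here.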
Dynamically Generating SSIS Packages
SSIS packages can be created programmatically based on defined metadata. The
following SSIS namespace and objects are used to generate the package:

Microsoft.SqlServer.Dts.Runtime
  Application
  Connections
  ConnectionManager
  DtsContainer
  Executable
  LogProviderBase
  Package
  PrecedenceConstraint
  Sequence
  Task
  TaskHost
  Variable
Microsoft.SqlServer.Dts.Tasks.ExecuteSQLTask
On the Microsoft Developer Network (MSDN), you can find documentation and sample
code that shows how to programmatically create SSIS packages. For more information,
see Building Packages Programmatically.
Figure 4 shows a portion of the SSIS package generated for the example described in
Figure 3. Because even in simple scenarios an SSIS package with proper exception
handling and logging can be complex, it would not be easy to manually create the
package from scratch by using SQL Server Business Intelligence Development Studio.
Generating SSIS packages based on metadata definitions automates numerous
repetitive housekeeping tasks, reduces the risk of errors, and enhances developer
productivity.
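The idea of generating housekeeping automatically can be sketched without the SSIS object model. The Python function below is a language-neutral illustration and does not call the SSIS APIs: it turns task metadata into a package description in which every task is wrapped with the same logging and error-handling steps. The dictionary layout, event names, and stored procedure name are assumptions.

```python
def build_package(system_name, tasks, max_retries=3):
    """Return a package description with standard housekeeping around each task."""
    package = {"name": system_name, "executables": []}
    for task in tasks:
        package["executables"].append({
            "name": task["name"],
            "steps": [
                {"type": "log", "event": "TaskStarted"},
                {"type": task["type"], "command": task["command"]},
                {"type": "log", "event": "TaskCompleted"},
            ],
            # The same failure handling is attached to every task.
            "on_error": {"type": "log", "event": "TaskFailed"},
            "max_retries": max_retries,
        })
    return package


package = build_package(
    "SalesDW",
    [{"name": "Load Dimension", "type": "sql",
      "command": "EXEC dbo.LoadDimension"}],   # hypothetical procedure name
)
```

Because the wrapping lives in one function, a change to the logging standard is made once and applies to every generated package, which is the point of enforcing standards in the builder.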
Figure 4: Example of an SSIS control flow diagram generated by the builder
At run time, the builder employs the orchestration template to dynamically create SSIS
packages from the metadata specification and deploy them to the target worker servers.
The next section explains the distributed execution of ETL packages based on a
controller and worker architecture.
Controller and Worker
The run-time environment of the platform is based on a controller/worker architecture.
The following sections provide more detail on distributed execution of SSIS packages
and unified logging by using controllers and workers.
Distributed Execution
To support distributed execution, ETL operations can occur on any of
many worker servers, while process-defining metadata resides on a centralized
controller server. The package executed on the worker server sends progress reports
back to the controller and writes events to the standard logging system. The controller
and worker architecture is shown in Figure 5.
Figure 5: Controller and worker architecture
The controller server hosts a central metadata database. It also hosts the components
to create and deploy SSIS packages, including the builder. The worker server hosts the
SSIS packages and client logging components. While the system supports distributed
processing, it is not mandated by the platform. Both controller and worker components
can be installed and run on the same server.
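The controller and worker roles can be illustrated on a single machine. In the Python sketch below, threads stand in for worker servers and a function stands in for running a deployed package; the round-robin assignment is an assumption, because the paper does not describe how packages are allocated to workers. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_package(worker, package):
    """Stand-in for executing a deployed package on a worker server."""
    return {"worker": worker, "package": package, "status": "Succeeded"}


def dispatch(packages, workers):
    """Controller: assign packages round-robin and collect status reports."""
    reports = []
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [
            pool.submit(run_package, workers[i % len(workers)], pkg)
            for i, pkg in enumerate(packages)
        ]
        for future in as_completed(futures):
            reports.append(future.result())
    # Sort so the result does not depend on completion order.
    return sorted(reports, key=lambda r: r["package"])


reports = dispatch(["ExtractSales", "LoadSales", "LoadDim"], ["worker1", "worker2"])
```

Passing a single worker name to dispatch corresponds to the case in which the controller and worker components run on the same server.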
Unified Logging
Logging in this platform consists of a client and a central component. A client logging
component runs on the worker server to collect and manage logging events. This
component uses SQL Server Service Broker to define a logging conversation and send
messages from the worker server to the centralized logging repository. The ETL process
produces common output that provides the ability to review the status of ETL jobs. In
addition, an SSIS log provider and a logging interface for custom code are provided to
enable all messages, including the SSIS log, platform run-time messages, and custom
code events, to be written to the same logging stream.
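The client and central split can be sketched with the Python standard library. This is an analogy and not the platform's implementation: an in-process queue stands in for the Service Broker conversation, a list stands in for the logging repository, and the logger names are hypothetical.

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue()     # stands in for the messaging channel
central_records = []          # stands in for the central logging repository


class RepositoryHandler(logging.Handler):
    """Central component: stores every record it receives."""
    def emit(self, record):
        central_records.append(
            (record.name, record.levelname, record.getMessage())
        )


listener = logging.handlers.QueueListener(log_queue, RepositoryHandler())
listener.start()


def client_logger(source):
    """Client component: a logger that forwards its events to the queue."""
    logger = logging.getLogger(source)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


# Three different sources write to the same stream.
client_logger("ssis.package").info("Package started")
client_logger("platform.runtime").info("Worker assigned")
client_logger("custom.code").warning("Row rejected")

listener.stop()   # processes everything still queued, then stops
```

The design point carried over from the paper is that SSIS log events, platform run-time messages, and custom code events all end up in one place with one record shape.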
Monitor
The monitor provides a consistent and user-friendly interface for reporting on system
status. The tool can report the current status and historical statistics in an easily
consumable format. The monitoring tool improves the customer experience and saves
debugging and troubleshooting time. It can be further
enhanced with proactive notification and integration with Microsoft System Center
Operations Manager (formerly named Microsoft Operations Manager).
Metadata Editor
Currently, developers use templates stored in a Microsoft Excel file for data entry. An
import wizard is used to load metadata definitions from Excel worksheets to a central
metadata repository. Further improvements to the ETL platform may include tools to
receive metadata (tables, columns, and relationships) from data modeling tools. A
metadata editor to capture and maintain metadata definitions for ETL is planned but
not yet implemented. It will provide a friendly user interface for creating, retrieving,
updating, and deleting metadata.
Note: Not all metadata must be captured manually. The metadata editor should be
capable of performing scans and discovering source database schemas in order to
build initial target schema as well as to uncover source system changes that may
impact ETL jobs.
ETL Pattern Library
Recent studies show that almost half of all ETL implementations use hand-coded
extracts for some or all of the work. When we look at data integration between
operational systems for migrations and consolidations or for real-time ETL, custom
coding is even more common. It would be valuable for organizations to leverage their
existing investments in custom code. In this platform, extended from SSIS, developers
can take full advantage of features such as logging, debugging, and BI integration by
wrapping the custom code as re-usable ETL patterns. An ETL pattern can be
implemented by using SQL stored procedures or managed code by using .NET. The
platform already includes many commonly used ETL patterns, such as extract, slowly
changing dimensions, and Change Data Capture APIs. It enables ETL developers to
extend the pattern library with their own reusable code.
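A pattern library can be thought of as a registry of named, reusable routines. The Python sketch below illustrates that idea only; the platform itself implements patterns as stored procedures or .NET code, and the registry, decorator, and the two sample patterns here are assumptions.

```python
ETL_PATTERNS = {}


def etl_pattern(name):
    """Register a function in the pattern library under the given name."""
    def register(func):
        ETL_PATTERNS[name] = func
        return func
    return register


@etl_pattern("extract")
def extract(rows, columns):
    """Copy only the listed columns from each source row."""
    return [{c: row[c] for c in columns} for row in rows]


@etl_pattern("scd_type1")
def scd_type1(dimension, incoming, key):
    """Type 1 slowly changing dimension: overwrite attributes, keep no history."""
    merged = {row[key]: dict(row) for row in dimension}
    for row in incoming:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


def run_pattern(name, *args, **kwargs):
    """Look up a pattern by the name stored in metadata and run it."""
    return ETL_PATTERNS[name](*args, **kwargs)


dimension = [{"id": 1, "city": "Oslo"}]
incoming = [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Rome"}]
result = run_pattern("scd_type1", dimension, incoming, key="id")
```

Because patterns are looked up by name, the metadata only needs to store the pattern name and its parameters, and developers can add their own patterns with the same decorator.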
Further Enhancements
The ETL platform can be further extended to include the features described in the
sections below.
Metadata Repository Manager
Because of the increased complexity of the metadata repository, the platform will have
a graphical user interface (GUI) to support administrative features. For example, the
GUI can be used to configure repository security, allowing specific user groups to access
repository data. The repository manager should be able to provide commonly used
administrative functionalities.
Business Rule Engine
Instead of implementing business rules in custom code for data movement,
transformation, and quality management, business rules can be stored in a metadata
repository and consumed by a business rule engine. This will reinforce corporate
standards, improve data quality, and support the auditing and tracking of data
lineage.
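As a rough illustration of rules held as metadata rather than in custom code, the Python sketch below stores each rule as a record with a column, an operator, and a value, and a small engine evaluates them. The rule format and the rule names are assumptions, not a description of the planned engine.

```python
import operator

OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

# Rules as they might be stored in a metadata repository (hypothetical).
rules = [
    {"name": "PositiveQuantity", "column": "Quantity", "op": ">", "value": 0},
    {"name": "KnownRegion", "column": "Region", "op": "!=", "value": None},
]


def evaluate(row, rules):
    """Return the names of the rules that the row violates."""
    violations = []
    for rule in rules:
        value = row.get(rule["column"])
        try:
            passed = OPERATORS[rule["op"]](value, rule["value"])
        except TypeError:          # for example, comparing None with a number
            passed = False
        if not passed:
            violations.append(rule["name"])
    return violations
```

Recording which named rule a row violated is what makes the result usable for auditing and lineage: the rule name can be logged alongside the row.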
Data Quality
Data quality metrics will be used to measure the completeness, validity, consistency,
and accuracy of data. For example, data quality metrics can ensure that nonconforming data is flagged as an error, loaded to an error staging area, and not loaded
to the data warehouse until corrected. A successful data quality management strategy
involves three key tasks: profiling, cleansing, and auditing. The metadata repository
can be further extended to store metadata about data quality.
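The routing described above, in which nonconforming rows go to an error staging area instead of the warehouse, can be sketched as follows. This Python example is an illustration under assumptions: the column names, the simple e-mail check, and the definition of completeness as the fraction of rows with all required columns present are not taken from the platform.

```python
def is_blank(value):
    return value is None or value == ""


def split_by_quality(rows, required, validators):
    """Return (rows to load, rows for the error staging area with reasons)."""
    clean, errors = [], []
    for row in rows:
        problems = [f"missing {col}" for col in required if is_blank(row.get(col))]
        problems += [
            f"invalid {col}"
            for col, check in validators.items()
            if not is_blank(row.get(col)) and not check(row[col])
        ]
        if problems:
            errors.append({"row": row, "problems": problems})
        else:
            clean.append(row)
    return clean, errors


def completeness(rows, required):
    """Fraction of rows in which every required column has a value."""
    if not rows:
        return 1.0
    complete = sum(all(not is_blank(r.get(c)) for c in required) for r in rows)
    return complete / len(rows)


rows = [
    {"CustomerID": 1, "Email": "a@example.com"},
    {"CustomerID": 2, "Email": "not-an-email"},
    {"CustomerID": None, "Email": "b@example.com"},
]
required = ["CustomerID", "Email"]
validators = {"Email": lambda v: "@" in v}

clean, errors = split_by_quality(rows, required, validators)
score = completeness(rows, required)
```

Only the first row is loaded; the other two are held back with the reason attached, so they can be corrected and reprocessed.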
Putting It Together
A proposed system architecture, with future enhancements, is shown in Figure 6. It
highlights major components of the platform as well as how the platform is integrated
with SSIS.
[Figure 6 diagram not reproduced. Labeled components include: Metadata Repository, Metadata Editor, Metadata Repository Manager, Designer, Builder, SSIS Package Generator, ETL Pattern Instance Generator, Package Deployer, ETL Pattern Library, BizRule Engine, Controller/Worker, Load Balancer, SSIS Runtime, SSIS Data Flow Components, SSIS Adapter, Custom Data Adapters, Logger, Logging Repository, Monitor, Real Time Monitor, and Analyzer.]
Figure 6: Proposed future system architecture
Besides generating SSIS packages dynamically by using metadata definitions, the
platform also enables SSIS packages to be executed in a distributed environment with
consistent logging. It also includes reusable code and best practices implemented as
ETL patterns. Those patterns complement SSIS data flow components for common ETL
operations. A set of standardized APIs will be provided for accessing the metadata and
logging repositories.
By extending SSIS to support metadata-driven ETL, the platform helps to address many
of the challenges encountered when building large data warehouse systems and BI
solutions.
Conclusion
Microsoft SQL Server 2008 Integration Services (SSIS) offers strong capabilities for
high-performance ETL and is a cost-effective product for developing enterprise data warehouse
solutions. By standardizing ETL processes, improving developer productivity, and
reducing operational cost, the metadata-driven ETL platform built on top of SSIS
described in this paper enables enterprise customers to reduce the total cost of
ownership of data warehousing systems.
For more information:
SQL Server Web site
SQL Server TechCenter on Microsoft TechNet
SQL Server DevCenter on MSDN