Blueprint for Do-It-Yourself Metadata
by Bob Lambert
AUG 27, 2010 5:45pm ET
Picture this: You’re the tech lead on a large ETL project, integrating data from 17 suppliers and five internal systems
for a logistics management application. The system will enable planners to optimize the supply chain with integrated
order, shipment and production data. The team is finalizing design, having already defined business requirements,
identified business data elements, completed database design and assembled Excel source-to-target mappings for all
107 interfaces. Your project manager stops by to tell you that a new supply director wants to change five of your key
requirements, and that the database administrators redesigned one section of the model to meet new standards. She
needs an impact assessment before the steering committee meeting at 10:00 tomorrow morning.
Once development starts, data integration projects are served well by metadata tools. Major ETL vendors like
Informatica, IBM WebSphere DataStage and others include metadata solutions that document mappings and
transformations, enabling impact analysis in the event of interface changes, database design changes and data quality
problems. However, metadata associated with development tools only kicks in when development starts. A significant
part of data integration effort happens before the virtual pen meets paper to build ETL maps. Development can only
begin after the team defines requirements, designs the database and maps source data elements to their targets.
On a small integration project with just a few interfaces, this isn’t a problem. However, on a more complex project
featuring scores of interfaces, it can be a major manual task to ensure that the design meets all requirements and that
the interfaces as designed will load the database properly, and to correctly identify the redesign needed in response to
changes in the interface, data model or requirements.
This article describes a working SQL Server prototype (found here) that shows how a data integration team can build
its own custom metadata database based on commonly used data integration design artifacts and enable the impact
analysis and change management benefits that metadata provides well before ETL coding starts. Furthermore,
metadata analysis can help improve the rigor and quality of requirements, mapping and database design artifacts,
preventing defects early and, therefore, helping to improve quality and reduce cost.
The Data Integration Project Pattern
Data integration projects typically implement systems that take in interface data, store it in a database and deliver that
data in the form of outbound interfaces (see figure). Interfaces may be in various formats but often can be thought of
as files containing fields. Likewise, the database consists of tables and columns. Requirements, in our example, are
affirmative statements of business objectives, which subdivide into “data requirements,” the logical data elements
needed to fulfill each requirement.
The industry standard interface definition artifact is the “source to target document,” typically a spreadsheet relating
(for incoming interfaces) each incoming field with its destination database column and describing any transformations
that must occur in the process of delivering the incoming field.
On the typical integration project, activities like making sure different interfaces are mapped consistently, or making
sure the design loads all required database columns, or determining the impact of a design change are a matter of
manually comparing source-to-target mapping spreadsheets – not a problem on a small project with only a few
interfaces.
However, Frederick Brooks’ Mythical Man-Month concept comes into play quickly as project size increases. Brooks’
observation is that project complexity grows much faster than team size. According to Brooks,
project complexity is related more to the number of communication paths than to the number of people. There are
n(n-1)/2 communication paths among n people. So, a four-person project has six communication paths, an eight-person
project has 28, and a 25-person project has 300. By this measure, the 25-person project is more than 10 times as
complex as the eight-person project.
It is easy to see how this concept operates on a large integration project. Say there are 75 source-to-target mapping
documents, each mapping about 100 interface fields, halfway into the design phase. If a major requirements shift
results in a change to database design, then the team embarks on a major review and revision of all 75 mapping
spreadsheets, spending tense hours in review meetings to ensure the changes are applied consistently and the
project remains on schedule.
The Conceptual Integration Metadata Model
It is also easy to see how a database linking the field-to-column mappings, database tables and columns, and data
requirements could quickly identify the mappings impacted by the database change and reduce the level of effort in
identifying defects in the revised mappings. Fortunately, data integration teams tend to be rich in database skills, so
development of a metadata database should be within the practical scope of the team’s abilities.
Again, the goal is to integrate requirements, interfaces and database tables, along with their detailed elements: data
requirements, interface fields and database columns, as in the conceptual metadata model in the figure:
The model shows that each requirement implies many data requirements, each table consists of columns and each
interface consists of fields.
At a detailed level, the relationships show that:
• Each data requirement may be fulfilled by many database columns,
• Each column can potentially fulfill many data requirements,
• Each column may be the destination for load of many fields (for example, if many interface fields combine in an
equation to calculate a single column), and
• Each field may load many separate columns.
Data Integration Analysis and Design Artifacts
The core analysis documents of the data integration project are the source-to-target document, the data model and
the requirements document. These artifacts supply the data loaded into the metadata database, structured according
to our logical metamodel.
Source to Target Document
Before ETL developers can build interface processes, analysts must document which interface field loads which
column or vice versa. Source-to-target maps are typically Excel spreadsheets. Each row on the sheet represents the
relationship between one data input element and one destination element, and typically includes the following
columns:
Interface Name: The name of the external source or target data structure;
Direction: Whether the interface is incoming or outgoing from the point of view of the central database;
Target Table or File Name;
Target Data Elements: descriptions of each target column or field, including Column or Field Name, Data
Type, Length and Description;
Source Table or File Name;
Source Data Elements: descriptions of each source column or field, including Column or Field Name, Data
Type, Length and Description;
Transformations: the calculations, validations and/or conversions required to convert source data elements to
target data elements; and
Comments.
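For illustration, one hypothetical row from an inbound map might read: Interface Name SUPPLIER_ORDERS; Direction
Incoming; Target Table Order; Target Column OrderQty (int); Source File supplier_orders.csv; Source Field
QTY_ORDERED (char(10)); Transformation: trim and cast to int, rejecting non-numeric values.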
Data Model
For our purposes, the data model is the list of tables and columns in the database, and the data type and length of
each column. The data model must be complete before source-to-target mapping begins, optimally already deployed
to the database management system. If so, and if the DBMS is SQL Server, then the system views in
INFORMATION_SCHEMA provide complete application­level metadata, including table names, the columns in each
table, and data type and length for each column.
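For example, under SQL Server a query along these lines (a sketch; the schema filter is an assumption about where
the integrated tables live) returns the complete table-and-column inventory:

SELECT TABLE_NAME,
       COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'    -- assumption: the integrated tables live in dbo
ORDER BY TABLE_NAME, ORDINAL_POSITION;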
Requirements and Data Requirements
Requirements documentation tends to be less structured than source­to­target mappings and databases. But still,
data integration teams commonly maintain requirements in spreadsheets, and on most projects it is practical for data
requirements to be reasonably rigorous and therefore comparable to database columns and interface fields. The next
figure lists examples of requirements and associated data requirements from our fictional example project.
Bringing it All Together: the Data Integration Metadata Database
[Click on PDF link at the end of the article to view the Data Integration Meta-Database Model.]
This figure is a SQL Server model showing a database designed to house the contents of the project’s source-to-target
maps, the tables and columns in the database and the data requirements. The database consists of the following
tables:
Requirement: The requirement table holds one row for each data-related business requirement, and
includes the text description of the requirement along with reference and priority attributes.
DataReqmt: A data requirement is a logical data element needed to fulfill a requirement. There might be
many data requirements associated with a single requirement.
DBTable: Each DBTable row represents a table in the integrated database.
DBColumn: Each row of this table describes a column in the integrated database, with descriptive
attributes drawn from the database catalog (in our case SQL Server’s
INFORMATION_SCHEMA.COLUMNS view).
Interface: This table contains rows describing external interfaces with descriptive attributes, including an
indicator of whether the interface is incoming or outgoing.
Field: Each row of this table describes a data element delivered by the interface (or to the interface if it is
outbound). Descriptive attributes include data type, length, whether the field is required, text describing
the field’s allowable values, and a FieldGroupNameText identifying the subgroup that the field occupies in
the Interface’s format definition. The metamodel could be extended if it needs to account for more
complex interface data structures.
ReqmtMapping: This is the associative table representing the many-to-many relationship between data
requirements and integration database columns. The table includes columns identifying the matched
DataReqmt and DBColumn, as well as a text “Condition” column detailing the nature of the relationship if
the column partially or conditionally fulfills the data requirement.
InterfaceMapping: Each row of this associative table corresponds with a row on the source-to-target
spreadsheet, showing the connection between a DBColumn and an incoming or outgoing interface field.
In addition to attributes identifying each related column and field, text attributes house required
transformations and additional comments.
Each table in the metamodel includes typical audit columns enabling administrators to track inserts and updates, with
the associated user ID.
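As a concrete starting point, here is a minimal DDL sketch of that model. The article names the tables and a few of
their attributes; every column name, data type and length beyond those is an assumption, and the audit columns are
shown once on the first table and elided elsewhere for brevity:

CREATE TABLE dbo.Requirement (
    RequirementID   INT IDENTITY(1,1) PRIMARY KEY,
    ReferenceCode   VARCHAR(20)   NOT NULL,   -- hypothetical, e.g. 'REQ-014'
    PriorityCode    VARCHAR(10)   NULL,
    RequirementText VARCHAR(2000) NOT NULL,
    -- typical audit columns; every table below carries the same set (omitted)
    CreatedBy       SYSNAME  NOT NULL DEFAULT SUSER_SNAME(),
    CreatedDate     DATETIME NOT NULL DEFAULT GETDATE()
);

CREATE TABLE dbo.DataReqmt (
    DataReqmtID   INT IDENTITY(1,1) PRIMARY KEY,
    RequirementID INT NOT NULL REFERENCES dbo.Requirement (RequirementID),
    DataReqmtText VARCHAR(2000) NOT NULL
);

CREATE TABLE dbo.DBTable (
    DBTableID INT IDENTITY(1,1) PRIMARY KEY,
    TableName SYSNAME NOT NULL
);

CREATE TABLE dbo.DBColumn (
    DBColumnID   INT IDENTITY(1,1) PRIMARY KEY,
    DBTableID    INT NOT NULL REFERENCES dbo.DBTable (DBTableID),
    ColumnName   SYSNAME NOT NULL,
    DataType     VARCHAR(128) NOT NULL,
    ColumnLength INT NULL
);

CREATE TABLE dbo.Interface (
    InterfaceID   INT IDENTITY(1,1) PRIMARY KEY,
    InterfaceName VARCHAR(100) NOT NULL,
    Direction     CHAR(3) NOT NULL CHECK (Direction IN ('IN', 'OUT'))
);

CREATE TABLE dbo.Field (
    FieldID             INT IDENTITY(1,1) PRIMARY KEY,
    InterfaceID         INT NOT NULL REFERENCES dbo.Interface (InterfaceID),
    FieldName           VARCHAR(100) NOT NULL,
    DataType            VARCHAR(128) NULL,
    FieldLength         INT NULL,
    IsRequired          BIT NOT NULL DEFAULT 0,
    AllowableValuesText VARCHAR(1000) NULL,
    FieldGroupNameText  VARCHAR(100) NULL  -- subgroup within the interface format
);

-- The two associative tables carry the many-to-many relationships
CREATE TABLE dbo.ReqmtMapping (
    DataReqmtID INT NOT NULL REFERENCES dbo.DataReqmt (DataReqmtID),
    DBColumnID  INT NOT NULL REFERENCES dbo.DBColumn (DBColumnID),
    Condition   VARCHAR(1000) NULL,  -- nature of a partial/conditional fulfillment
    PRIMARY KEY (DataReqmtID, DBColumnID)
);

CREATE TABLE dbo.InterfaceMapping (
    FieldID            INT NOT NULL REFERENCES dbo.Field (FieldID),
    DBColumnID         INT NOT NULL REFERENCES dbo.DBColumn (DBColumnID),
    TransformationText VARCHAR(2000) NULL,
    CommentText        VARCHAR(2000) NULL,
    PRIMARY KEY (FieldID, DBColumnID)
);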
Loading and Using the Integration Meta­Database
Data integration teams are rich in the skills required to deploy, load and use the integration metadata database. The
contents of DBTable and DBColumn can be inserted via simple INSERT queries from
INFORMATION_SCHEMA.COLUMNS, as shown (again, assuming SQL Server – all leading DBMS offerings have
similar catalog tables). It is also a simple matter to load the source-to-target documents and requirements
spreadsheets to their respective tables using SQL Server’s Import and Export Data Wizard, which steps the user
through development of a repeatable SQL Server Integration Services job for loading the requirements and
source-to-target mapping spreadsheets.
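For example (a sketch using the hypothetical DDL above; it assumes the metadata tables share a SQL Server
database with the integrated schema in dbo, so use three-part names if they live in separate databases):

-- Load the table inventory from the database catalog
-- (in practice, exclude the metadata tables themselves or keep them apart)
INSERT INTO dbo.DBTable (TableName)
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_TYPE = 'BASE TABLE';

-- Load the columns, linking each to its parent DBTable row
INSERT INTO dbo.DBColumn (DBTableID, ColumnName, DataType, ColumnLength)
SELECT t.DBTableID, c.COLUMN_NAME, c.DATA_TYPE, c.CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS AS c
JOIN dbo.DBTable AS t
  ON t.TableName = c.TABLE_NAME
WHERE c.TABLE_SCHEMA = 'dbo';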
Once the integration metadatabase is populated, the team can begin to use basic SQL queries to answer questions
that formerly would have required tedious, error-prone spreadsheet-to-spreadsheet comparisons:
Will the defined interfaces load all expected database tables?
Does the database as designed meet all data requirements?
Which interfaces will I affect if I redesign a table?
If the quality of a given source data element is found to be poor, what requirements are at risk?
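As a sketch against the hypothetical DDL above, the first and third of these questions each reduce to a few joins:

-- Question 1: database columns that no interface field loads
SELECT t.TableName, c.ColumnName
FROM dbo.DBColumn AS c
JOIN dbo.DBTable AS t ON t.DBTableID = c.DBTableID
LEFT JOIN dbo.InterfaceMapping AS im ON im.DBColumnID = c.DBColumnID
WHERE im.DBColumnID IS NULL;

-- Question 3: interfaces touched by redesigning one table
SELECT DISTINCT i.InterfaceName, i.Direction
FROM dbo.Interface AS i
JOIN dbo.Field AS f ON f.InterfaceID = i.InterfaceID
JOIN dbo.InterfaceMapping AS im ON im.FieldID = f.FieldID
JOIN dbo.DBColumn AS c ON c.DBColumnID = im.DBColumnID
JOIN dbo.DBTable AS t ON t.DBTableID = c.DBTableID
WHERE t.TableName = 'Shipment';   -- hypothetical table being redesigned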
Using Integration Design Metadata on Your Projects
Bringing data integration requirements, data model and interfaces together into a metadata database can substantially
benefit your integration project, but as always there are a few caveats and potential surprises to watch out for.
Integration Project Benefits
We’ve already described the core benefits to quality control and change management. The team can run SQL queries
to quickly ensure coverage of requirements, ensure full population of the core database and outbound interfaces, and
assess the impact of changes to requirements, database design and interfaces. In addition, side effects help improve
the project’s likelihood of on-time delivery of a quality result:
Improved process consistency and rigor: The requirement that analysis and design artifacts be loaded
to a database means they must be consistent and correct, at least in terms of format. On large projects,
different subteams develop different requirements definitions and ETL mapping conventions. Loading
these artifacts to a database imposes a rigor on the process that otherwise requires much manual review
to enforce.
Improved quality of analysis and design deliverables: Using the metadatabase to enforce foreign key
relationships makes it impossible to add, for example, a data requirement for which there is no
requirement, or a field that has no associated interface. This ensures that the working spreadsheets have
no loose ends caused by unconnected columns, fields and data requirements.
Improved communications: Often, business and technical team members divide into separate camps
that don’t communicate as well as they should. The relationships needed in the metadata database
require business and technical participants to communicate, even if only to correctly map requirements to
columns. That communication alone substantially improves specificity of requirements and business
alignment of the database.
Applying the Metadata Prototype
As always, these benefits don’t come for free. Teams implementing a metadata solution like this one should be aware
of these considerations:
Metadata management requires resources: While metadata development, maintenance and administration are
well within the data integration team’s skill set, they are nontrivial tasks. Level of effort depends on project size and
specifics, but estimate at least the same level of effort allocated to maintenance of tracking documents on a typical
agile project.
The prototype is functional but not production-ready: Generally, teams should revise the prototype code and
spreadsheet formats to conform to site conventions and standards. Teams should especially:
Apply appropriate metadata change management: The prototype includes staging targets for the SQL
Server Integration Services jobs that load the requirements and source-to-target mapping spreadsheets.
Insert statements migrate data from staging to the metadata database, assuming they are populating the
tables from scratch. Your project may need to delete then insert the relevant rows, or update or insert
depending on whether the given spreadsheet row already exists, as in the sketch below.
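In SQL Server, that update-or-insert step might look like this (a sketch; the stg staging schema is an
assumption, and it presumes staging rows already carry resolved FieldID and DBColumnID keys):

MERGE dbo.InterfaceMapping AS tgt
USING stg.InterfaceMapping AS src
   ON tgt.FieldID = src.FieldID
  AND tgt.DBColumnID = src.DBColumnID
WHEN MATCHED THEN
    -- row already exists: refresh its descriptive attributes
    UPDATE SET tgt.TransformationText = src.TransformationText,
               tgt.CommentText        = src.CommentText
WHEN NOT MATCHED BY TARGET THEN
    -- new spreadsheet row: insert it
    INSERT (FieldID, DBColumnID, TransformationText, CommentText)
    VALUES (src.FieldID, src.DBColumnID, src.TransformationText, src.CommentText);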
Apply effective/end dating if needed: The prototype metadata tables include typical audit columns, but are
not effective dated. If your team requires, for example, the ability to reverse a project change and go back
to an earlier state, or to track the lineage of a requirement and its associated columns and fields, you
might want to add effective and end dates to each table.
Additional rigor requires additional management: Any project methodology change requires planning and
management on the part of team leadership, and requires them to make a choice between fighting off alligators and
draining the swamp. Leaders of projects using a metadata solution must demand the additional rigor required to
support the needed processes and standards, and reject work that doesn’t meet the requirements and ETL mapping
standards that the metadata solution requires. The reward is the ability to run a few queries and get the answers to
the PM in time for tomorrow’s steering committee meeting.