Finding and Fixing
Data Quality Problems
NEARC Fall, 2010
Newport, RI
Brian Hebert, Solutions Architect
www.scribekey.com
Goal: Help You Improve Your Data
• Provide a definition of Data Quality
• Consider Data Quality within the context of several
data integration scenarios
• Suggest a framework and workflow for improving
Data Quality
• Review tools and techniques, independent of
specific products and platforms
• Help you plan and execute a Data Quality
improvement project or program
• Review take-aways and Q&A
Essential Data Quality Components
Meaning
Structure
Contents
Data is well understood, well structured, and fully populated
with the right values FOR END USE.
Note: These fundamental elements of data quality overlap.
Data Quality (DQ) Defined
Meaning: Names and definitions of all layers and attributes
are fully understood and clear to the end-user community
(a.k.a. semantics).
Structure: The appropriate database design is used
including attribute data types, lengths, formats, domains
(lookup tables), and relationships.
Contents: The actual data contents are fully populated with
valid values and match meaning and structure.
Metadata: Meaning, Structure, Contents described in Data
Dictionary or a similar metadata artifact.
Scenarios: DQ Improvement as Data Integration
1) You want to improve the data quality in a stand-alone,
independent dataset. Some aspect of meaning, structure, or
contents can be improved. (Source -> Target)
2) You want to combine multiple disparate datasets into a
single representation: departments, organizations, systems,
or functions are merging or need to share information.
(Source1 + Source2 -> Target)
For both cases, many of the same tools and techniques can be
used. In fact, in a divide-and-conquer approach, it is often
beneficial to start with scenario 1.
Typical Data Quality/Integration Situations
Data is in different formats, schemas, and versions, but
provides some of the same information. Examples:
• You need to clean up a single existing dataset
• 2 departments in utility company: Customer billing and
outage management, central db and field operations
• Merging 2 separate databases/systems: getting town
CAMA data into GIS
• Consolidating N datasets: MassGIS Parcel Database, CDC
Disease Records from individual states
• 2 city/state/federal organizations: Transportation and
Emergency Management need common view
• Preparing for Enterprise Application Integration:
wrapping legacy systems in XML web services
Scenario 1 Case Study:
Cleaning Up Facility Data
• Organization maintains information on facility
assets.
• The information includes data describing basic
location, facility type, size, function, and contact
information.
• Organization needs decision support database.
• Data has some quality issues.
• Case is somewhat generic, could apply to
buildings, complexes, sub-stations, exchange
centers, industrial plants, etc.
• Idea: you will likely recognize some of these issues in your own data.
Solution: Workflow Framework and Foundation: Integration Support Database and Ordered Tasks
[Workflow diagram: datasets A, B, C move through Inventory -> Collection -> Data Profiling -> Standardize & Map/Gap -> Schema Generation -> ETL -> Validation -> Applications, with an Integration Support DB feeding a Central RDB. Operations are iterative.]
Solution Support: Ordered Workflow Steps
• Inventory: What do we have, where, who, etc.?
• Collection: Get some samples.
• Data Profiling and Assessment: Capture and analyze the
meaning, structure, and content of source(s)
• Schema Generation: One or several steps to determine what
target data should look like.
• Schema Gap and Map: What are the differences? A description of
how we get from A to B.
• ETL: Physical implementation of getting from A to B: code, SQL,
scripts, etc.
• Validation: Do results match goals?
• Applications: Test data through applications, aggregations.
• Repeat Processing for Updates: Swap in a newer version of
data source A.
Inventory: The Dublin Core (+)
NUM  ELEMENT      DEFINITION
1    Contributor  An entity responsible for making contributions to the resource.
2    Coverage     The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
3    Creator      An entity primarily responsible for making the resource.
4    Date         A point or period of time associated with an event in the lifecycle of the resource.
5    Description  An account of the resource.
6    Format       The file format, physical medium, or dimensions of the resource.
7    Identifier   An unambiguous reference to the resource within a given context.
8    Language     A language of the resource.
9    Publisher    An entity responsible for making the resource available.
10   Relation     A related resource.
11   Rights       Information about rights held in and over the resource.
12   Source       The resource from which the described resource is derived.
13   Subject      The topic of the resource.
14   Title        A name given to the resource.
15   Type         The nature or genre of the resource.
http://dublincore.org/documents/dces
Question: How do you capture information on existing data?
Multiple Data Description Sources for Inventory
[Diagram: the INVENTORY draws on multiple sources: website, documentation, metadata, email, people/SMEs, and the data itself.]
Gather information about data from a variety of sources.
The Data Profile: Meaning, Structure, Contents
The Table Profile is helpful for getting a good overall idea of what's in a database.

NUM  ELEMENT        DEFINITION
1    DatasetId      A unique identifier for the dataset
2    DatabaseName   The name of the source database
3    TableName      The name of the source database table
4    RecordCount    The number of records in the table
5    ColumnCount    The number of columns in the table
6    NumberOfNulls  The number of null values in the table

The Column Profile is helpful for getting a detailed understanding of database structure and contents.

NUM  ELEMENT          DEFINITION
1    DatasetId        A unique identifier for the dataset
2    DatabaseName     The name of the database
3    TableName        The name of the database table
4    ColumnName       The name of the data column
5    DataType         The data type of the column
6    MaxLength        The max length of the column
7    DistinctValues   The number of distinct values used in the column
8    PercentDistinct  The percentage of distinct values used in the column
9    SampleValues     A sampling of data values used in the column
10   MinLengthValue   The minimum length data value
11   MaxLengthValue   The maximum length data value
12   MinValue         The minimum value
13   MaxValue         The maximum value
How Data Profiling Works
Data Profiling (and Metadata Import)
[Diagram: source datasets (Roads with FGDC XML metadata, Parcels, Buildings with no metadata) feed the Data Profiler and an XML Metadata Import, which populate the Integration Support DB and Data Dictionary; end users supply knowledge where metadata is missing.]
The profiler is an application that reads through data and gets
names, structure, contents, patterns, summary statistics. You
can also learn about data through documentation and end users
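As a minimal sketch of the kind of statistics a profiler gathers (not the profiler used here; the FACILITY table and TYPE column are hypothetical), plain SQL can produce one column-profile row:

-- Record count, distinct count, null count, and min/max lengths for one column.
-- COUNT(DISTINCT ...) and IIF are not available in every SQL dialect; adjust as needed.
SELECT
    COUNT(*)                        AS RecordCount,
    COUNT(DISTINCT [TYPE])          AS DistinctValues,
    SUM(IIF([TYPE] IS NULL, 1, 0))  AS NumberOfNulls,
    MIN(LEN([TYPE]))                AS MinLengthValue,
    MAX(LEN([TYPE]))                AS MaxLengthValue
FROM FACILITY;

A profiler simply runs this kind of query for every column of every table and stores the results in the Integration Support DB.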
Data Profiling: Table Information
Database  Name      Records  Columns  Values  Nulls  Complete  Column Names
Acme      FACILITY  53       12       636     19     97.01     ID, TYPE, FIRST_NAME, LAST_NAME, POS, ADDRESS, TOWN_STATE, ZIP, State, Tel, SECTION, AREA, INSPECT
Techno    OFFICE    14       12       168     0      100       ID, FacilId, Num, Name, Address, Town, AreaCode, Tel, Job, Dept, Value
• The Table Profile gives a good overview of record counts,
number of columns, nulls, general completeness, and a list of
column names.
• Very helpful for quickly getting an idea of what’s in a database
and comparing sets of data which need to be integrated.
Data Profiling: Column Information
[Example Column Profile rows for the Acme Facility table (columns ADDRESS, TYPE, and FIRST_NAME): for each column the profile records the data type (String), max length (50), sample values (e.g., '10 STUART PL.', '100 MAIN STREET', '125 Stratford Pl' for ADDRESS; 'Corporate', 'Finance', 'Maintenance', 'HR' for TYPE), the number and percent of distinct values, the null count, percent complete, and min/max values.]
• The Columns Profile provides detailed information on the
structure and contents of the fields in the database.
• It provides the foundation for much of the integration work
that follows.
Data Profiling: Domain Information
• Domains provide one of the most essential
data quality control elements.
• List Domains are lists of valid values for
given database columns, e.g., standard
state abbreviations MA, CT, RI, etc. for State
• Range domains provide minimum and
maximum values, primarily used for
numeric data types.
• Many domains can be discovered and/or
generated from the profile (see the SQL sketch below).
Example: distinct values found in a POS (position) column, showing the inconsistent entries a domain would standardize:
Accountant, admin, Administration, ARCH, Architect, CEO, CFO, Consultant, CTO, Data Collection, Data Entry, Database, DBA, Developer, Dismissed, Lead Developer, Lead programmer, Database Admin
Question: Do you use profiling to get a concise
summary of data details?
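A hedged sketch of how a candidate list domain can be pulled out of existing data (the FACILITY table and POS column are hypothetical):

-- List every value currently used in the column, with its frequency,
-- as raw material for building a clean list domain.
SELECT POS, COUNT(*) AS UseCount
FROM FACILITY
GROUP BY POS
ORDER BY COUNT(*) DESC;

The resulting value list can then be reviewed, standardized, and stored as the master domain.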
Workshop Exercise: Example Profile
Workshop Exercise: Column Profile Analysis
The profile itself is generated automatically. The real value is in
the results of the analysis: what needs to be done to the data?
1) Are the column name and definition clear?
2) Are there other columns in other tables with the same kind of
data and intent, with a better name?
3) Is the data type appropriate? For example, numeric data and
dates can often be found in text fields. Is the length appropriate?
4) Is this a unique primary key? Does the number of distinct values
equal the number of records? Is it a foreign key? (See the SQL sketch below.)
5) Is this column being used? Is it empty? How many null values
are there? What percent complete is the value set?
6) Is there a rule which can be used to validate data in this column,
such as:
• a List Domain or Range Domain
• a specific format (regular expression)
• other rules, possibly involving other columns
7) Does the data indicate that the column should be split into
another table using a 1->N parent-child relationship?
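A minimal sketch of the primary-key check from item 4, assuming a hypothetical FACILITY table with an ID column: if RecordCount equals DistinctIds and NullIds is zero, ID can serve as a unique key.

-- Compare total rows with distinct, non-null ID values.
-- COUNT(DISTINCT ...) and IIF vary by SQL dialect; adjust as needed.
SELECT
    COUNT(*)                    AS RecordCount,
    COUNT(DISTINCT ID)          AS DistinctIds,
    SUM(IIF(ID IS NULL, 1, 0))  AS NullIds
FROM FACILITY;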
Data Profilers
• http://www.talend.com/products-data-quality/talendopen-profiler.php
• http://en.wikipedia.org/wiki/DataCleaner
• http://weblogs.sqlteam.com/derekc/archive/2008/05/20/60603.aspx
• http://www.dbaoracle.com/oracle_news/2005_12_29_Profiling_and_Cleansing_Data_using_OWB_Part1.htm
• Request form at www.scribekey.com for profiler (shareware)
Learn about and test profilers with your own data.
Schema Generation Options
• Use a data-driven approach: define the new schema as a more
formally defined version of the meaning, structure, and contents
of the source data.
• Use an external, independent target schema. Sometimes
this is a requirement.
• In a divide-and-conquer approach, use the data-driven schema first
as a staging schema. Improve the data in and of itself, then
consider ETL to a more formal, possibly standard, external
target schema.
• Use a hybrid, combining elements of both data-driven and external target schemas.
Data Model Differences: Production vs. Decision Support
Production: normalized for referential integrity; complex and
slower-performing queries; data is edited.
Decision support: de-normalized for easily formed and
faster-performing queries; data is read-only.
The data models and supporting tools used in data warehousing are
significantly different from those found across the geospatial community.
Geospatial data modelers tend to incorrectly use production models for
decision support databases.
Normalization
• Normalization can get complicated: 1st, 2nd, 3rd Normal
Forms, Boyce-Codd, etc.
• Some important basics:
– Don't put multiple values in a single field
– Don't grow a table column-wise by adding values
over time
– Have a primary key
• However, you should de-normalize when designing
read-only decision support databases to facilitate
easy query formation and better performance
De-Normalization and Heavy Indexing
Makes Queries Easier and Faster
• 1 De-Normalized Table:
SELECT TYPE, LOCATION FROM FACILITIES
• 3 Normalized Tables:
SELECT FACILITY_TYPES.TYPE, LOCATIONS.LOCATION
FROM (FACILITIES INNER JOIN FACILITY_TYPES ON FACILITIES.TYPE = FACILITY_TYPES.ID)
INNER JOIN LOCATIONS ON FACILITIES.LOCATIONID = LOCATIONS.ID;
• NAVTEQ SDC data is a good example: de-normalized (e.g., County Name
and FIPS), highly indexed, very fast and easy to use.
Distinguish between Decision Support and Production
Database Models !!! Use Both When Necessary
[Diagram: Production stack = Presentation Layer / Business Logic Middle Tier (UML, OO language) / Data Access Layer / OLTP Database. Decision support stack = Presentation Layer / no Middle Tier / Data Access Layer / OLAP Database.]
Production OLTP database solutions typically
use a middle tier for representing higher
level business objects and rules. This middle
tier is often designed using UML and
implemented with an Object Oriented
programming language.
Decision Support OLAP database
solutions typically have no Middle tier.
They present and access data directly
through query language behind pivot
tables and report generators.
Standardization, Modeling, and Mapping
Close the gap between the source data and the target schema
REAL RAW DATA (imperfect): inconsistent types, not normalized, no domains
-> Schema Gap / Solution ->
ABSTRACT SCHEMA (perfect): strong types, highly normalized, lots of domains
Use real data to inform and keep your modeling efforts focused
Database Refactoring Approach
• Patterns approach (Gang of Four book): great for new systems.
• Innovation: Martin Fowler's Code Refactoring: fix what you have.
• Agile Database: Scott Ambler's Refactoring Databases.
• You are not starting from scratch; you need to make modifications
to something which is being used.
[Figure: list of database refactorings]
Sidebar: Relationship Discovery and Schema Matching:
Entities, Attributes, and Domain Values
Standardized Schema Matching Relationships: each match record stores the following attributes:
1 Source Category
2 Source Entity
3 Target Category
4 Target Entity
5 Match Score
6 Match Type
7 SQL Where
8 Notes
Match types are basic set theory operations: Equal, Overlap, Superset, Subset, Null.
They can be used in end-user query tools, and can be used by other schema
matching efforts and combined for thesaurus compilation.
Schema matching is necessary to discover and specify how categories,
entities, attributes, and domains from one system map into another.
Matching discovers relationships, Mapping specifies transforms
These maps are stored in XML documents, FME, Pervasive, etc. As with
metadata, it can be useful to store these in an RDB as well.
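A minimal sketch of what such a match table could look like in the RDB (the table name and column types are assumptions, not a prescribed schema):

-- Hypothetical table for storing discovered schema-match relationships.
CREATE TABLE SchemaMatch (
    MatchId        INTEGER PRIMARY KEY,
    SourceCategory VARCHAR(100),
    SourceEntity   VARCHAR(100),
    TargetCategory VARCHAR(100),
    TargetEntity   VARCHAR(100),
    MatchScore     FLOAT,          -- e.g., 0.0 to 1.0
    MatchType      VARCHAR(20),    -- Equal, Overlap, Superset, Subset, Null
    SqlWhere       VARCHAR(255),   -- optional filter that scopes the match
    Notes          VARCHAR(255)
);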
Schema Matching: Map & Gap
ELEMENT    DEFINITION                           EXAMPLE                 LINGUISTICS
Equal      A is equal to B                      First Name              Synonym
Subtype    A is a more specific subtype of B    Supervisor - Employee   Hyponym
Supertype  A is a more general supertype of B   Employee - Supervisor   Hypernym
Part       A is a part of B                     Employee - Department   Meronym
Container  A is a container of B                Department - Employee   Holonym
Related    A is related to B                    Department - Operation
• A Gap Analysis exercise can precede a full mapping to get a
general idea on how two datasets relate.
• Gap Analysis is always in a direction from one set of data
elements to another.
• Simple scores can be added to give an overall metric of how
things relate.
Sidebar: Schema Matching Entities Is Also Important
[Diagram: entity-level matches, e.g., multiple hierarchical feature sets, node and edge networks, multiple geometric representations, multiple locations, multiple occurrences, and polygon-centroid relationships that must be declared equivalent.]
Well documented schema matching information, as metadata, helps reduce and/or
eliminate any confusion for integration developers and end users
Mechanics of ETL: The Heart of the Matter
• Change Name
• Change Definition
• Add Definition
• Change Type
• Change Length
• Trim
• Use List Domain
• Use Range Domain
• Split
• Merge
• Reformat
• Use a Default
• Create View
• Change Case
• Add Primary Key
• Add Foreign Key
• Add Constraints, Use RDB
• Split Table to N->1
• Pivot
• Merge Tables
• Remove Duplicates
• Remove 1 to 1 Tables
• Fill in Missing Values
• Remove Empty Fields
• Remove Redundant Fields
• Verify with 2nd source
Use a Staging Data Store, Separate MR
Source: keep a definitive snapshot copy of the source; don't change it.
Staging: execute ETL in a staging data store. Expect multiple iterations;
temporarily relaxed data types will be necessary.
MR: don't mix actual data with metadata repository information; keep
separate databases.
Target: build the final target dataset from the staging data store.
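A minimal sketch of the snapshot-into-staging step, assuming hypothetical table names and a dialect that supports SELECT ... INTO (other dialects use CREATE TABLE ... AS SELECT):

-- Copy the source snapshot into a staging table that ETL is free to modify.
SELECT * INTO Staging_Facilities
FROM Source_Facilities;

-- Later, once cleaning is complete, build the target from staging.
SELECT * INTO Target_Facilities
FROM Staging_Facilities;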
Choosing the Right Tool(s)
• SQL
• FME
• ESRI Model Builder
• Pervasive (formerly Data Junction)
• Microsoft SQL Server Integration Services
• Oracle Warehouse Builder
• Talend
• C#/VB.NET/OLE-DB
• Java/JDBC
• Scripts: VB, JS, Python, Perl
• Business Objects, Informatica
Make the best use of the skills you have on your team (database-oriented
vs. code-oriented situations and teams), and use a combination of methods.
Sidebar: Use SQL Views to Simplify Read Only Data
• SQL Views provide a flexible and easy
mechanism for de-normalizing data
coming from production databases.
• Views are typically what the end user
in a database application sees. They
hide the multi-table complexity lying
underneath in the physical model.
• Think of the analysis database as a user
that doesn’t need or want to see this
underlying complexity.
• Good approach for generating GIS
parcel datasets from CAMA property
records.
• Can instantiate views as hard tables and update them on a regular
basis with batch SQL (see the sketch below).
[Diagram: a SQL VIEW presented to users, built over multiple linked tables.]
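A minimal sketch of such a de-normalizing view, reusing the hypothetical FACILITIES / FACILITY_TYPES / LOCATIONS tables from the earlier join example:

-- The view hides the multi-table join; read-only consumers query it like a flat table.
CREATE VIEW FacilityFlat AS
SELECT FACILITIES.ID,
       FACILITY_TYPES.TYPE,
       LOCATIONS.LOCATION
FROM (FACILITIES INNER JOIN FACILITY_TYPES
      ON FACILITIES.TYPE = FACILITY_TYPES.ID)
     INNER JOIN LOCATIONS
     ON FACILITIES.LOCATIONID = LOCATIONS.ID;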
ETL: SQL Data Manipulation
• left(Address, instr(Address, ' ')) as AddressNum
• left(POCName, instr(POCName, ' ')) as FirstName
• right(POCName, len(POCName)-instr(POCName, ' ')) as LastName
• int(numEmployees) as NumPersonnel
• right(Address, len(Address)-instr(Address, ' ')) as StreetName
• '(' & [Area Code] & ')' & '-' & Tel as Telephone
• iif(instr(Tel, 'x')>0, right(Tel, len(Tel)-instr(Tel, 'x')), null) as TelExtension
• ucase(Town) as City
• iif(len(Zip)=5, null, right(Zip,4)) AS Zip4
• iif(len(Zip)=5, Zip, left(Zip,5)) AS Zip5
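As a sketch of how these expressions are typically applied (the table names are hypothetical, and the functions are the same Access-style functions used above), several of them can be combined into one SELECT that builds a cleaned staging table in a single pass:

-- Derive cleaned columns from the raw source in one statement.
SELECT
    left(Address, instr(Address, ' '))                AS AddressNum,
    right(Address, len(Address)-instr(Address, ' '))  AS StreetName,
    ucase(Town)                                       AS City,
    iif(len(Zip)=5, Zip, left(Zip,5))                 AS Zip5
INTO Staging_Facilities
FROM Source_Facilities;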
ETL: Look Up tables
• Clean and consistent domains
are one of the most important
things you can use to help
improve data quality.
• As example, consider use of
single central master street
list(s) for state, town, utility,
etc.
• One approach is to collect all
of the variations in a single
look up table and match them
with the appropriate value
from the master table.
Original      Master
Main St.      MAIN ST
Elm St.       ELM ST
ELM STREET    ELM ST
Main Street   MAIN ST
NORTH STREET  NORTH ST
North         NORTH ST
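A hedged sketch of applying such a look-up table during ETL (table and column names are assumptions): each variant street value in staging is replaced by its master value.

-- Replace each original street value with the master value from the look-up table.
-- The UPDATE ... INNER JOIN form is Access/MySQL style; other dialects use UPDATE ... FROM.
UPDATE Staging_Facilities
INNER JOIN StreetLookup
    ON Staging_Facilities.Street = StreetLookup.Original
SET Staging_Facilities.Street = StreetLookup.Master;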
ETL: Domain and List Cleaning Tools
• There are powerful tools available to help match
variation and master values.
• Example: SQL Server Integration Services Fuzzy Look Up
and Fuzzy Grouping Tools:
http://msdn.microsoft.com/en-us/library/ms137786.aspx
• These can be used to create easily reusable batch
processing tools.
• These list cleansing tools are often unfamiliar to
geospatial data teams.
• The saved match lists are also valuable for processing
new data sets.
ETL: Regular Expressions
• Regular Expressions provide a powerful means for
validation and ETL, through sophisticated matching
and replacement routines.
• Example: Original Notes field has line feed control
characters. We want to replace them with spaces.
• Solution: Match “[\x00-\x1f]“ Replace With: “ “
• Need code as C#, VB, Python, etc.
newVal = Regex.Replace(origVal, regExpMatch, regExpReplace);
ETL: Regular Expressions (cont.)
RegExp                               Name
^[1-9]{3}-\d{4}$                     ShortPhoneNumber
^[0-9]+$                             PosInteger
^[1-9]{3}-[1-9]{3}-\d{4}$            LongPhoneNumber
^[-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$  Double
(^[0-9]+$)|(^[a-z]+$)                AllLettersOrAllNumbers
Start and grow a list of regular expression match
and replace values.
Keep these in the Metadata Repository
Need code as C#, VB, Python, etc.
ETL: Custom Code is Sometimes Required
• There is no simple SQL solution.
• The problem is more complicated than a simple name, type, or
conditional value change.
• You use your preferred programming language and write custom ETL.
ETL: Extending SQL
• Most dialects of SQL (Oracle, SQL Server, Access, etc.) allow you to
develop custom functions in a language like C#, Java, etc.
• For example, you can build a RegExpMatchAndReplace function and
then use it in SQL.
• You can also add table look up, scripting, etc.
• Very powerful and flexible approach
• Example:
UPDATE FACILITIES SET
NOTES = REGEXP(“[\x00-\x1f]“ , “ “),
FACILITY_TYPE = LOOKUP(THISVAL)
ETL: Use Geo Processing to Fill In Attributes
• Example: we want to fill in missing Zip4 values for Facility points
from a polygon set.
• This is particularly valuable for creating hierarchical
roll-up/drill-down aggregation datasets.
• Use Arc Toolbox Spatial Join.
[Table: Facility Zip5 values (01234, 01235) joined to their Zip4 values (1234, 5678, 6789, 3456, 0123) via the spatial join.]
ETL: Breaking Out N to 1
• This problem occurs
very frequently when
cleaning up datasets.
• We have repeating
columns to capture
annual facility
inspections.
• This data should be
pivoted and moved to
another child table
• We can use SQL and
UNION capability to get
what we want.
• Can also reformat for
consistency at the
same time.
ETL: Use of Union To Pivot and Break Out
SELECT Id as FacilityId, 2000 as Year,
       iif(ucase(Insp2000) = 'Y' or ucase(Insp2000) = 'T', 'Yes', Insp2000) as Inspected
from AcmeFac
UNION
SELECT Id as FacilityId, 2005 as Year,
       iif(ucase(Inspect05) = 'Y' or ucase(Inspect05) = 'T', 'Yes', Inspect05) as Inspected
from AcmeFac
UNION
SELECT Id as FacilityId, 2010 as Year,
       iif(ucase(Insp_2010) = 'Y' or ucase(Insp_2010) = 'T', 'Yes', Insp_2010) as Inspected
from AcmeFac
ETL: One-Off vs. Repeatable Transforms
• Relying on manual tweaks and adjustments to complete and fill in
correct data values is problematic if you need to set up repeatable
ETL processes.
• It’s much harder and more complicated to set up
repeatable, ordered ETL routines, but well worth it if the
data is going to be updated on an on-going basis.
• Example: a dataset is cleaned with lots of SQL, scripts,
manual tweaks, etc. When a newer dataset is made
available, the same tasks need to be repeated, but the
details and the order in which they were performed were
not saved.
• Suggestion: Be very aware of whether you are doing ETL
as a one-off vs. something that will have to be repeated,
and plan accordingly, save your work.
Bite the Bullet: Manual ETL
• Sometimes the data problems
are so tricky that you decide to
do a manual clean up.
• You could probably come up with code containing a very large
number of conditional tests and solutions (a brain-teaser), but it
would take longer than just cleaning the data by hand.
• Depends on whether you are
doing a one-off or need to build
something for repeatable
import.
• This also applies to positioning
geospatial features against a
base map or ortho-image, e.g.,
after geocoding, etc. for
populating x,y attributes.
Site Plan Location
rm 203 drawer A
112 B
Room 100 dr. 1
Bld A Rm 209 Drawer 11
200 20
rm 500 d
59 d 33
Jones
Rm G12 Bld 9
Room 100 Drawer 10
8 30 6
Checking Against External Sources
• One of the only ways to
actually ensure that a list of
feature values is complete is
by checking against an
external source.
• In this case, the data in and
of itself, does not necessarily
provide a definitive version
of the truth.
• You cannot tell what may be missing or what may be incorrectly
included in the list.
• Get redundant external
information whenever it’s
available.
• In some cases the only way to
fill in a missing value is to
contact the original source of
the data.
• This can be highly granular
and time consuming.
• Need to make decision on
how important it is to have
100% complete data.
• This can be a case of
diminishing returns.
Storing ETL Details in the Metadata Repository
TargetEntity  TargetAttribute  ETLType    Params / SourceAttribute
Facility      AreaCode         SQLSelect  left(PhoneNumber,3)
Facility      StreetName       LookUp     StreetList, Address
Facility      Notes            RegExp     RemoveControlChars
Facility      SizeSqFeet       Script     CalcFacilityArea
Facility      Zip9             SQLSelect  Zip5 & '-' & Zip4
Facility      FacilityType     LookUp     FacilityType, FacilityCategory
Facility      FacilityName     RegExp     RemoveMultiSpace
Facility      SizeAcres        Script     CalcFacilityArea
• The combination of the MR and code based tools provides a very
flexible and powerful environment for improving data quality.
• Many actions and parameters can be stored in the MR including
LookUp, RegExp, SQL Select Clauses, and even runtime evaluation
routines for code and scripts.
• Example: Translator makes use of configurable ETL Map and
Action table found in the Metadata Repository
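A minimal sketch of such an ETL map/action table in the metadata repository (the column names follow the example above; the exact schema is an assumption):

-- Hypothetical MR table that a translator can read to drive ETL actions.
CREATE TABLE EtlMap (
    EtlMapId        INTEGER PRIMARY KEY,
    TargetEntity    VARCHAR(100),
    TargetAttribute VARCHAR(100),
    EtlType         VARCHAR(20),    -- SQLSelect, LookUp, RegExp, Script
    Params          VARCHAR(255),   -- SQL expression, look-up list, regexp, or script name
    SourceAttribute VARCHAR(100)
);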
ETL: Staging Table Design and SQL Command Types
• Separate Source and Target
tables, requires joins.
• Can merge Source and
Target into Staging table.
• Decide what kind of
complexity is preferred.
• Can also split attributes and
geometry, size factor, and
use keys, initial insert, then
updates, then recombine.
• Build a single large SQL
statement for creating view
or hard table from results.
Option A: two separate Source and Target tables; requires joins; uses INSERT, UPDATE, SELECT.
Option B: one merged Staging table (column names must be unique); harder to create but needs no joins; uses UPDATE, then SELECT out.
ETL: Loop Design: Row, Column, Value Processing Order
• Row-wise: SQL errors will fail for the entire row, with no values
changed; uses UPDATE, INSERT, SELECT INTO.
• Column-wise: a single UPDATE statement against all columns in the
table; fast performance, making use of the database engine (see the
sketch below).
• Cell-wise: UPDATE handles each Row.Column value individually; high
granularity and control, slower performance.
[Diagram: the same grid of rows 1-3 and columns A-C processed row-wise, column-wise, and cell-wise.]
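A hedged sketch of the column-wise style, assuming a hypothetical staging table and the Access-style functions used earlier: one UPDATE cleans several columns in a single pass through the engine.

-- Column-wise cleaning: one statement, every row, several columns at once.
UPDATE Staging_Facilities
SET Town  = ucase(trim(Town)),
    State = ucase(trim(State)),
    Zip5  = iif(len(Zip)=5, Zip, left(Zip,5));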
Finding Errors and Validation
• Run a post-ETL Profile to see before and after data
• Contents checker, which needs names, looks at actual values and
checks against List Domains, Range Domains, and Regular Expressions
• Output, describing data problems, is written to an audit table and is
used to help find and fix errors. Audit info has table, key to lookup,
column name, problem, and value.
Table       Id     Column        Rule             Value
Facilities  101    Name          Not Null         (blank)
Facilities  25543  SubType       Domain: MyList   Shipin
Facilities  563    NumPersonnel  Range: 1 - 1000  -500
Facilities  223    Inspected     RegExp: Date     0/930/1955
Question: How do you validate data after it has been
transformed?
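A minimal sketch of one such content check, assuming the hypothetical Facilities table and audit layout above: rows violating the range rule are written to the audit table.

-- Range-domain check: record every NumPersonnel value outside 1-1000.
-- Bracketed identifiers are Access/SQL Server style for reserved words.
INSERT INTO Audit ([Table], Id, [Column], Rule, [Value])
SELECT 'Facilities', Id, 'NumPersonnel', 'Range: 1 - 1000', NumPersonnel
FROM Facilities
WHERE NumPersonnel NOT BETWEEN 1 AND 1000;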
Overall Integration Operations Tracking
• Very helpful to record and document all of the different
operations being performed against the data.
• In general, record who did what, when, how, and why, plus
any other detailed information.
• Need to track both manual and automatic processing - tools
can automatically record their activities in a tracking table –
more difficult to track people’s activities
Action      Who  Date       Notes
Collection  JS   3/24/2008
Profiling   BH   2/30/2007  Did not complete, need new data
Profiling   RG   7/7/1008   Completed
Mapping     PB   6/30/2008  Dataset incomplete
Question: How do you handle tracking overall operations on
data?
Repeat Data Updates and/or New Data
• A good test of how well the data integration operations have
succeeded is to process a new more up-to-date dataset
destined to replace existing data in the system.
• This is where keeping track of what was done to the data, and
in what order, is so important.
• Validate structure and contents. Changes in structure or
content require a re-tooling of the mapping information.
• Consider next having to include a new and different data source,
e.g., merging organizations combining facility data.
Question: How do you handle refreshing data from a new set of
update data?
Agile vs. Waterfall Workflows
ITERATIVE/EVOLUTIONARY:
Perform each step with a subset of material and functionality. Loop back and continue
with lessons learned and more information. Divide and conquer.
Like steering a car: you can't fix your position at the onset; you need to adjust as
unforeseen conditions are encountered. An analogy to sculpting. Do something useful
in a reasonable amount of time.
[Diagram: numbered iterations looping through the workflow steps, delivering faster benefits for end users.]
WATERFALL:
Perform each step completely before moving to the next.
Use Clear Communication Artifacts
[Diagram (The Tower of Babel): GIS users think in layers, attributes, symbols, ...; data modelers and standards bodies think in UML, XSD, GML, ISO 19XXX, ...]
Use table-centric documents and models, e.g., RDB, Excel, HTML, to
communicate with end users and stakeholders in addition to UML, XSD, etc.
It's not an either/or: we need both.
Data Dictionary and Metadata
• When you’re done cleaning up the data, make sure you fully
describe meaning, structure, and contents metadata in a data
dictionary.
• Must haves: Who, What, When, Where, Why, Accuracy, and
Attribute Definitions!
• Present the data dictionary in an easy-to-use tabular format,
e.g., Excel, HTML.
• Ideally the metadata should live in the database alongside
the data it is describing.
• Separating metadata from data, and using fundamentally
different physical formats and structures, leads to serious
synchronization problems.
• Data providers are encouraged to produce FGDC/ISO
geospatial metadata as a report from the repository.
Example Census Data Dictionary
[Screenshot: data dictionary viewed in an HTML browser]
Sidebar: FGDC/ISO XML Metadata in an RDB
[Diagram: XML metadata files are imported into the RDB and can be exported back out as XML.]
When this metadata is imported
into an RDB, the full flexibility of
SQL is available for very flexible
management and querying a large
collection of metadata as a set.
NUM ELEMENT
1 Originator
2 Publication_Date
3 Title
4 Abstract
5 Purpose
6 Calendar_Date
7 Currentness_Reference
8 Progress
9 Maintenance_and_Update_Frequency
10 West_Bounding_Coordinate
11 East_Bounding_Coordinate
12 North_Bounding_Coordinate
13 South_Bounding_Coordinate
14 Theme_Keyword_Thesaurus
15 Theme_Keyword
16 Access_Constraints
17 Metadata_Date
18 Contact_Person
19 Address_Type
20 Address
21 City
22 State_or_Province
23 Postal_Code
24 Contact_Voice_Telephone
25 Metadata_Standard_Name
26 Metadata_Standard_Version
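A hedged sketch of the kind of set-oriented query this enables, assuming the imported elements land in a hypothetical Metadata table with one row per dataset:

-- Find, across the whole collection, every dataset tagged with a given theme keyword,
-- along with its extent.
SELECT Title, West_Bounding_Coordinate, East_Bounding_Coordinate
FROM Metadata
WHERE Theme_Keyword = 'Transportation';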
Sidebar: XML Metadata After Import into RDB – Hierarchy Preserved
The imported rows preserve each element's Name, NodeValue, ParentName, ParentId, LineageId, and LineageName, so the original XML hierarchy can be reconstructed from the table. Example element/value pairs from a nautical navigational-aids metadata record include:
origin = NOAA; pubdate = 05/21/2004; title = NauticalNAVAIDS; geoform = vector digital data; purpose = Nautical Navigational Aids: US waters, Homeland Security; langdata = en; caldate = 20040528; time = unknown; current = publication date; progress = In work; update = Unknown; westbc = -178.047715; eastbc = 174.060288; northbc = 65.634111; southbc = 17.650000; themekey = Nautical Navigational Aids
Lineage names such as metadata.idinfo.citation.citeinfo.title, metadata.idinfo.descript.purpose, metadata.idinfo.spdom.bounding, and metadata.idinfo.keywords.theme.themekey record where each value sits in the FGDC hierarchy.
Meta-Map for Data QA/QC
Map metadata to summarize and highlight datasets by validation
metadata.
Applications: Business/Geo-Intelligence Pivot Tables/Maps
View and analyze both data
and metadata, quality,
completeness, etc.
[Diagram: hierarchical drill-down from STATE to COUNTY to TOWN to CENSUS TRACT, each level showing datasets A, B, C.]
Business Intelligence data exploration/viewing solutions make heavy use of pivot
tables, drill-down, drill-through. With a data warehousing approach, geospatial
intelligence solutions can use a similar approach, with maps
Use a Data Integration Repository Database
[Diagram: the Data Integration Repository at the center, surrounded by the data layers (A, B, C) and the artifacts it holds and produces: Metadata, Enhanced User Views, Pivot Tables, Areas, Documents, Assessments, Data Dictionary, ETL, Entities, Attributes, Validation Domains, Derivative Datasets, Meta-Maps, and Schemas.]
The Data Integration Repository, implemented as an RDBMS, can be
populated by both manual and automated methods and then used
to generate metadata outputs, data dictionary content, schemas,
maps, etc.
Data Quality Knowledge, Tools and Techniques
There is a wide variety of highly developed, metadata-repository-centric
data quality knowledge, refactoring approaches, tools, and techniques
available in mainstream IT data warehousing that can be used to help
improve geospatial datasets.
Sidebar: Physical Formats and Simplicity
• If data is in table format, CSV can be much easier to work with
than XML, and can be 1/10th the size.
• Look at newer, smaller, less-verbose data exchange formats,
e.g., JSON: http://www.json.org/
• XML is best suited for variable-length data structures and
nesting, e.g., object models.
• RESTful web services vs. SOAP.
• Keep it simple.
Recommendation: Use a Broader Array of Mainstream IT
Tools and Techniques to Solve Data Quality Problems
(Look Outside of the Geospatial World)
• Decision Support, Data Warehousing
• RDBMS Metadata Repositories
• Data Profiling, Refactoring
• Business Intelligence, ETL, OLAP Cubes
• Structured vs. Unstructured Data Access, Semantic Web
• Flexibility through standard RDBMS logical/physical
separation and the use of Views
• Agile Solution Development
• Data Quality Paradigm
• Lean Manufacturing
Recap: Take-Aways
• Data quality is determined by end users' understanding of meaning, structure, and
contents.
• Look at data quality and data integration tools and techniques from mainstream IT,
data warehousing, and business intelligence.
• CLEARLY DISTINGUISH BETWEEN DECISION-SUPPORT AND PRODUCTION DATABASE
MODELS - DON'T USE A HIGHLY NORMALIZED SCHEMA FOR DECISION SUPPORT
DATABASES !!! MAKE THE DB EASY/FAST TO QUERY.
• Use database profiling and refactoring approaches.
• Use a relational database (metadata repository (MR)) to capture, centralize, and
manage data quality and integration information.
• Distinguish one-offs from on-going updates and build repeatable ETL processing when
necessary.
• SQL, coding skills, and regular expressions for data manipulation are all important.
• Choose tools that leverage the skills you have on hand, preferred languages/scripts, etc.
• Communicate clearly with stakeholders, end users, and your team, with table-centric artifacts.
• Don't ignore data meaning as an important element in data quality; build data dictionaries
(metadata), use clear definitions, don't make up words.
• Save ETL mappings, work, notes, scripts, etc., to help grow and reuse skills (MR).
• Use an iterative, Agile approach to help ensure you reach goals in a timely manner.
Related Papers and Tools
Database Template Kit
Municipal Permit Tracking System
http://www.mass.gov (search for MPTS) - lots of SQL data cleansing info
NEARC Fall 09 Presentation:
How Data Profiling Can Help Clean and Standardize GIS Datasets
NEARC Spring 10 Presentation
Using Meta-Layers for GIS Dataset Tracking and Management
http://gis.amherstma.gov/data/SpringNearc2010/C_Session3_01150245/GovPlanning/KeepingTrackOfAllThoseLayers.pdf
Thank You
Questions and Answers