Finding and Fixing Data Quality Problems
NEARC Fall 2010, Newport, RI
Brian Hebert, Solutions Architect, www.scribekey.com

Goal: Help You Improve Your Data
• Provide a definition for data quality
• Consider data quality within the context of several data integration scenarios
• Suggest a framework and workflow for improving data quality
• Review tools and techniques, independent of specific products and platforms
• Help you plan and execute a data quality improvement project or program
• Review take-aways and Q&A

Essential Data Quality Components
Meaning, Structure, Contents: data is well understood, well structured, and fully populated with the right values FOR END USE.
Note: These fundamental elements of data quality overlap.

Data Quality (DQ) Defined
Meaning: The names and definitions of all layers and attributes are fully understood and clear to the end-user community (a.k.a. semantics).
Structure: The appropriate database design is used, including attribute data types, lengths, formats, domains (lookup tables), and relationships.
Contents: The actual data contents are fully populated with valid values and match the meaning and structure.
Metadata: Meaning, structure, and contents are described in a data dictionary or a similar metadata artifact.

Scenarios: DQ Improvement as Data Integration
1) You want to improve the data quality of a standalone, independent dataset. Some aspect of its meaning, structure, or contents can be improved. (Source -> Target)
2) You want to combine multiple disparate datasets into a single representation. Departments, organizations, systems, or functions are merging or need to share information. (Source1 + Source2 -> Target)
For both cases, many of the same tools and techniques can be used. In fact, in a divide-and-conquer approach, it is often beneficial to start with scenario 1.

Typical Data Quality/Integration Situations
Data is in different formats, schemas, and versions, but provides some of the same information. Examples:
• You need to clean up a single existing dataset
• Two departments in a utility company: customer billing and outage management, central DB and field operations
• Merging two separate databases/systems: getting town CAMA data into GIS
• Consolidating N datasets: MassGIS Parcel Database, CDC disease records from individual states
• Two city/state/federal organizations: transportation and emergency management need a common view
• Preparing for Enterprise Application Integration: wrapping legacy systems in XML web services

Scenario 1 Case Study: Cleaning Up Facility Data
• An organization maintains information on facility assets.
• The information includes data describing basic location, facility type, size, function, and contact information.
• The organization needs a decision support database.
• The data has some quality issues.
• The case is somewhat generic; it could apply to buildings, complexes, sub-stations, exchange centers, industrial plants, etc.
• The idea: you will likely recognize some of your own data in this case.

Solution: Workflow Framework and Foundation
[Diagram: a data integration support database (central RDB) surrounded by ordered, iterative operations applied to sources A, B, C: Inventory -> Collection -> Data Profiling -> Standardize & Map/Gap -> Schema Generation -> ETL -> Validation -> Applications.]
Solution Support: Ordered Workflow Steps
• Inventory: What do we have, where, who, etc.?
• Collection: Get some samples.
• Data Profiling and Assessment: Capture and analyze the meaning, structure, and contents of the source(s).
• Schema Generation: One or several steps to determine what the target data should look like.
• Schema Gap and Map: What are the differences? A description of how we get from A to B.
• ETL: The physical implementation of getting from A to B: code, SQL, scripts, etc.
• Validation: Do the results match the goals?
• Applications: Test the data through applications and aggregations.
• Repeat Processing for Updates: Swap in a newer version of data source A.

Inventory: The Dublin Core (+)
NUM  ELEMENT      DEFINITION
1    Contributor  An entity responsible for making contributions to the resource.
2    Coverage     The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
3    Creator      An entity primarily responsible for making the resource.
4    Date         A point or period of time associated with an event in the lifecycle of the resource.
5    Description  An account of the resource.
6    Format       The file format, physical medium, or dimensions of the resource.
7    Identifier   An unambiguous reference to the resource within a given context.
8    Language     A language of the resource.
9    Publisher    An entity responsible for making the resource available.
10   Relation     A related resource.
11   Rights       Information about rights held in and over the resource.
12   Source       The resource from which the described resource is derived.
13   Subject      The topic of the resource.
14   Title        A name given to the resource.
15   Type         The nature or genre of the resource.
http://dublincore.org/documents/dces
Question: How do you capture information on existing data?

Multiple Data Description Sources for Inventory
Gather information about data from a variety of sources: website, documentation, metadata, email, people/SMEs, and the data itself.

The Data Profile: Meaning, Structure, Contents
The Table Profile is helpful for getting a good overall idea of what's in a database:
NUM  ELEMENT        DEFINITION
1    DatasetId      A unique identifier for the dataset
2    DatabaseName   The name of the source database
3    TableName      The name of the source database table
4    RecordCount    The number of records in the table
5    ColumnCount    The number of columns in the table
6    NumberOfNulls  The number of null values in the table

The Column Profile is helpful for getting a detailed understanding of database structure and contents:
NUM  ELEMENT          DEFINITION
1    DatasetId        A unique identifier for the dataset
2    DatabaseName     The name of the database
3    TableName        The name of the database table
4    ColumnName       The name of the data column
5    DataType         The data type of the column
6    MaxLength        The max length of the column
7    DistinctValues   The number of distinct values used in the column
8    PercentDistinct  The percentage of distinct values used in the column
9    SampleValues     A sampling of data values used in the column
10   MinLengthValue   The minimum length data value
11   MaxLengthValue   The maximum length data value
12   MinValue         The minimum value
13   MaxValue         The maximum value

How Data Profiling Works
The profiler is an application that reads through data (e.g., Roads, Parcels, Buildings layers) and captures names, structure, contents, patterns, and summary statistics into the integration support DB / data dictionary. Where FGDC XML metadata exists, it can be imported as well; where there is no metadata, the profile and end users fill the gap. You can also learn about data through documentation and end users.
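Concretely, a column profile row like the ones defined above can be produced with a single aggregate query. The following is a minimal sketch, not the profiler's actual implementation: the table and column names come from the Acme facility case study, and LEN() is SQL Server syntax (LENGTH() in many other dialects).

    -- Profile one column (ADDRESS in FACILITY); a profiler runs a query like
    -- this per column and writes one row each into a ColumnProfile table in
    -- the integration support database.
    SELECT
        'Acme'                     AS DatabaseName,
        'FACILITY'                 AS TableName,
        'ADDRESS'                  AS ColumnName,
        COUNT(*)                   AS RecordCount,
        COUNT(*) - COUNT(ADDRESS)  AS NumberOfNulls,
        COUNT(DISTINCT ADDRESS)    AS DistinctValues,
        100.0 * COUNT(DISTINCT ADDRESS) / COUNT(*) AS PercentDistinct,
        MIN(LEN(ADDRESS))          AS MinLength,
        MAX(LEN(ADDRESS))          AS MaxLength,
        MIN(ADDRESS)               AS MinValue,
        MAX(ADDRESS)               AS MaxValue
    FROM FACILITY;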
Data Profiling: Table Information
Database  Name      Records  Columns  Values  Nulls  Complete  Column Names
Acme      FACILITY  53       14       636     19     97.01     ID, TYPE, FIRST_NAME, LAST_NAME, POS, ADDRESS, TOWN_STATE, ZIP, State, Tel, SECTION, AREA, INSPECT
Techno    OFFICE    12       12       168     0      100       ID, FacilId, Num, Name, Address, Town, AreaCode, Tel, Job, Dept, Value
• The Table Profile gives a good overview of record counts, number of columns, nulls, general completeness, and a list of column names.
• It is very helpful for quickly getting an idea of what's in a database and for comparing sets of data which need to be integrated.

Data Profiling: Column Information
Database  Table     Attrib Name  Type    MaxLength  Distinct  % Distinct  Nulls  Complete  Sample Values
Acme      Facility  ADDRESS      String  50         50        94.34       3      94.3      10 STUART PL., 100 MAIN STREET, 100 Thompson Pl, 125 Stratford Pl, ...
Acme      Facility  TYPE         String  50         …         …           0      100       Corporate, Finance, Maintenance, HR, ...
Acme      Facility  FIRST_NAME   String  50         51        96.2        …      …         NULL, ...
• The Column Profile provides detailed information on the structure and contents of the fields in the database.
• It provides the foundation for much of the integration work that follows.

Data Profiling: Domain Information
• Domains provide one of the most essential data quality control elements.
• List domains are lists of valid values for given database columns, e.g., standard state abbreviations MA, CT, RI, etc. for State.
• Range domains provide minimum and maximum values, primarily used for numeric data types.
• Many domains can be discovered and/or generated from the profile. For example, a discovered domain for a POS (position) column might contain: Accountant, admin, Administration, ARCH, Architect, CEO, CFO, Consultant, CTO, Data Collection, Data Entry, Database, Database Admin, DBA, Developer, Dismissed, Lead Developer, Lead programmer — an inconsistent list in clear need of standardization.
Question: Do you use profiling to get a concise summary of data details?

Workshop Exercise: Example Profile
[Screenshot: an example profile.]

Workshop Exercise: Column Profile Analysis
The profile itself is generated automatically. The real value is in the results of the analysis: what needs to be done to the data?
1) Are the column name and definition clear? Are there other columns in other tables with the same kind of data and intent, with a better name?
2) Is the data type appropriate? E.g., numeric data and dates can often be found in text fields.
3) Is the length appropriate?
4) Is this a unique primary key? Does the number of values equal the number of records? Is it a foreign key?
5) Is this column being used? Is it empty? How many null values are there? What percent complete is the value set?
6) Is there a rule which can be used to validate data in this column, such as:
   – a list domain or range domain
   – a specific format (regular expression)
   – other rules, possibly involving other columns?
7) Does the data indicate that the column should be split into another table using a 1->N parent-child relationship?

Data Profilers
• http://www.talend.com/products-data-quality/talend-open-profiler.php
• http://en.wikipedia.org/wiki/DataCleaner
• http://weblogs.sqlteam.com/derekc/archive/2008/05/20/60603.aspx
• http://www.dba-oracle.com/oracle_news/2005_12_29_Profiling_and_Cleansing_Data_using_OWB_Part1.htm
• Request form at www.scribekey.com for a profiler (shareware)
Learn about and test profilers with your own data.
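Since many domains can be discovered from the profile, a frequency count over a column's distinct values is often the quickest way to propose a candidate list domain. A minimal sketch against the POS column of the Acme FACILITY table discussed above:

    -- Candidate list domain for POS (position/job title).
    -- High-count values are likely valid domain members; the singletons and
    -- case variants at the bottom are candidates for cleanup.
    SELECT POS, COUNT(*) AS Occurrences
    FROM FACILITY
    GROUP BY POS
    ORDER BY COUNT(*) DESC, POS;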
Schema Generation Options
• Use a data-driven approach: define the new schema as a more formally defined version of the meaning, structure, and contents of the source data.
• Use an external, independent target schema. Sometimes this is a requirement.
• In a divide-and-conquer approach, use the data-driven schema first as a staging schema. Improve the data in and of itself. Then consider ETL to a more formal, possibly standard, external target schema.
• Use a hybrid, combining elements of both data-driven and external target schemas.

Data Model Differences: Production vs. Decision Support
• Production: normalized for referential integrity; complex and slower-performing queries; data is edited.
• Decision support: de-normalized for easily formed and faster-performing queries; data is read-only.
The data models and supporting tools used in data warehousing are significantly different from those found across the geospatial community. Geospatial data modelers tend to incorrectly use production models for decision support databases.

Normalization
• Normalization can get complicated: 1st, 2nd, 3rd normal forms, Boyce-Codd, etc.
• Some important basics:
  – Don't put multiple values in a single field
  – Don't grow a table column-wise by adding values over time
  – Have a primary key
• However, you should de-normalize when designing read-only decision support databases to facilitate easy query formation and better performance.

De-Normalization and Heavy Indexing Make Queries Easier and Faster
• 1 de-normalized table:
    SELECT TYPE, LOCATION FROM FACILITIES
• 3 normalized tables (FACILITIES, FACILITY_TYPES, LOCATIONS):
    SELECT FACILITY_TYPES.TYPE, LOCATIONS.LOCATION
    FROM (FACILITIES INNER JOIN FACILITY_TYPES ON FACILITIES.TYPE = FACILITY_TYPES.ID)
    INNER JOIN LOCATIONS ON FACILITIES.LOCATIONID = LOCATIONS.ID;
• NAVTEQ SDC data is a good example: de-normalized (e.g., County Name and FIPS), highly indexed, very fast and easy to use.

Distinguish Between Decision Support and Production Database Models — Use Both When Necessary
Production OLTP database solutions typically use a middle tier (presentation layer, business-logic middle tier, data access layer, OLTP database) for representing higher-level business objects and rules. This middle tier is often designed using UML and implemented with an object-oriented programming language. Decision support OLAP database solutions typically have no middle tier; they present and access data directly through query language behind pivot tables and report generators.

Standardization, Modeling, and Mapping
Close the gap between the source data and the target schema:
• Real raw data: inconsistent types, not normalized, no domains, imperfect.
• Abstract schema: strong types, highly normalized, lots of domains, perfect.
Use real data to inform and keep your modeling efforts focused.

Database Refactoring Approach
• The patterns approach (the Gang of Four book) is great for new systems.
• The innovation: Martin Fowler's code refactoring — fix what you have.
• Agile database: Scott Ambler, Refactoring Databases.
• You are not starting from scratch; you need to make modifications to something which is already being used. Work from a list of refactorings, one of which is sketched below.
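As an illustration of one common refactoring from such a list, the repeated TYPE text column in the facility example can be extracted into a lookup table with a surrogate key. This is a sketch only, using SQL Server syntax (IDENTITY, SELECT INTO); the DDL varies by RDBMS, and in practice each step would be verified before the old column is dropped.

    -- 1) Build the lookup table from the distinct values already in use,
    --    with a generated surrogate key.
    SELECT IDENTITY(INT, 1, 1) AS ID, TYPE AS TypeName
    INTO FACILITY_TYPES
    FROM (SELECT DISTINCT TYPE FROM FACILITIES) AS D;

    -- 2) Add a foreign key column and back-fill it from the lookup.
    ALTER TABLE FACILITIES ADD TYPEID INT;

    UPDATE FACILITIES
    SET TYPEID = (SELECT ID FROM FACILITY_TYPES
                  WHERE FACILITY_TYPES.TypeName = FACILITIES.TYPE);

    -- 3) Once verified, drop the old text column and declare the
    --    foreign key constraint.
    ALTER TABLE FACILITIES DROP COLUMN TYPE;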
Sidebar: Relationship Discovery and Schema Matching: Entities, Attributes, and Domain Values
A standardized schema match record captures: 1 Source Category, 2 Source Entity, 3 Target Category, 4 Target Entity, 5 Match Score, 6 Match Type, 7 SQL Where, 8 Notes.
• Match types are basic set theory operations: Equal, Overlap, Superset, Subset, Null.
• Matches can be used in end-user query tools.
• They can also be used by other schema matching efforts and combined for thesaurus compilation.
Schema matching is necessary to discover and specify how categories, entities, attributes, and domains from one system map into another. Matching discovers relationships; mapping specifies transforms. These maps are stored in XML documents (FME, Pervasive, etc.). As with metadata, it can be useful to store these in an RDB as well.

Schema Matching: Map & Gap
ELEMENT    DEFINITION                            EXAMPLE                 LINGUISTICS
Equal      A is equal to B                       First Name              Synonym
Subtype    A is a more specific subtype of B     Supervisor - Employee   Hyponym
Supertype  A is a more general supertype of B    Employee - Supervisor   Hypernym
Part       A is a part of B                      Employee - Department   Meronym
Container  A is a container of B                 Department - Employee   Holonym
Related    A is related to B                     Department - Operation
• A gap analysis exercise can precede a full mapping to get a general idea of how two datasets relate.
• Gap analysis is always in a direction, from one set of data elements to another.
• Simple scores can be added to give an overall metric of how things relate.

Sidebar: Schema Matching Entities Is Also Important
Matching entities must account for multiple hierarchical feature sets, node-and-edge networks, multiple geometric representations (e.g., polygon vs. centroid), multiple locations, and multiple occurrences.
Well-documented schema matching information, as metadata, helps reduce and/or eliminate confusion for integration developers and end users.

Mechanics of ETL: The Heart of the Matter
• Change Name, Change Definition, Add Definition, Change Type, Change Length, Trim, Use List Domain, Use Range Domain, Split, Merge, Reformat, Use a Default, Create View
• Change Case, Add Primary Key, Add Foreign Key, Add Constraints (use the RDB), Split Table to N->1 (pivot), Merge Tables, Remove Duplicates, Remove 1-to-1 Tables, Fill in Missing Values, Remove Empty Fields, Remove Redundant Fields, Verify with a 2nd source

Use a Staging Data Store, Separate MR
• Keep a definitive snapshot copy of the source; don't change it.
• Execute ETL in a staging data store. Expect that multiple iterations and temporarily relaxed data types will be necessary.
• Don't mix actual data with metadata repository (MR) information; keep separate databases.
• Build the final target dataset from the staging data store.

Choosing the Right Tool(s)
• SQL
• FME
• ESRI Model Builder
• Pervasive (formerly Data Junction)
• Microsoft SQL Server Integration Services
• Oracle Warehouse Builder
• Talend
• C#/VB.NET/OLE-DB
• Java/JDBC
• Scripts: VB, JS, Python, Perl
• Business Objects, Informatica
Make the best use of the skills you have on your team; some situations and teams are database-centric and some are code-centric. Use a combination of methods.

Sidebar: Use SQL Views to Simplify Read-Only Data
• SQL Views provide a flexible and easy mechanism for de-normalizing data coming from production databases.
• Views are typically what the end user in a database application sees. They hide the multi-table complexity lying underneath in the physical model.
• Think of the analysis database as a user that doesn't need or want to see this underlying complexity.
• This is a good approach for generating GIS parcel datasets from CAMA property records.
• You can instantiate views as hard tables, updated on a regular basis with batch SQL.
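For example, a single view can flatten the three normalized tables from the de-normalization example back into one read-only result. A sketch against that hypothetical FACILITIES / FACILITY_TYPES / LOCATIONS schema:

    -- De-normalizing view: end users query FACILITY_VIEW and never see the joins.
    CREATE VIEW FACILITY_VIEW AS
    SELECT F.ID,
           FT.TYPE     AS FACILITY_TYPE,
           L.LOCATION
    FROM (FACILITIES AS F
          INNER JOIN FACILITY_TYPES AS FT ON F.TYPE = FT.ID)
          INNER JOIN LOCATIONS AS L ON F.LOCATIONID = L.ID;

Instantiating the view as a hard table (e.g., with SELECT ... INTO), refreshed on a schedule with batch SQL, adds index-backed performance to the same simple interface.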
ETL: SQL Data Manipulation (Access/Jet-style expressions)
• left(Address, instr(Address, ' ')) as AddressNum
• right(Address, len(Address)-instr(Address, ' ')) as StreetName
• left(POCName, instr(POCName, ' ')) as FirstName
• right(POCName, len(POCName)-instr(POCName, ' ')) as LastName
• int(numEmployees) as NumPersonnel
• '(' & [Area Code] & ')' & '-' & Tel as Telephone
• iif(instr(Tel, 'x')>0, right(Tel, len(Tel)-instr(Tel, 'x')), null) as TelExtension
• ucase(Town) as City
• iif(len(Zip)=5, null, right(Zip,4)) AS Zip4
• iif(len(Zip)=5, Zip, left(Zip,5)) AS Zip5

ETL: Look Up Tables
• Clean and consistent domains are one of the most important things you can use to help improve data quality.
• As an example, consider the use of a single central master street list for a state, town, utility, etc.
• One approach is to collect all of the variations in a single lookup table and match them with the appropriate value from the master table:
    Original      Master
    Main St.      MAIN ST
    Elm St.       ELM ST
    ELM STREET    ELM ST
    Main Street   MAIN ST
    NORTH STREET  NORTH ST
    North         NORTH ST

ETL: Domain and List Cleaning Tools
• There are powerful tools available to help match variation and master values.
• Example: SQL Server Integration Services Fuzzy Lookup and Fuzzy Grouping tools: http://msdn.microsoft.com/en-us/library/ms137786.aspx
• These can be used to create easily reusable batch processing tools.
• These list cleansing tools are often unfamiliar to geospatial data teams.
• The saved match lists are also valuable for processing new data sets.

ETL: Regular Expressions
• Regular expressions provide a powerful means for validation and ETL, through sophisticated matching and replacement routines.
• Example: the original Notes field has line-feed control characters. We want to replace them with spaces.
• Solution: match "[\x00-\x1f]", replace with " ".
• Code is needed, in C#, VB, Python, etc., e.g. (C#):
    newVal = Regex.Replace(origVal, regExpMatch, regExpReplace);

ETL: Regular Expressions (cont.)
RegExp                                 Name
^[1-9]{3}-\d{4}$                       ShortPhoneNumber
^[1-9]{3}-[1-9]{3}-\d{4}$              LongPhoneNumber
^[0-9]+$                               PosInteger
^[-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$    Double
(^[0-9]+$)|(^[a-z]+$)                  AllLettersOrAllNumbers
Start and grow a list of regular expression match and replace values; keep these in the metadata repository. Code is needed, in C#, VB, Python, etc.

ETL: Custom Code Is Sometimes Required
Sometimes there is no simple SQL solution: the problem is more complicated than a simple name, type, or conditional value change. In that case, use your preferred programming language and write custom ETL.

ETL: Extending SQL
• Most dialects of SQL (Oracle, SQL Server, Access, etc.) allow you to develop custom functions in a language like C# or Java.
• For example, you can build a RegExpMatchAndReplace function and then use it in SQL.
• You can also add table lookup, scripting, etc.
• This is a very powerful and flexible approach.
• Illustrative example:
    UPDATE FACILITIES
    SET NOTES = REGEXP("[\x00-\x1f]", " "),
        FACILITY_TYPE = LOOKUP(THISVAL)
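Whether implemented through a custom LOOKUP function as sketched above or in plain SQL, the saved match list drives a simple batch update. A minimal sketch, assuming a hypothetical staging copy of the facility table and a StreetLookup match table with Original and Master columns like the one shown earlier:

    -- Standardize street names in staging using the saved match list.
    UPDATE STAGING_FACILITIES
    SET StreetName = (SELECT Master FROM StreetLookup
                      WHERE StreetLookup.Original = STAGING_FACILITIES.StreetName)
    WHERE StreetName IN (SELECT Original FROM StreetLookup);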
• This is particularly valuable for creating hierarchical roll-up/drill-down aggregation datasets.
• Use the ArcToolbox Spatial Join.
• Example result after the join:
    Zip5   Zip4
    01234  1234
    01234  5678
    01234  6789
    01235  3456
    01235  3456
    01235  0123

ETL: Breaking Out N to 1
• This problem occurs very frequently when cleaning up datasets.
• We have repeating columns used to capture annual facility inspections.
• This data should be pivoted and moved to another, child table.
• We can use SQL and the UNION capability to get what we want, as shown below.
• We can also reformat for consistency at the same time.

ETL: Use of UNION to Pivot and Break Out
    SELECT Id as FacilityId, 2000 as Year,
           iif(ucase(Insp2000) = 'Y' or ucase(Insp2000) = 'T', 'Yes', Insp2000) as Inspected
    FROM AcmeFac
    UNION
    SELECT Id as FacilityId, 2005 as Year,
           iif(ucase(Inspect05) = 'Y' or ucase(Inspect05) = 'T', 'Yes', Inspect05) as Inspected
    FROM AcmeFac
    UNION
    SELECT Id as FacilityId, 2010 as Year,
           iif(ucase(Insp_2010) = 'Y' or ucase(Insp_2010) = 'T', 'Yes', Insp_2010) as Inspected
    FROM AcmeFac
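To finish the break-out, the pivoted records are materialized as a 1->N child table keyed back to the parent. A sketch in the same Access-style dialect, assuming the UNION statement above has been saved as a query named InspectionPivot (a hypothetical name):

    -- Materialize the pivoted inspection records as a child table.
    SELECT FacilityId, Year, Inspected
    INTO FacilityInspections
    FROM InspectionPivot;
    -- Then drop Insp2000/Inspect05/Insp_2010 from AcmeFac and relate
    -- FacilityInspections.FacilityId back to AcmeFac.Id (1->N).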
ETL: One-Off vs. Repeatable Transforms
• Relying on manual tweaks and adjustments to complete and fill in correct data values is problematic if you need to set up repeatable ETL processes.
• It's much harder and more complicated to set up repeatable, ordered ETL routines, but well worth it if the data is going to be updated on an on-going basis.
• Example: a dataset is cleaned with lots of SQL, scripts, manual tweaks, etc. When a newer dataset is made available, the same tasks need to be repeated, but the details and the order in which they were performed were not saved.
• Suggestion: Be very aware of whether you are doing ETL as a one-off vs. something that will have to be repeated, and plan accordingly. Save your work.

Bite the Bullet: Manual ETL
• Sometimes the data problems are so tricky that you decide to do a manual clean-up.
• You could probably come up with code containing a very large number of conditional tests and solutions (a brain-teaser), but it would take longer than just cleaning the data by hand.
• It depends on whether you are doing a one-off or need to build something for repeatable import.
• Example of a Site Plan Location column that may be easier to fix by hand: "rm 203 drawer A", "112 B", "Room 100 dr. 1", "Bld A Rm 209 Drawer 11", "rm 500 d 59", "Jones Rm G12 Bld 9", "Room 100 Drawer 10", ...
• This also applies to positioning geospatial features against a base map or ortho-image, e.g., after geocoding, etc., for populating x,y attributes.

Checking Against External Sources
• One of the only ways to actually ensure that a list of feature values is complete is by checking against an external source.
• In this case, the data, in and of itself, does not necessarily provide a definitive version of the truth: you cannot tell what may be missing or what may be incorrectly included in the list.
• Get redundant external information whenever it's available.
• In some cases the only way to fill in a missing value is to contact the original source of the data. This can be highly granular and time consuming.
• You need to decide how important it is to have 100% complete data; this can be a case of diminishing returns.

Storing ETL Details in the Metadata Repository
TargetEntity  TargetAttribute  ETLType    Params               SourceAttribute
Facility      AreaCode         SQLSelect  left(PhoneNumber,3)
Facility      StreetName       LookUp     StreetList           Address
Facility      Notes            RegExp     RemoveControlChars
Facility      SizeSqFeet       Script     CalcFacilityArea
Facility      Zip9             SQLSelect  Zip5 & '-' & Zip4
Facility      FacilityType     LookUp     FacilityType         FacilityCategory
Facility      FacilityName     RegExp     RemoveMultiSpace
Facility      SizeAcres        Script     CalcFacilityArea
• The combination of the MR and code-based tools provides a very flexible and powerful environment for improving data quality.
• Many actions and parameters can be stored in the MR, including lookups, regular expressions, SQL SELECT clauses, and even runtime evaluation routines for code and scripts.
• Example: a translator makes use of the configurable ETL map and action table found in the metadata repository.

ETL: Staging Table Design and SQL Command Types
• Separate Source and Target tables require joins (INSERT, UPDATE, SELECT).
• You can merge Source and Target into a single staging table: harder to create, and column names must be unique, but no joins are needed (UPDATE, then SELECT out).
• Decide which kind of complexity is preferred.
• You can also split attributes and geometry (a size factor) and use keys: an initial insert, then updates, then recombine.
• Build a single large SQL statement for creating a view or hard table from the results.

ETL: Loop Design: Row, Column, Value Processing Order
• Row-wise: SQL errors fail for the entire row, with no values changed (UPDATE, INSERT, SELECT INTO).
• Column-wise: a single UPDATE statement against each column in the table; fast performance, making use of the database engine.
• Cell-wise: an UPDATE handles each row.column value individually; high granularity and control, slower performance.

Finding Errors and Validation
• Run a post-ETL profile to see before-and-after data.
• A contents checker, which needs names, looks at actual values and checks them against list domains, range domains, and regular expressions.
• Output describing data problems is written to an audit table and is used to help find and fix errors. The audit info has table, lookup key, column name, problem rule, and value:
    Table       Id     Column        Rule             Value
    Facilities  101    Name          Not Null
    Facilities  25543  SubType       Domain: MyList   Shipin
    Facilities  563    NumPersonnel  Range: 1 - 1000  -500
    Facilities  223    Inspected     RegExp: Date     0/930/1955
Question: How do you validate data after it has been transformed?
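Each validation rule can be expressed as a plain SQL INSERT into the audit table. A minimal sketch for the list-domain and range rules shown above, assuming hypothetical AuditLog and MyList tables:

    -- Rule: SubType must come from the MyList domain.
    INSERT INTO AuditLog (TableName, Id, ColumnName, Rule, Value)
    SELECT 'Facilities', Id, 'SubType', 'Domain: MyList', SubType
    FROM Facilities
    WHERE SubType NOT IN (SELECT Value FROM MyList);

    -- Rule: NumPersonnel must fall in the range 1-1000.
    INSERT INTO AuditLog (TableName, Id, ColumnName, Rule, Value)
    SELECT 'Facilities', Id, 'NumPersonnel', 'Range: 1 - 1000', NumPersonnel
    FROM Facilities
    WHERE NumPersonnel < 1 OR NumPersonnel > 1000;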
Overall Integration Operations Tracking
• It is very helpful to record and document all of the different operations performed against the data.
• In general, record who did what, when, how, and why, plus any other detailed information.
• You need to track both manual and automatic processing: tools can automatically record their activities in a tracking table, but it is more difficult to track people's activities.
    Action      Who  Date       Notes
    Collection  JS   3/24/2008
    Profiling   BH   2/30/2007  Did not complete, need new data
    Profiling   RG   7/7/2008   Completed
    Mapping     PB   6/30/2008  Dataset incomplete
Question: How do you handle tracking overall operations on data?

Repeat Data Updates and/or New Data
• A good test of how well the data integration operations have succeeded is to process a new, more up-to-date dataset destined to replace existing data in the system.
• This is where keeping track of what was done to the data, and in what order, is so important.
• Validate structure and contents. Changes in structure or content require a re-tooling of the mapping information.
• Consider having to include a new and different data source next, e.g., merging organizations combining facility data.
Question: How do you handle refreshing data from a new set of update data?

Agile vs. Waterfall Workflows
ITERATIVE/EVOLUTIONARY: Perform each step with a subset of material and functionality. Loop back and continue with lessons learned and more information. Divide and conquer. It is like steering a car: you can't fix your position at the onset; you need to adjust as unforeseen conditions are encountered. There is an analogy to sculpting. Do something useful in a reasonable amount of time — faster benefits for end users.
WATERFALL: Perform each step completely before moving to the next.

Use Clear Communication Artifacts
GIS users (layers, attributes, symbols, ...), data modelers, and standards bodies (UML, XSD, GML, ISO 19XXX, ...) can end up in a Tower of Babel. Use table-centric documents and models, e.g., RDB, Excel, HTML, to communicate with end users and stakeholders in addition to UML, XSD, etc. It is not an either/or — we need both.

Data Dictionary and Metadata
• When you're done cleaning up the data, make sure you fully describe the meaning, structure, and contents metadata in a data dictionary.
• Must-haves: who, what, when, where, why, accuracy, and attribute definitions!
• Present the data dictionary in an easy-to-use tabular format, e.g., Excel, HTML.
• Ideally the metadata should live in the database alongside the data it is describing.
• Separating metadata from data, and using fundamentally different physical formats and structures, leads to serious synchronization problems.
• Data providers are encouraged to produce FGDC/ISO geospatial metadata, as a report from the repository.

Example Census Data Dictionary
[Screenshot: a Census data dictionary rendered as HTML in a browser.]
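When the metadata lives in the repository alongside the data, a tabular data dictionary like the one above can be generated as a simple report query. A sketch, assuming hypothetical Entities and Attributes tables in the metadata repository:

    -- Tabular data dictionary, ready for export to Excel or HTML.
    SELECT E.EntityName AS Layer,
           A.AttributeName,
           A.DataType,
           A.MaxLength,
           A.Definition,
           A.DomainName
    FROM Entities AS E
    INNER JOIN Attributes AS A ON A.EntityId = E.EntityId
    ORDER BY E.EntityName, A.AttributeName;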
Sidebar: FGDC/ISO XML Metadata in an RDB
XML metadata can be imported into an RDB and exported back out again. Once imported, the full flexibility of SQL is available for very flexible management and querying of a large collection of metadata as a set. Typical FGDC elements captured include:
1 Originator, 2 Publication_Date, 3 Title, 4 Abstract, 5 Purpose, 6 Calendar_Date, 7 Currentness_Reference, 8 Progress, 9 Maintenance_and_Update_Frequency, 10 West_Bounding_Coordinate, 11 East_Bounding_Coordinate, 12 North_Bounding_Coordinate, 13 South_Bounding_Coordinate, 14 Theme_Keyword_Thesaurus, 15 Theme_Keyword, 16 Access_Constraints, 17 Metadata_Date, 18 Contact_Person, 19 Address_Type, 20 Address, 21 City, 22 State_or_Province, 23 Postal_Code, 24 Contact_Voice_Telephone, 25 Metadata_Standard_Name, 26 Metadata_Standard_Version

Sidebar: XML Metadata After Import into RDB — Hierarchy Preserved
Each XML node becomes a row with its name, value, parent, and full lineage path, so the document hierarchy is preserved. Sample rows (from a NOAA nautical NAVAIDs record; ParentId, ParentName, and LineageId columns omitted here for readability):
    Name      NodeValue                              LineageName
    origin    NOAA                                   metadata.idinfo.citation.citeinfo.origin
    pubdate   05/21/2004                             metadata.idinfo.citation.citeinfo.pubdate
    title     NauticalNAVAIDS                        metadata.idinfo.citation.citeinfo.title
    geoform   vector digital data                    metadata.idinfo.citation.citeinfo.geoform
    abstract  Nautical Navigational Aids: US waters  metadata.idinfo.descript.abstract
    purpose   Homeland Security                      metadata.idinfo.descript.purpose
    caldate   20040528                               metadata.idinfo.timeperd.timeinfo.sngdate.caldate
    progress  In work                                metadata.idinfo.status.progress
    update    Unknown                                metadata.idinfo.status.update
    westbc    -178.047715                            metadata.idinfo.spdom.bounding.westbc
    eastbc    174.060288                             metadata.idinfo.spdom.bounding.eastbc
    northbc   65.634111                              metadata.idinfo.spdom.bounding.northbc
    southbc   17.650000                              metadata.idinfo.spdom.bounding.southbc
    themekey  Nautical Navigational Aids             metadata.idinfo.keywords.theme.themekey
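Once a collection of metadata records has been imported this way, set-based questions across all layers become short queries. A sketch, assuming the node rows above are stored in a hypothetical MetadataNodes table with a DatasetId column added to distinguish records:

    -- List every dataset whose metadata record has a title but no abstract.
    SELECT DatasetId, NodeValue AS Title
    FROM MetadataNodes
    WHERE LineageName = 'metadata.idinfo.citation.citeinfo.title'
      AND DatasetId NOT IN (
          SELECT DatasetId
          FROM MetadataNodes
          WHERE LineageName = 'metadata.idinfo.descript.abstract');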
Meta-Map for Data QA/QC
Map metadata to summarize and highlight datasets by validation metadata.

Applications: Business/Geo-Intelligence Pivot Tables/Maps
View and analyze both data and metadata: quality, completeness, etc. Business intelligence data exploration/viewing solutions make heavy use of pivot tables, drill-down, and drill-through (e.g., State -> County -> Town -> Census Tract). With a data warehousing approach, geospatial intelligence solutions can use a similar approach, with maps.

Use a Data Integration Repository Database
The data integration repository holds, for the data layers being integrated: metadata, documents, assessments, data dictionary content, ETL details, entities, attributes, domains, validation results, and schemas. Implemented as an RDBMS, it can be populated by both manual and automated methods and then used to generate metadata outputs, data dictionary content, schemas, enhanced user views, pivot tables, meta-maps, and derivative datasets.

Data Quality Knowledge, Tools and Techniques
There is a wide variety of highly developed, metadata-repository-centric data quality knowledge, refactoring, tools, and techniques available in mainstream IT data warehousing to make use of in helping to improve geospatial datasets.

Sidebar: Physical Formats and Simplicity
• If data is in table format, CSV can be much easier to work with than XML, and can be 1/10th the size.
• Look at newer, smaller, less-verbose data exchange formats, e.g., JSON: http://www.json.org/
• XML is best suited for variable-length data structures and nesting, e.g., object models.
• Consider RESTful web services vs. SOAP.
• Keep it simple.

Recommendation: Use a Broader Array of Mainstream IT Tools and Techniques to Solve Data Quality Problems (Look Outside of the Geospatial World)
• Decision support, data warehousing
• RDBMS metadata repositories
• Data profiling, refactoring
• Business intelligence, ETL, OLAP cubes
• Structured vs. unstructured data access, Semantic Web
• Flexibility through standard RDBMS logical/physical separation and the use of views
• Agile solution development
• The data quality paradigm
• Lean manufacturing

Recap: Take-Aways
• Data quality is determined by end users' understanding of meaning, structure, and contents.
• Look at data quality and data integration tools and techniques from mainstream IT, data warehousing, and business intelligence.
• CLEARLY DISTINGUISH BETWEEN DECISION-SUPPORT AND PRODUCTION DATABASE MODELS — DON'T USE A HIGHLY NORMALIZED SCHEMA FOR DECISION SUPPORT DATABASES!!! MAKE THE DB EASY AND FAST TO QUERY.
• Use database profiling and refactoring approaches.
• Use a relational database (the metadata repository, MR) to capture, centralize, and manage data quality and integration information.
• Distinguish one-offs from on-going updates, and build repeatable ETL processing when necessary.
• SQL, coding skills, and regular expressions for data manipulation are all important. Choose tools that leverage the skills you have on hand, preferred languages/scripts, etc.
• Communicate clearly with stakeholders, end users, and your team, with table-centric artifacts.
• Don't ignore data meaning as an important element of data quality: build data dictionaries (metadata), use clear definitions, and don't make up words.
• Save ETL mappings, work, notes, scripts, etc., to help grow and reuse skills (in the MR).
• Use an iterative, Agile approach to help ensure you reach goals in a timely manner.

Related Papers and Tools
• Database Template Kit: Municipal Permit Tracking System, http://www.mass.gov (search for MPTS) — lots of SQL data cleansing info
• NEARC Fall 09 presentation: How Data Profiling Can Help Clean and Standardize GIS Datasets
• NEARC Spring 10 presentation: Using Meta-Layers for GIS Dataset Tracking and Management, http://gis.amherstma.gov/data/SpringNearc2010/C_Session3_01150245/GovPlanning/KeepingTrackOfAllThoseLayers.pdf

Thank You
Questions and Answers
www.scribekey.com