Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mehmet Orun, MBA, CDMP Reference Data Management is critical to success of Master Data Management, Data Warehousing, or SOA integration. Traditional RDM efforts by data architects are limited to Data Domain values in data models and mapping tables for integration (ETL, ESB…) developers This presentation provides an enterprise-wide view of what reference data is, why it matters, and how it can be managed, to increase the practical knowledge of the data professional. Mehmet Orun is the Sr. Manager for Data Architecture & Analysis as well as solution owner for Master Data Management at Genentech, where he is responsible for data services rollout and associated processes, technology roadmaps and methodologies. Mehmet’s interest and expertise includes data and meta data driven solutions to bridge the gap across applications, processes, and structured and unstructured data sources. He has master level CDMP certification in Data Management and Data & Information Quality and is a Senior Member of the ACM. Mehmet serves on the board of San Francisco DAMA Chapter Mehmet Orun is also the primary contributor to DAMA-DMBOK’s “Reference & Master Data Mngt” chapter. Some of the frameworks in this presentation are based on this content. What is Reference Data A Definition Some Examples The complexity Reference Data in the Enterprise RDM Function Conceptual Framework Reference Data Types Use/Design Patterns & Implications RDM Functions (from DMBOK) Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise. (Malcolm Chisholm, TDAN, 10/1/2001) Wikipedia suggests: Reference data are data describing a physical or virtual object and its properties, usually described with nouns. A special type of reference data is master reference data - these are reference data shared over a number of systems. The above mixes Reference Data Management, MDM and Data Standards (noted as many refer to Wikipedia as the authoritative source) State Codes [Postal] State Codes [-USA]: CA, MO, AL, … Hair Color: Brown, Blonde, … Diseases Cancer, Ulcer, … What does a code mean? CA: Canada or California? What is an authoritative source? USPS for State codes? ISO 3166 for country codes and if so ISO 3166-2 for state codes? What is an acceptable list of values? Hair color: Black, Brown, Blond, Red What about Chestnut? What about Pink, or Purple? What about Pink and Purple? What is the right level of detail “to categorize other data”? Cancer Breast Cancer 174.9, Female breast, unspecified (Ref: ICD9) What do we mean by a term anyway? Cancer? Cancer cell photo from alternative-cancer.net Crab family taxonomy from NCBI What are there reference data elements here? Trip Type Airport Code & Names Traveler Type Class of Service Fare Type Flags? Other? If you don’t know what something means, how do you get meaning out of it? If data is in different levels of detail from different sources, how do you maintain richness without sacrificing accuracy? How do you relate one set of categories to another, without standards? The rest of this presentation will describe different types of Reference Data and their use of Reference Data in the enterprise Enterprise Context and Reference Data Types Reference Data Management is control over defined domain values (also known as vocabularies), including control over standardized terms, code values, and other unique identifiers, business definitions for each value, business relationships within and across domain value lists, and the consistent, shared use of accurate, timely, and relevant reference data values to classify and categorize data. (DAMA DMBOK, Chapter 8) If you do not [somehow] manage your reference data, broader data integration and master data reconciliation will be challenging at best ANSI/NISO Z39.19-2005 is the authoritative standard and guideline for reference data/vocabulary management Information Integration • Code mapping rules • Single source for ESB, ETL, EII Data Governance & Quality • Manage terms, names, definitions • Code lists for quality rules Master Data Management • Use capability for Information Integration, Data Governance, and Quality Business Intelligence Knowledge Management • Hierarchies for drill down or data mining analytics Reference Data Management Solution (Vocabulary, Vocabulary View, Term-Relationship Management, Publication and Translation Services) • Import or link attributes and definitions as part of vocabulary meta data Data Models • Code lists/vocabularies and term relationships Code List and Vocabularies • Taxonomy or Ontology for Search and Navigation • Content Tagging What are there reference data elements here? San Francisco the city vs. a pointer to a metropolitan area SFO or San Francisco as airport indicators Hotel names with many synonyms … Simple to Complex Increasing complexity List Flat List Taxonomy Hierarchical •Parent / Child •Relationship between parent and a child can be relatively underspecified or ill defined. Thesaurus Ontology Equivalence - Synonym Homographic-Spelled the same Classes (general things) many different meaning – “Tank” Hierarchical-Broader than/Narrower Than Associated-Associated A thesaurus is typically used to associate the rough meaning of a term to the rough meaning of another term. domains of interest Instances (particular things) Relationships among those things Properties (values) of those things Functions of and processes involving those things Constraints on and rules involving those things. Simple Value Lists Defined Value Lists Taxonomic Reference Data Ontological Reference Data Relationship Types Computer Manufacturers H International Business Machines E ? Hardware E IBM H ? Software Group A Big Blue ? A Software E Equivalence H Hierarchical A Associative Equivalent Term Relationship: A relationship between or among terms in a controlled vocabulary that leads to one or more terms that are to be used instead of the term from which the cross-reference is made. (Z39.19) The relationship indicator for this type is ET. Hierarchical Relationship: A relationship between or among terms in a controlled vocabulary that depicts broader (generic) to narrower (specific) or whole-part relationships; begins with the words broader term (BT), or narrower term (NT). (Z39.19) The relationship indicator for this type is HT. Related Term Relationship: A term that is associatively but not hierarchically linked to another term in a controlled vocabulary. In thesauri, the relationship indicator for this type of term is RT. (Z39.19) Information Integration • Code mapping rules • Single source for ESB, ETL, EII Data Governance & Quality • Manage terms, names, definitions • Code lists for quality rules Master Data Management • Use capability for Information Integration, Data Governance, and Quality Business Intelligence Knowledge Management • Hierarchies for drill down or data mining analytics Reference Data Management Solution (Vocabulary, Vocabulary View, Term-Relationship Management, Publication and Translation Services) • Import or link attributes and definitions as part of vocabulary meta data Data Models • Code lists/vocabularies and term relationships Code List and Vocabularies • Taxonomy or Ontology for Search and Navigation • Content Tagging Implementation Patterns & Data Management Functions If selecting a single value stores it in a database row/cell, what happens when you select multiple values? How do you report on any one value? a. b. c. d. Select … where <field> = <value> Select … where <field> IN {value} Select … where <field> like ‘%value%’ Other? Who is an AV buf in the audience? Who knows their doctor well? Reference Data Management is control over defined domain values (also known as vocabularies) Vocabulary Management function of defining, sourcing, importing, and maintaining, … any given vocabulary Key questions: ▪ What information concepts will this vocabulary support? (which data attributes) ▪ Who is the audience? ▪ Why is the vocabulary needed? (app, BI , content management?) ▪ Who are the decision makers? ▪ What are the current vocabularies? ▪ Are there existing standards? Term: One or more words designating a concept Term Management Specifying how Terms are initially defined and classified and how this information is maintained once it starts Most significant effort is to identify terms that will constitute the standard Preferred Term: One of two or more synonyms selected as a term for inclusion in a controlled vocabulary Term Relationship Management How the terms relate Hierarchical Relationship depicts broader (generic) to narrower (specific) or whole-part relationships In data integration, if terms are inconsistent, map to broader relationship for accuracy (data detail loss will occur) In BI, consider allowing mixed granularity Related Term Relationship depitcs associatively without hierarchical relationships Term Change Due to deprecation or its metadata changing If definition is improved, no impact If value/data type… changes, assess impact Term Relationship Change If relationships are changing, integration or aggregation rules may be impacted Vocabulary Version Change Important especially when external standards are used Minor (addition/deprecation) vs. Major revisions Reference Data/Vocabularies/Business Semantics are pervasive in the enterprise. Broad implementations would likely take time, cost money, and not clearly understood Like many data management initiatives, pick an area with business value that can drive focus, develop a sustainable solution and perhaps obtain tools to help. Process and accountability is more important than technology Track value, share the success, and expand your scope