February 5, 2014
Stevens Award lecture, WCRE-CSMR 2014

Data matters most but where has all the semantics gone?
- A (sort of) spatio-temporal view of DB reverse engineering -

Jean-Luc Hainaut
University of Namur, Faculté d'informatique
PReCISE Research Center - Database Engineering Group
www.info.fundp.ac.be/libd

• Introduction
• Understanding data semantics
• Data models
• Tracing data semantics
• Recovering hidden data semantics
• Is data semantics recovery that important, actually?
• Summary and conclusions

Introduction

Objectives of the lecture
1. To study the concept of data semantics in business applications
2. To identify and evaluate the techniques used to represent data semantics
3. To observe how these techniques have evolved in time and in different cultures.
4. To discuss the methods used to recover the semantics lost when poor representation techniques have been used.

The role of data in business applications

Axioms on databases
1. The database is a picture of the application domain
   • Its schema is a model of the static structures of the domain
   • Its data describe the current state (or suite thereof) of the domain
2. The database is designed independently of the application programs
   The database is designed before the application programs
3. The database schema evolution translates the evolution of the functional requirements
4. The database is described by (at least) two schemas:
   • the conceptual schema: abstract, platform-independent formalism: ER model, conceptual UML class diagrams
   • the logical schema: concrete, platform-dependent formalism: SQL2, Java classes
   There exists a bidirectional mapping between both.

The role of data in business applications

Meta-axioms on axioms on databases
1. The axioms often are ignored by developers
   - ignore = how interesting! I didn't know them
   - ignore = I know them but they do not suit my way of working
3. The biggest violation of the axioms concerns the existence and role of the conceptual schema

Understanding data semantics
Experimental approach and first conclusions

Preliminary question
Same data, different structures

CUSTOMER
CustID  Name    City
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid

T
C1      C2      C3
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid

T
C
C400 Darwen London
B512 Owens NY
S144 Garcia Madrid

To what extent does each of these data sets express the semantics of data?

Motivating example 1. Reading data from a COBOL file (1970)

External file:
REC
RKEY    RINFO
C400    Darwen London
B512    Owens NY
S144    Garcia Madrid

File description:
SELECT FILE1 ASSIGN TO "FILE1.DAT"
    ORGANIZATION IS INDEXED
    ACCESS MODE IS DYNAMIC
    RECORD KEY IS RKEY.
FD FILE1.
01 REC.
   02 RKEY  PIC X(12).
   02 RINFO PIC X(100).

Application code (COBOL):
WORKING-STORAGE SECTION.
01 CUSTOMER.
   02 CustID PIC X(12).
   02 Name   PIC X(60).
   02 City   PIC X(40).

READ FILE1 INTO CUSTOMER.

Result in CUSTOMER: CustID = B512, Name = Owens, City = NY

Where has data semantics been defined?
• In file description (10%) - [unique key, key data type]
• In application code (93%)

Motivating example 2. Reading data from an RDB (1980+)

Relational DB:
create table CUSTOMER(
  CustID char(12) not null,
  Name   char(60) not null,
  City   char(40) not null,
  primary key (CustID));

CUSTOMER
CustID  Name    City
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid

Application code (C):
string v1;
string v2;
string v3;
select * into v1, v2, v3 from CUSTOMER where CustID = 'B512'

Result: v1 = B512, v2 = Owens, v3 = NY

Where has data semantics been defined?
• In DB schema (100%)
• In application code (3%) - [data type]

What does data semantics mean?
A tentative practical definition
Data semantics is the knowledge defined by all the non-technical, domain-dependent information that allows us to understand, to use and to manage the data.

Where can we find traces of data semantics?
• in the application code (reading from file)
• in the DB schema (reading from DB)

A first (trivial) observation
It is best to express data semantics in the database schema
1. Expressiveness: DDL is the most appropriate language to declare data structures and constraints
2. Language independence: DDL is independent of application programming languages
3. Uniqueness: the schema is unique and centralized
4. Integration with data: the schema is a part of the database (no risk of losing it!)
5. Program independence: the schema is independent of application programs
6. Stability: the schema must be changed only when the application domain evolves.

However, things are not always that simple (e.g., COBOL files)
Only data structures are explicit in application programs:
• record name
• field name
• field data type
Additional constraints generally are controlled by the application code:
• where?
• in which way?
• in all the modules processing the data?
Understanding data semantics by analyzing the program code can be much more complex than expected.

However, things are not always that simple (e.g., RDB)
Only standard integrity constraints can be coded through the DDL (SQL2):
• not null
• uniqueness
• referential integrity
Additional constraints must be coded through generic means:
• check predicates
• triggers
• stored procedures
Understanding data semantics by reading the database schema can be less easy than expected.

Data models

Data models: abstraction hierarchy
Reminder on the database design process - The standard view
User requirements
  -> Information analysis -> Conceptual schema
  -> Logical design -> Logical (RDB) schema
  -> Physical design -> Physical (DB2) schema
  -> Coding -> SQL-DDL code
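The RDB observation above (standard constraints are declared in the DDL, additional ones need check predicates or triggers) can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite through Python's sqlite3 module rather than any DBMS named in the lecture, and the CUSTORDER table, its Amount column, the "at most two orders per customer" rule and the helper names are all hypothetical.

```python
import sqlite3

def build_db() -> sqlite3.Connection:
    """Create a small in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        create table CUSTOMER(
            CustID char(12) not null,
            Name   char(60) not null,
            City   char(40) not null,
            primary key (CustID));

        -- Standard constraints: not null, uniqueness, referential integrity.
        -- The check predicate is one of the 'generic means'.
        create table CUSTORDER(
            OrdID  char(12) not null,
            CustID char(12) not null,
            Amount integer  not null check (Amount > 0),
            primary key (OrdID),
            foreign key (CustID) references CUSTOMER(CustID));

        -- A rule with no declarative form: at most two orders per customer.
        create trigger ORDER_LIMIT before insert on CUSTORDER
        begin
            select raise(abort, 'too many orders for this customer')
            where (select count(*) from CUSTORDER
                   where CustID = new.CustID) >= 2;
        end;
    """)
    conn.execute("insert into CUSTOMER values ('B512', 'Owens', 'NY')")
    return conn

def try_insert(conn: sqlite3.Connection, sql: str) -> str:
    """Run one insert and report whether the DBMS accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.DatabaseError as exc:
        return f"rejected: {exc}"
```

Someone reading this schema sees the first three rules at a glance, but has to read the trigger body to discover the fourth, which is the sense in which reading the schema "can be less easy than expected".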
Data semantics and data models
The way data semantics is expressed in a database depends on its data model

Conceptual models
• ER (*)
• UML class diagrams
Logical models
• Record-oriented models:
  • files
  • legacy DBMS (IMS, CODASYL)
  • RDB (*)
• Key-Value models:
  • NoSQL (*)
  • CSV
• Structured object models:
  • OO
  • NoSQL
  • Json (*)
  • XML

ER conceptual model
Abstract, platform-independent information description
The world is perceived as:
- sets of entities,
- properties that characterize entities,
- relationships holding between entities
A conceptual schema can be translated into several logical, DBMS-dependent, schemas

CUSTOMER (CustID, Name, City; id: CustID)
  0-N  place  1-1
ORDER (OrdID, DateOrd, Account; id: OrdID)

Relational data model (schema-based, 1NF)

metadata:  CustID  Name    City    Account
data:      C400    Darwen  London  -124
           B512    Owens   NY      5509
           S144    Garcia  Madrid  0

• Domain-dependent schema
• Schema and data are hierarchically distinct
• Values are aggregated into rows
• The semantics is explicit in the schema (part of!)
• The semantics is managed/controlled by the DBMS
Examples: Oracle, DB2, SQL Server, MySQL, PostgreSQL, etc.
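To make "schema and data are hierarchically distinct" concrete, here is a small sketch that reads the metadata and the data of the CUSTOMER table through two separate channels. It assumes SQLite via Python's sqlite3 module (not one of the DBMS listed above); the integer type chosen for Account and the helper names are assumptions, since the slide does not give them.

```python
import sqlite3

def make_customer_db() -> sqlite3.Connection:
    """Build the CUSTOMER table of the slide in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        create table CUSTOMER(
            CustID  char(12) not null,
            Name    char(60) not null,
            City    char(40) not null,
            Account integer  not null,   -- type assumed for illustration
            primary key (CustID))""")
    conn.executemany("insert into CUSTOMER values (?, ?, ?, ?)", [
        ("C400", "Darwen", "London", -124),
        ("B512", "Owens", "NY", 5509),
        ("S144", "Garcia", "Madrid", 0),
    ])
    return conn

def read_metadata(conn: sqlite3.Connection, table: str):
    """Return (column name, declared type, not-null flag, key flag) per column.

    PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    """
    return [(name, ctype, bool(notnull), bool(pk))
            for _cid, name, ctype, notnull, _default, pk
            in conn.execute(f"PRAGMA table_info({table})")]

def read_data(conn: sqlite3.Connection):
    """Return the rows only; no column names or constraints travel with them."""
    return conn.execute("select * from CUSTOMER order by CustID").fetchall()
```

The metadata comes from the catalog and the rows come from the table; neither call needs the other, and the DBMS enforces the declared constraints whichever program writes the rows.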
Key-Value data model (schema-less, triples, 1NF)

meta-metadata:       ENTITY  ATTRIBUTE  VALUE
metadata and data:   90317   CustID     C400
                     90317   Name       Darwen
                     90317   City       London
                     90317   Account    -124
                     59731   CustID     B512
                     59731   Name       Owens
                     59731   City       NY
                     59731   Account    5509
                     66830   CustID     S144
                     66830   Name       Garcia
                     66830   City       Madrid
                     66830   Account    0

• Domain-independent schema
• Metadata mixed with data
• Elementary Key-Value
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
Examples: Oracle NoSQL, BerkeleyDB, Voldemort, Riak, Redis

Structured object data models (schema-less, NF2)

meta-metadata:       ENTITY  ATTRIBUTES
metadata and data:   90317   {"CustID": "C400", "Name": "Darwen", "City": "London", "Account": 124}
                     59731   {"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}
                     66830   {"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}

• Domain-independent schema
• Metadata mixed with data
• Key-Value pairs aggregated into objects (here in Json)
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
Examples: CouchDB, MongoDB (BSON), SimpleDB

Tracing data semantics
In the real world, where is semantics expressed?
We have identified two places: DB schema and application code.
Are there other places?
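Before moving on, the "metadata mixed with data" property of the key-value model above can be sketched in a few lines. In this illustration the application, not the DBMS, has to regroup the triples into records and deduce which attributes exist; the function names and the "mandatory if present in every entity" rule are assumptions made for the example, not something the lecture prescribes.

```python
from collections import defaultdict

# The triples of the key-value slide: (ENTITY, ATTRIBUTE, VALUE).
TRIPLES = [
    (90317, "CustID", "C400"), (90317, "Name", "Darwen"),
    (90317, "City", "London"), (90317, "Account", "-124"),
    (59731, "CustID", "B512"), (59731, "Name", "Owens"),
    (59731, "City", "NY"), (59731, "Account", "5509"),
    (66830, "CustID", "S144"), (66830, "Name", "Garcia"),
    (66830, "City", "Madrid"), (66830, "Account", "0"),
]

def to_records(triples):
    """Regroup the triples into one attribute/value dictionary per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in triples:
        records[entity][attribute] = value
    return dict(records)

def implicit_schema(records):
    """Map each attribute seen in the data to True if every entity has it.

    This is only a hypothesis drawn from the current data: the store itself
    declares nothing, so a later entity may contradict it.
    """
    counts = defaultdict(int)
    for record in records.values():
        for attribute in record:
            counts[attribute] += 1
    total = len(records)
    return {attribute: count == total for attribute, count in counts.items()}
```

Applied to the slide's data, every attribute comes out as apparently mandatory, but nothing prevents a program from storing an entity without a City tomorrow.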
Architectural framework
• Documentation (text, structured, ontology)
• User interface (data structure, labels, help, error messages)
• Application code (data structures, procedural code)
• Class schema
• Object/Relational mapping
• DB logical schema (global schema, views)
• Data

Data semantics in the documentation
Documentation (text, structured, ontology)
• Functional documentation (should include the conceptual schema)
• Technical documentation (should include the logical schema)
Drawback: the documentation often is
• obsolete,
• incomplete,
• inconsistent,
• missing

Semantics in the DB schema
DB logical schema (global logical schema, views)
The logical schema is DBMS-dependent.
It is a more or less faithful implementation of the conceptual schema.
Some views can be more detailed than the logical schema.
Drawbacks
• not a conceptual schema
• additional constraints not always trivial to identify and to understand

Semantics in the class schema
Class schema <-> DB logical schema: bidirectional relation/object transformation (T).
Solving the impedance mismatch problem.
The class schema is seen as the domain model. It is implemented into a relational database, which ensures object persistence.
The DB schema itself is hidden and may bear little semantics.
Drawbacks
• inappropriate formalism
• poor change propagation mechanism (if any)
• semantics in the application and not in the DB
• data model not easily shared by several applications

Semantics in the application code
Application code (data structures, procedural code)
Internal data structures may be more explicit than the DB schema.
Data integrity constraints checked by the application code.
Understanding data semantics from the way programs process the data.
However, program analysis is far from trivial:
• size (millions of LOC)
• architectural complexity
• algorithmic complexity
• data flow complexity
• creative data processing
Drawbacks
• redundancies (a constraint may be checked in many places)
• distributed traces (potential inconsistencies)

Semantics in the GUI
User interface (data structure, labels, help, error messages)
The UI often is a view on a part of the database. This view is intended for users and is user-friendly.
Provides useful hints about the constraints and meaning of data:
• data structure (data types, aggregates)
• explicit labels
• sample data
• informative help and error messages
Drawbacks
• distributed control (potential inconsistencies)
• does not cover all the database objects

Semantics in the data (record-oriented models)
In standard models
Data analysis: finding relationships among data
• uniqueness
• data types
• inclusion properties (foreign keys)
• etc.
Main strategy
• validating hypotheses

Semantics in the data (alternative models)
In alternative (schema-less) models
Metadata extraction
But also data analysis as in standard models
Experience
• none. Too new.

Recovering hidden data semantics: database reverse engineering

DB reverse engineering
Definition
Reverse engineering a piece of software consists, among other things, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs. Recovering these specifications is generally intended to redocument, convert, refactor, maintain or extend existing applications.
Database reverse engineering is that part of Information System Engineering that addresses the problems and techniques related to the recovery of the conceptual and logical schemas of files and databases of existing systems.

DB reverse engineering
DB reverse engineering methodology
Steps: Project planning, Pilot, Source management, Full project, Physical extraction, Logical extraction, Conceptualization
Products: Logical (RDB) schema, Conceptual schema

DB reverse engineering
DB reverse engineering methodology (detailed)
Steps: Project planning, Pilot, Source management, Full project, Physical extraction, Logical extraction, Conceptualization
• Logical extraction: schema analysis, data analysis, program analysis, class analysis, UI analysis, others
• Conceptualization: de-optimization, untranslation, normalization

Is data semantics recovery that important, actually?
Definitely, yes!
Can you prove it?
At least I can show you an example.

Example: database application migration
Porting a complete existing application, or some of its components, to another, generally more modern, platform.
For a database: changing its DMS.
A popular example: migrating the legacy set of files of a business application to an RDBMS.
Two main approaches:
• physical approach
• semantic approach

Physical database migration
The physical, or one-to-one, migration strategy is the cheapest but also the worst approach since it deeply degrades the final structure.
• Requires no knowledge of data semantics
• Very popular
COBOL code -> Physical extraction -> Physical (file) schema -> Transform -> Physical (DB2) schema -> Coding -> SQL-DDL code

Physical database migration: physical (one-to-one) migration

SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
   02 CUST-ID   PIC X(12).
   02 CUST-INFO PIC X(80).
   02 CUST-HIST PIC X(1000).
=
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: char (80)
  CUST-HIST: char (1000)
  id: CUST-ID

Create table CUSTOMER(
  CUST_ID   char(12)   not null,
  CUST_INFO char(80)   not null,
  CUST_HIST char(1000) not null,
  primary key (CUST_ID));

=
CUSTOMER
  CUST_ID: char (12)
  CUST_INFO: char (80)
  CUST_HIST: char (1000)
  id: CUST_ID

Semantic database migration
Semantic approach: based on an in-depth understanding of the semantics of source data.
• Provides a high quality result.
• Strong basis for the future.
• Requires a complete, up-to-date knowledge of the DB
Reverse engineering: COBOL code, IDMS-DDL code -> Physical extraction -> Physical (IDMS) schema -> Logical extraction -> Logical (DBTG) schema -> Conceptualization -> Conceptual schema
Forward engineering: Conceptual schema -> Logical design -> Logical (RDB) schema -> Physical design -> Physical (DB2) schema -> Coding -> SQL-DDL code

Semantic database migration (1): semantic migration (refinement)

SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
   02 CUST-ID   PIC X(12).
   02 CUST-INFO PIC X(80).
   02 CUST-HIST PIC X(1000).
+
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: compound (70)
    NAME: char (20)
    ADDRESS: char (40)
    STATUS: char (10)
  CUST-HIST-PURCH[0-100] array: compound (10)
    ITEM: num (5)
    TOTAL: num (5)
  id: CUST-ID
  id(CUST-HIST-PURCH): ITEM

CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: compound (70)
    NAME: char (20)
    ADDRESS: char (40)
    STATUS: char (10)
  id: CUST-ID
    0-100  record  1-1
CUST-HIST-PURCH
  Index: index (4)
  ITEM: num (5)
  TOTAL: num (5)
  id: record.CUSTOMER, ITEM
  id': record.CUSTOMER, Index

Semantic database migration (2): semantic migration (SQL translation)

Source: the CUSTOMER and CUST-HIST-PURCH schema obtained by refinement (0-100 record 1-1).

Create table CUSTOMER(
  CUST_ID      char(12) not null,
  CUST_NAME    char(28) not null,
  CUST_ADDRESS char(60) not null,
  CUST_STATUS  char(2)  not null,
  primary key (CUST_ID));

Create table CUST_HIST_PURCH(
  CUST_ID char(12) not null,
  ITEM    char(10) not null,
  CINDEX  smallint not null check(CINDEX <= 100),
  TOTAL   smallint not null,
  primary key (CUST_ID,ITEM),
  unique (CUST_ID,CINDEX),
  foreign key (CUST_ID) references CUSTOMER);

CUSTOMER
  CUST_ID
  CUST_NAME
  CUST_ADDRESS
  CUST_STATUS
  id: CUST_ID
CUST_HIST_PURCH
  CUST_ID
  ITEM
  CINDEX
  TOTAL
  id: CUST_ID, ITEM
  id': CUST_ID, CINDEX
  ref: CUST_ID
No more than 100 CUST_HIST_PURCH per CUSTOMER

Database migration - Synthesis

physical migration:
Create table CUSTOMER(
  CUST_ID   char(12)   not null,
  CUST_INFO char(80)   not null,
  CUST_HIST char(1000) not null,
  primary key (CUST_ID));

semantic migration:
Create table CUSTOMER(
  CUST_ID      char(12) not null,
  CUST_NAME    char(28) not null,
  CUST_ADDRESS char(60) not null,
  CUST_STATUS  char(2)  not null,
  primary key (CUST_ID));

Create table CUST_HIST_PURCH(
  CUST_ID char(12) not null,
  ITEM    char(10) not null,
  CINDEX  smallint not null check(CINDEX <= 100),
  TOTAL   smallint not null,
  primary key (CUST_ID,ITEM),
  unique (CUST_ID,CINDEX),
  foreign key
(CUST_ID) references CUSTOMER);

Evolution
New application: compute total sales per item

After physical migration:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: char (80)
  CUST-HIST: char (1000)
  id: CUST-ID
Query: ?

After semantic migration:
CUSTOMER
  CUST_ID
  CUST_NAME
  CUST_ADDRESS
  CUST_STATUS
  id: CUST_ID
CUST_HIST_PURCH
  CUST_ID
  ITEM
  CINDEX
  TOTAL
  id: CUST_ID, ITEM
  id': CUST_ID, CINDEX
  ref: CUST_ID
Query:
Select ITEM, sum(TOTAL) from CUST_HIST_PURCH group by ITEM;

• where is the required information? clearly visible + documentation if needed
• how to extract it from the CUSTOMER table? just name the columns
• who will develop the (C, Java, VB) program? any non-expert
• … and when? immediately, 2 minutes

Summary and conclusions

Some mundane observations
• Theories (e.g., textbooks) teach that the conceptual schema must be the unique expression of data semantics. In an ideal world, the conceptual schema exists, and all the other artefacts (DB schemas, UML diagrams, views, class schema, programs, UI) derive from it and each capture a part of this semantics.
• However, the real world doesn't learn from theories. Most often, the conceptual schema does not exist, so that only the other artefacts bear traces of the data semantics.
• Identifying, extracting, understanding and merging these traces to rebuild the conceptual schema are the very goals of database reverse engineering.

Cultural aspects of data semantics expression
1. Small personal application
   Mainly non-professional developers. Intuitive, bottom-up, incremental development. Weak culture in DB.
   Data semantics: in the UI, in application code
2. Database (record-oriented) data-intensive processing
   Professional developers. Disciplined, top-down development. Strong culture in DB.
   Data semantics: in the DB schema (including additional constraints).
3. OO data-intensive processing
   Professional developers. OO minded. Disciplined, top-down development. Weak culture in DB.
   Data semantics: in the class schema (through O/RM middleware).
4. Big data
   (Semi-)Professional developers.
   Low complexity applications. RDB discarded as old-style (however, NewSQL DBMS are lurking!)
   Data semantics: simple, loose (few constraints); metadata in data

Evolution of data semantics expression
(Figure: quality of DS representation over time, with "prog" and "DB" markers.)
• 1950 - 1975: file-oriented processing. Semantics in record schema and application code
• 1968 - 1990: hierarchical/network database processing. Semantics in DB schema
• 1980 - ?: relational database processing. Semantics in DB schema
• 1990 - 2000: object-oriented DB processing. Semantics in DB schema and application code (methods)
• 2000 - ?: object-relational DB processing. Semantics in DB schema
• 2000 - ?: O/RM processing. Semantics in class schema
• 2005 - ?: NoSQL DB processing. Semantics in data and in application code
• 2011 - ?: NewSQL DB processing. Semantics in DB schema

Some conclusions
Quite often, developers see the database as a mere repository for the data used and created by programs:
• "the database offers persistence services for the business logic layer"
• "the database is an implementation of the program classes"
So, the database is directly dependent on the current state of program architecture.
This view entails many problems when long-term maintenance and evolution are concerned. When the program changes, the database schema often must be modified accordingly, even if its semantics does not change.
It delights researchers in system evolution but leaves practitioners less enthusiastic.
The view of the database as a model of the application domain ensures a great stability of business systems.
Is the database culture still living among today's developers?

Thanks

Abstract of the lecture
The role of databases may sometimes appear controversial since they are mere basic services for a significant part of the software engineering community (the transparent "persistence layer") while they are the central component of business applications for the database community.
In this lecture, we examine the evolution of the database/program balance both in time (from the early sixties to a foreseeable future) and in space (technologies, communities) from the data semantics point of view. In particular we analyze and compare how and where data semantics has been located and implemented in each of these contexts. Current development practices tend to migrate semantics from the database (as was usual in the eighties and nineties) to the application logic (e.g., O/RM, NoSQL DB managers), a trend that may be seen as a regression that reminds us of the infancy of business application development, where files were dedicated to one application. Finally, the lecture defines how data semantics can be recovered in these scenarios.
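As one small illustration of such recovery, the data analysis strategy mentioned in the lecture (validating hypotheses about uniqueness and inclusion properties) can be sketched as below. This is a minimal sketch, not the lecture's tooling: the function names are hypothetical, the CUSTOMER rows come from the slides, and the purchase rows are invented sample data.

```python
def is_candidate_key(rows, columns):
    """Check the hypothesis 'these columns identify the rows' against the data."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False  # one duplicate is enough to refute the hypothesis
        seen.add(key)
    return True

def inclusion_holds(child_rows, child_column, parent_rows, parent_column):
    """Check the hypothesis 'child_column is a foreign key to parent_column'."""
    parent_values = {row[parent_column] for row in parent_rows}
    return all(row[child_column] in parent_values for row in child_rows)

# CUSTOMER rows as shown in the slides.
CUSTOMERS = [
    {"CustID": "C400", "Name": "Darwen", "City": "London"},
    {"CustID": "B512", "Name": "Owens", "City": "NY"},
    {"CustID": "S144", "Name": "Garcia", "City": "Madrid"},
]

# Hypothetical purchase rows, invented for this example.
PURCHASES = [
    {"CustID": "C400", "Item": "00017", "Total": 3},
    {"CustID": "C400", "Item": "00042", "Total": 1},
    {"CustID": "B512", "Item": "00017", "Total": 5},
]
```

With these rows, `is_candidate_key(CUSTOMERS, ["CustID"])` and `inclusion_holds(PURCHASES, "CustID", CUSTOMERS, "CustID")` both hold, while `is_candidate_key(PURCHASES, ["CustID"])` does not. Data can only refute a hypothesis: a property that holds on the current rows remains a candidate to be confirmed from programs, documentation or domain knowledge.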