February 5, 2014
Stevens Award lecture
WCRE-CSMR 2014
PReCISE
Data matters most
but where has all the semantics gone?
- A (sort of) spatio-temporal view of DB reverse engineering -
Jean-Luc Hainaut
University of Namur
Faculté d'informatique
PReCISE Research Center - Database Engineering Group
www.info.fundp.ac.be/libd
• Introduction
• Understanding data semantics
• Data models
• Tracing data semantics
• Recovering hidden data semantics
• Is data semantics recovery that important, actually?
• Summary and conclusions
Introduction
Objectives of the lecture
1. To study the concept of data semantics in business applications
2. To identify and evaluate the techniques used to represent data semantics
3. To observe how these techniques have evolved in time and in different cultures.
4. To discuss the methods used to recover the semantics lost when poor
representation techniques have been used.
The role of data in business applications
Axioms on databases
1. The database is a picture of the application domain
   • Its schema is a model of the static structures of the domain
   • Its data describe the current state (or a sequence of states) of the domain
2. The database is designed independently of the application programs
   The database is designed before the application programs
3. The evolution of the database schema reflects the evolution of the functional requirements
4. The database is described by (at least) two schemas:
   • the conceptual schema: abstract, platform-independent
     formalism: ER model, conceptual UML class diagrams
   • the logical schema: concrete, platform-dependent
     formalism: SQL2, Java classes
   There exists a bidirectional mapping between both.
The role of data in business applications
Meta-axioms on axioms on databases
1. The axioms often are ignored by developers
   - ignore = how interesting! I didn't know them
   - ignore = I know them but they do not suit my way of working
3. The biggest violation of the axioms concerns the existence and role of the
   conceptual schema
Understanding data semantics
Experimental approach and first conclusions
Preliminary question
Same data, different structures
To what extent does each of these data sets express the semantics of the data?

CUSTOMER
CustID  Name    City
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid

T
C1      C2      C3
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid

T
C
C400 Darwen London
B512 Owens NY
S144 Garcia Madrid
Motivating example: 1. Reading data from a COBOL file (1970)
Application code (COBOL)
WORKING-STORAGE SECTION.
01 CUSTOMER.
   02 CustID PIC X(12).
   02 Name PIC X(60).
   02 City PIC X(40).
READ FILE1
    INTO CUSTOMER.
Record read into CUSTOMER
CustID  Name    City
B512    Owens   NY
External file
SELECT FILE1
    ASSIGN TO "FILE1.DAT"
    ORGANIZATION IS INDEXED
    ACCESS MODE IS DYNAMIC
    RECORD KEY IS RKEY.
FD FILE1.
01 REC.
   02 RKEY PIC X(12).
   02 RINFO PIC X(100).
File contents (REC)
RKEY    RINFO
C400    Darwen London
B512    Owens NY
S144    Garcia Madrid
[Figure: the program structure CUSTOMER (CustID, Name, City) is mapped onto the file record REC (RKEY, RINFO)]
[Figure: share of the data semantics carried by the program structure CUSTOMER (CustID, Name, City): 93%; by the file record REC (RKEY, RINFO): 10%]
Where has data semantics been defined?
• In file description (10%) - [unique key, key data type]
• In application code (93%).
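To make the point concrete, here is a minimal Python sketch of what the COBOL program does implicitly: the file only knows RKEY and an opaque RINFO, and it is the program's CUSTOMER layout that splits RINFO into Name and City. The field widths come from the PIC clauses on the slide; the helper names `encode_rec` and `decode_rec` are hypothetical.

```python
from dataclasses import dataclass

# Widths taken from the slide: RKEY PIC X(12), RINFO PIC X(100),
# and CUSTOMER = CustID X(12) + Name X(60) + City X(40).
RKEY_LEN, NAME_LEN, CITY_LEN = 12, 60, 40


@dataclass
class Customer:
    cust_id: str
    name: str
    city: str


def encode_rec(cust_id: str, name: str, city: str) -> str:
    """Build a raw 112-character REC as it would sit in the file (hypothetical helper)."""
    return cust_id.ljust(RKEY_LEN) + name.ljust(NAME_LEN) + city.ljust(CITY_LEN)


def decode_rec(rec: str) -> Customer:
    """Apply the CUSTOMER layout, which is declared only in the program."""
    rkey, rinfo = rec[:RKEY_LEN], rec[RKEY_LEN:]
    name = rinfo[:NAME_LEN].rstrip()
    city = rinfo[NAME_LEN:NAME_LEN + CITY_LEN].rstrip()
    return Customer(rkey.rstrip(), name, city)


raw = encode_rec("B512", "Owens", "NY")
customer = decode_rec(raw)
print(customer)
```

Without `decode_rec` (that is, without the program), nothing in the file description says where the name ends and the city begins.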
Motivating example: 2. Reading data from an RDB (1980+)
Application code (C)
string v1;
string v2;
string v3;
select * into v1,v2,v3
from CUSTOMER
where CustID = 'B512'
Values read: v1 = B512, v2 = Owens, v3 = NY
Relational DB
create table CUSTOMER(
CustID char(12) not null,
Name char(60) not null,
City char(40) not null,
primary key (CustID));
CUSTOMER
CustID  Name    City
C400    Darwen  London
B512    Owens   NY
S144    Garcia  Madrid
[Figure: the program variables v1, v2, v3 are mapped onto the columns CustID, Name, City of table CUSTOMER]
[Figure: share of the data semantics carried by table CUSTOMER (CustID, Name, City): 100%; by the program variables v1, v2, v3: 3%]
Where has data semantics been defined?
• In DB schema (100%)
• In application code (3%) - [data type].
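As an illustration, the sketch below uses Python's `sqlite3` module (an assumption: SQLite stands in for the RDBMS of the slide) to show that column names, types, nullability and the key can be read back from the database itself, with no application code involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table CUSTOMER(
        CustID char(12) not null,
        Name   char(60) not null,
        City   char(40) not null,
        primary key (CustID))
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(CUSTOMER)").fetchall()

# Map each column name to (declared type, is mandatory, is part of the key).
schema = {
    name: (ctype, bool(notnull), bool(pk))
    for _, name, ctype, notnull, _, pk in columns
}

for name, description in schema.items():
    print(name, description)
```

Other DBMS expose the same information through their own catalog; the `PRAGMA` call is specific to SQLite.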
What does data semantics mean?
A tentative practical definition
Data semantics is the knowledge defined by all the non-technical, domain-dependent
information that allows us to understand, to use and to manage the data.
Where can we find traces of data semantics?
• in the application code (reading from file)
• in the DB schema (reading from DB)
A first (trivial) observation
It is best to express data semantics in the database schema
1. Expressiveness: DDL is the most appropriate language to declare data
structures and constraints
2. Language independence: DDL is independent of application programming
languages
3. Uniqueness: the schema is unique and centralized
4. Integration with data: the schema is a part of the database (no risk of losing it!)
5. Program independence: the schema is independent of application programs
6. Stability: the schema must be changed only when the application domain evolves.
However, things are not always that simple (e.g., COBOL files)
Only data structures are explicit in application programs:
• record name
• field name
• field data type
Additional constraints generally are controlled by the application code:
• where?
• in which way?
• in all the modules processing the data?
Understanding data semantics by analyzing the program code can be much more complex
than expected.
However, things are not always that simple (e.g., RDB)
Only standard integrity constraints can be coded through the DDL (SQL2):
• not null
• uniqueness
• referential integrity
Additional constraints must be coded through generic means:
• check predicates
• triggers
• stored procedures
Understanding data semantics by reading the database schema can be less easy than
expected.
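A small sketch of these "generic means", again assuming SQLite through Python's `sqlite3` module. The two business rules (an overdraft limit of 1000 and an immutable customer identifier) are hypothetical and only serve to show a check predicate and a trigger at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table CUSTOMER(
        CustID  char(12) not null primary key,
        Name    char(60) not null,
        Account integer  not null,
        -- check predicate: hypothetical rule "overdraft limited to 1000"
        check (Account >= -1000));

    -- trigger: hypothetical rule "a customer identifier never changes"
    create trigger CUSTOMER_ID_FROZEN
    before update of CustID on CUSTOMER
    begin
        select raise(abort, 'CustID cannot be modified');
    end;
""")
conn.execute("insert into CUSTOMER values ('C400', 'Darwen', -124)")


def rejected(sql: str) -> bool:
    """Return True when the DBMS refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.DatabaseError:
        return True


overdraft_rejected = rejected("insert into CUSTOMER values ('B512', 'Owens', -5000)")
rename_rejected = rejected("update CUSTOMER set CustID = 'X999' where CustID = 'C400'")
print(overdraft_rejected, rename_rejected)
```

Both rules are enforced by the DBMS, yet a reader of the schema has to interpret a predicate and a trigger body to recover their meaning, which is the difficulty the slide points at.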
Data models
Data models: abstraction hierarchy
Reminder on the database design process - The standard view
User requirements
  -> Information analysis -> Conceptual schema
  -> Logical design -> Logical (RDB) schema
  -> Physical design -> Physical (DB2) schema
  -> Coding -> SQL-DDL code
Data semantics and data models
The way data semantics is expressed in a database depends on its data model
Conceptual models
• ER (*)
• UML class diagrams
Logical models
• Record oriented models:
  • files
  • legacy DBMS (IMS, CODASYL)
  • RDB (*)
• Key-Value models:
  • NoSQL (*)
  • CSV
• Structured object models:
  • OO
  • NoSQL
  • JSON (*)
  • XML
ER conceptual model
Abstract, platform-independent information description
The world is perceived as:
- sets of entities,
- properties that characterize entities,
- relationships holding between entities.
A conceptual schema can be translated into several logical,
DBMS-dependent, schemas
Example schema:
CUSTOMER (CustID, Name, City; id: CustID)
ORDER (OrdID, DateOrd, Account; id: OrdID)
place: CUSTOMER [0-N] - ORDER [1-1]
Relational data model (schema-based, 1NF)
metadata:  CustID  Name    City    Account
data:      C400    Darwen  London  -124
           B512    Owens   NY      5509
           S144    Garcia  Madrid  0
• Domain-dependent schema
• Schema and data are hierarchically distinct
• Values are aggregated into rows
• The semantics is explicit in the schema (part of!)
• The semantics is managed/controlled by the DBMS
Examples: Oracle, DB2, SQL Server, MySQL, PostgreSQL, etc.
Key-Value data model (schema-less, triples, 1NF)
meta-metadata:  ENTITY  ATTRIBUTE  VALUE
                90317   CustID     C400
                90317   Name       Darwen
                90317   City       London
                90317   Account    -124
                59731   CustID     B512
                59731   Name       Owens
                59731   City       NY
                59731   Account    5509
                66830   CustID     S144
                66830   Name       Garcia
                66830   City       Madrid
                66830   Account    0
The ATTRIBUTE column holds the metadata, the VALUE column holds the data.
• Domain-independent schema
• Metadata mixed with data
• Elementary Key-Value
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
Examples: Oracle NoSQL, BerkeleyDB, Voldemort, Riak, Redis
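The sketch below illustrates what "managed/controlled by application programs" can mean in practice: the program has to regroup the triples into records and re-code by hand the constraints that a relational schema would declare. The triples are those of the slide; the function names and the choice of mandatory attributes are assumptions made for the example.

```python
from collections import defaultdict

# (ENTITY, ATTRIBUTE, VALUE) triples copied from the slide; values are stored as text.
triples = [
    (90317, "CustID", "C400"), (90317, "Name", "Darwen"),
    (90317, "City", "London"), (90317, "Account", "-124"),
    (59731, "CustID", "B512"), (59731, "Name", "Owens"),
    (59731, "City", "NY"), (59731, "Account", "5509"),
    (66830, "CustID", "S144"), (66830, "Name", "Garcia"),
    (66830, "City", "Madrid"), (66830, "Account", "0"),
]

# Assumption: all four attributes are mandatory, as in the relational version.
REQUIRED = {"CustID", "Name", "City", "Account"}


def pivot(rows):
    """Regroup triples into one dictionary per entity."""
    entities = defaultdict(dict)
    for entity, attribute, value in rows:
        entities[entity][attribute] = value
    return dict(entities)


def check_customer(row):
    """Hand-coded counterpart of 'not null' and of the numeric type of Account."""
    missing = REQUIRED - row.keys()
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    return {**row, "Account": int(row["Account"])}


customers = {entity: check_customer(row) for entity, row in pivot(triples).items()}
print(customers[59731])
```

Every program reading this store needs its own copy of `check_customer`, or a shared middleware layer that plays the role of the missing schema.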
Structured object data models (schema-less, NF2)
meta-metadata:  ENTITY  ATTRIBUTES
                90317   {"CustID": "C400", "Name": "Darwen", "City": "London", "Account": -124}
                59731   {"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}
                66830   {"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}
Inside each object, the keys are the metadata and the values are the data.
• Domain-independent schema
• Metadata mixed with data
• Aggregated Key-Value into objects (here in JSON)
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
Examples: CouchDB, MongoDB (BSON), SimpleDB
Tracing data semantics
In the real world, where is semantics expressed?
We have identified two places: DB schema and application code.
Are there other places?
Architectural framework
• Documentation (text, structured, ontology)
• User interface
  - data structure
  - labels
  - help, error messages
• Application code
  - data structures
  - procedural code
• Class schema
• Object/Relational mapping
• DB logical schema
  - global schema
  - views
• Data
Semantics in the documentation
Documentation (text, structured, ontology)
• Functional documentation (should include the conceptual schema)
• Technical documentation (should include the logical schema)
Drawback
the documentation often is
• obsolete,
• incomplete,
• inconsistent,
• missing
Semantics in the DB schema
DB logical schema
- global logical schema
- views
The logical schema is DBMS-dependent.
It is a more or less faithful implementation of the conceptual schema.
Some views can be more detailed than the logical schema.
Drawbacks
• not a conceptual schema
• additional constraints not always trivial to identify and to understand
Semantics in the class schema
T: Class schema <-> DB logical schema
Bidirectional relation/object transformation.
Solving the impedance mismatch problem
The class schema seen as the domain model.
It is implemented into a relational database, which ensures object persistence.
The DB schema itself is hidden and may bear little semantics.
Drawbacks
• inappropriate formalism
• poor change propagation mechanism (if any)
• semantics in the application and not in the DB
• data model not easily shared by several applications
Semantics in the application code
Application code
- data structures
- procedural code
Internal data structures may be more explicit than the DB schema.
Data integrity constraints checked by the application code.
Understanding data semantics from the way programs process the data.
However, program analysis is far from trivial:
• size (millions of LOC)
• architectural complexity
• algorithmic complexity
• data flow complexity
• creative data processing
Drawbacks
• redundancies (a constraint may be checked in many places)
• distributed traces (potential inconsistencies)
Semantics in the GUI
User interface
- data structure
- labels
- help, error messages
The UI often is a view on a part of the database.
This view is intended for users, hence user-friendly.
Provides useful hints about the constraints and meaning of data:
• data structure (data types, aggregates)
• explicit labels
• sample data
• informative help and error messages
Drawbacks
• distributed control (potential inconsistencies)
• does not cover all the database objects
Semantics in the data (record-oriented models)
Data
In standard models
Data analysis: finding relationships among data
• uniqueness
• data types
• inclusion properties (foreign keys)
• etc.
Main strategy
• validating hypotheses
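Hypothesis validation on the data can be sketched in a few lines of Python. The customer rows come from the lecture; the ORDER rows and the two helper functions are hypothetical, introduced only to test a uniqueness hypothesis and an inclusion (foreign key) hypothesis.

```python
def is_unique(rows, column):
    """Hypothesis: 'column' identifies the rows (no duplicate value)."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def is_included(child_rows, child_column, parent_rows, parent_column):
    """Hypothesis: every child value also appears in the parent (foreign key candidate)."""
    parent_values = {row[parent_column] for row in parent_rows}
    return all(row[child_column] in parent_values for row in child_rows)


customers = [
    {"CustID": "C400", "Name": "Darwen", "City": "London"},
    {"CustID": "B512", "Name": "Owens", "City": "NY"},
    {"CustID": "S144", "Name": "Garcia", "City": "Madrid"},
]

# Hypothetical order data: the lecture shows no ORDER rows.
orders = [
    {"OrdID": "O1", "CustID": "C400"},
    {"OrdID": "O2", "CustID": "C400"},
    {"OrdID": "O3", "CustID": "S144"},
]

print(is_unique(customers, "CustID"))                        # key candidate for CUSTOMER
print(is_unique(orders, "CustID"))                           # not a key of ORDER
print(is_included(orders, "CustID", customers, "CustID"))    # foreign key candidate
```

The data can only refute a hypothesis: a column that happens to be unique in the current rows is a candidate key, not a proven one, so the result still needs confirmation from the other sources (schema, programs, documentation).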
Semantics in the data (alternative models)
Data
In alternative (schema-less) models
Metadata extraction
But also data analysis as in standard models
Experience
• none. Too new.
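Metadata extraction from schema-less data could look like the following sketch, which infers attribute names, observed value types and a "mandatory" flag from a set of JSON documents. The documents follow the customer example of the lecture; the function name and the output format are assumptions, not a method described in the slides.

```python
import json
from collections import defaultdict

# JSON documents following the customer example of the lecture.
documents = [
    '{"CustID": "C400", "Name": "Darwen", "City": "London", "Account": -124}',
    '{"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}',
    '{"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}',
]


def extract_metadata(docs):
    """Infer, per attribute, the observed value types and whether every document has it."""
    parsed = [json.loads(doc) for doc in docs]
    types = defaultdict(set)
    counts = defaultdict(int)
    for doc in parsed:
        for key, value in doc.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(types[key]), "mandatory": counts[key] == len(parsed)}
        for key in types
    }


metadata = extract_metadata(documents)
for attribute, description in metadata.items():
    print(attribute, description)
```

The result is an implicit schema: it describes what the data currently look like, which is weaker than a declared schema stating what they are allowed to look like.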
Recovering hidden data semantics:
database reverse engineering
DB reverse engineering
Definition
Reverse engineering a piece of software consists, among other things, in recovering
or reconstructing its functional and technical specifications, starting mainly
from the source text of the programs. Recovering these specifications is generally
intended to redocument, convert, refactor, maintain or extend existing
applications.
Database reverse engineering is that part of Information System Engineering
that addresses the problems and techniques related to the recovery of the
conceptual and logical schemas of files and databases of existing systems.
DB reverse engineering
DB reverse engineering methodology
• Project planning
• Pilot
• Full project
• Source management
• Physical extraction
• Logical extraction -> Logical (RDB) schema
  - Schema analysis
  - Data analysis
  - Program analysis
  - Class analysis
  - UI analysis
  - Others
• Conceptualization -> Conceptual schema
  - De-optimization
  - Untranslation
  - Normalization
Is data semantics recovery that important, actually?
Yes. Definitely!
Can you prove it? At least I can show you an example.
Example: database application migration
Porting a complete existing application, or some of its components, to another, generally
more modern, platform.
For a database: changing its DMS. A popular example: migrating the legacy set of files of
a business application to an RDBMS.
Two main approaches:
• physical approach
• semantic approach
Physical database migration
Database migration
The physical, or one-to-one, migration strategy is the cheapest but also the worst
approach since it deeply degrades the final structure.
Requires no knowledge of data semantics, hence very popular
COBOL code -> Physical extraction -> Physical (file) schema
  -> Transform -> Physical (DB2) schema
  -> Coding -> SQL-DDL code
Physical database migration
physical (one-to-one) migration
Source (COBOL):
SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
   02 CUST-ID PIC X(12).
   02 CUST-INFO PIC X(80).
   02 CUST-HIST PIC X(1000).
Equivalent physical (file) schema:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: char (80)
  CUST-HIST: char (1000)
  id: CUST-ID
Target (SQL):
Create table CUSTOMER(
CUST_ID char(12) not null,
CUST_INFO char(80) not null,
CUST_HIST char(1000) not null,
primary key (CUST_ID));
Equivalent physical (DB2) schema:
CUSTOMER
  CUST_ID: char (12)
  CUST_INFO: char (80)
  CUST_HIST: char (1000)
  id: CUST_ID
Semantic database migration
Database migration
Semantic approach: based on an in-depth understanding of the semantics of source data.
Provides a high quality result. Strong basis for the future.
Requires a complete, up-to-date knowledge of the DB
Reverse Engineering:
IDMS-DDL code, COBOL code -> Physical extraction -> Physical (IDMS) schema
  -> Logical extraction -> Logical (DBTG) schema
  -> Conceptualization -> Conceptual schema
Database design:
Conceptual schema -> Logical design -> Logical (RDB) schema
  -> Physical design -> Physical (DB2) schema
  -> Coding -> SQL-DDL code
Semantic database migration (1)
semantic migration (refinement)
Source (COBOL):
SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
   02 CUST-ID PIC X(12).
   02 CUST-INFO PIC X(80).
   02 CUST-HIST PIC X(1000).
Schema after refinement:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: compound (70)
    NAME: char (20)
    ADDRESS: char (40)
    STATUS: char (10)
  CUST-HIST-PURCH[0-100] array: compound (10)
    ITEM: num (5)
    TOTAL: num (5)
  id: CUST-ID
  id(CUST-HIST-PURCH): ITEM
Schema after extracting CUST-HIST-PURCH as an entity type:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: compound (70)
    NAME: char (20)
    ADDRESS: char (40)
    STATUS: char (10)
  id: CUST-ID
record: CUSTOMER [0-100] - CUST-HIST-PURCH [1-1]
CUST-HIST-PURCH
  Index: index (4)
  ITEM: num (5)
  TOTAL: num (5)
  id: record.CUSTOMER, ITEM
  id': record.CUSTOMER, Index
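Once the refined layout is known, converting a flat record becomes mechanical. The sketch below splits one 1092-character CUSTOMER record according to the refined schema (NAME 20, ADDRESS 40, STATUS 10, then up to 100 purchase slots of ITEM 5 + TOTAL 5). Two points are assumptions: unused purchase slots are taken to be blank, and the sample values (status, items, totals) are invented for the example.

```python
def split_customer(record: str):
    """Decompose a flat CUSTOMER record into one customer and its purchases."""
    cust_id = record[0:12].rstrip()
    info = record[12:92]      # CUST-INFO PIC X(80), of which 70 characters are used
    hist = record[92:1092]    # CUST-HIST PIC X(1000) = 100 slots of 10 characters

    customer = {
        "CUST_ID": cust_id,
        "NAME": info[0:20].rstrip(),
        "ADDRESS": info[20:60].rstrip(),
        "STATUS": info[60:70].rstrip(),
    }

    purchases = []
    for index in range(100):
        slot = hist[index * 10:(index + 1) * 10]
        if not slot.strip():
            continue  # assumption: a blank slot is an unused array cell
        purchases.append({
            "CUST_ID": cust_id,
            "CINDEX": index + 1,
            "ITEM": int(slot[:5]),
            "TOTAL": int(slot[5:]),
        })
    return customer, purchases


# Hypothetical record: two purchases, (item 12, total 30) and (item 45, total 7).
record = (
    "C400".ljust(12)
    + ("Darwen".ljust(20) + "London".ljust(40) + "GOLD".ljust(10)).ljust(80)
    + ("0001200030" + "0004500007").ljust(1000)
)
customer, purchases = split_customer(record)
print(customer)
print(purchases)
```

None of the offsets used here appear in the file description; they are exactly the semantics that reverse engineering has to recover before such a converter can be written.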
Semantic database migration (2)
semantic migration (SQL translation)
Source schema:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: compound (70)
    NAME: char (20)
    ADDRESS: char (40)
    STATUS: char (10)
  id: CUST-ID
record: CUSTOMER [0-100] - CUST-HIST-PURCH [1-1]
CUST-HIST-PURCH
  ITEM: num (5)
  Index: index (4)
  TOTAL: num (5)
  id: record.CUSTOMER, ITEM
  id': record.CUSTOMER, Index
Target (SQL):
Create table CUSTOMER(
CUST_ID char(12) not null,
CUST_NAME char(28) not null,
CUST_ADDRESS char(60) not null,
CUST_STATUS char(2) not null,
primary key (CUST_ID));
Create table CUST_HIST_PURCH(
CUST_ID char(12) not null,
ITEM char(10) not null,
CINDEX smallint not null check(CINDEX <= 100),
TOTAL smallint not null,
primary key (CUST_ID,ITEM),
unique (CUST_ID,CINDEX),
foreign key (CUST_ID) references CUSTOMER);
Equivalent relational schema:
CUSTOMER
  CUST_ID
  CUST_NAME
  CUST_ADDRESS
  CUST_STATUS
  id: CUST_ID
CUST_HIST_PURCH
  CUST_ID
  ITEM
  CINDEX
  TOTAL
  id: CUST_ID, ITEM
  id': CUST_ID, CINDEX
  ref: CUST_ID
Constraint: no more than 100 CUST_HIST_PURCH per CUSTOMER
Database migration - Synthesis
physical migration
Create table CUSTOMER(
CUST_ID char(12) not null,
CUST_INFO char(80) not null,
CUST_HIST char(1000) not null,
primary key (CUST_ID));
semantic migration
Create table CUSTOMER(
CUST_ID char(12) not null,
CUST_NAME char(28) not null,
CUST_ADDRESS char(60) not null,
CUST_STATUS char(2) not null,
primary key (CUST_ID));
Create table CUST_HIST_PURCH(
CUST_ID char(12) not null,
ITEM char(10) not null,
CINDEX smallint not null check(CINDEX <= 100),
TOTAL smallint not null,
primary key (CUST_ID,ITEM),
unique (CUST_ID,CINDEX),
foreign key (CUST_ID) references CUSTOMER);
Evolution
new application: compute total sales per item
Schema obtained by physical migration:
CUSTOMER
  CUST-ID: char (12)
  CUST-INFO: char (80)
  CUST-HIST: char (1000)
  id: CUST-ID
Schema obtained by semantic migration:
CUSTOMER
  CUST_ID
  CUST_NAME
  CUST_ADDRESS
  CUST_STATUS
  id: CUST_ID
CUST_HIST_PURCH
  CUST_ID
  ITEM
  CINDEX
  TOTAL
  id: CUST_ID, ITEM
  id': CUST_ID, CINDEX
  ref: CUST_ID
Query on the semantic schema:
Select ITEM, sum(TOTAL)
from CUST_HIST_PURCH
group by ITEM;
Questions raised by the physical schema, and the answers given by the semantic schema:
• where is the required information?
  - clearly visible + documentation if needed
• how to extract it from the CUSTOMER table?
  - just name the columns
• who will develop the (C, Java, VB) program?
  - any non-expert
• … and when?
  - immediately, 2 minutes
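The "2 minutes" claim can be tried out with a short sketch. It assumes SQLite (through Python's `sqlite3`) as a stand-in for the target RDBMS, reuses the two tables of the semantic migration, and loads a few invented rows; only the final query comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table CUSTOMER(
        CUST_ID      char(12) not null,
        CUST_NAME    char(28) not null,
        CUST_ADDRESS char(60) not null,
        CUST_STATUS  char(2)  not null,
        primary key (CUST_ID));

    create table CUST_HIST_PURCH(
        CUST_ID char(12) not null,
        ITEM    char(10) not null,
        CINDEX  smallint not null check (CINDEX <= 100),
        TOTAL   smallint not null,
        primary key (CUST_ID, ITEM),
        unique (CUST_ID, CINDEX),
        foreign key (CUST_ID) references CUSTOMER);
""")

# Hypothetical sample rows, invented for the example.
conn.executemany(
    "insert into CUSTOMER values (?, ?, ?, ?)",
    [("C400", "Darwen", "London", "A"), ("B512", "Owens", "NY", "B")],
)
conn.executemany(
    "insert into CUST_HIST_PURCH values (?, ?, ?, ?)",
    [("C400", "PEN", 1, 30), ("C400", "INK", 2, 7), ("B512", "PEN", 1, 12)],
)

# The query of the slide, with an ORDER BY added to get a stable result order.
totals = conn.execute(
    "select ITEM, sum(TOTAL) from CUST_HIST_PURCH group by ITEM order by ITEM"
).fetchall()
print(totals)
```

With the physically migrated table, the same result would require a program that first cuts the 1000-character CUST_HIST column into its purchase slots, using layout knowledge that the schema does not contain.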
Summary and conclusions
Some mundane observations
• Theories (e.g., textbooks) teach that the conceptual schema must be the unique expression of
data semantics. In an ideal world, the conceptual schema exists, and all the other artefacts (DB
schemas, UML diagrams, views, class schema, programs, UI) derive from it and each capture a
part of this semantics.
• However, the real world doesn't learn from theories. Most often, the conceptual schema does
not exist, so that only the other artefacts bear traces of the data semantics.
• Identifying, extracting, understanding and merging these traces to rebuild the conceptual schema
are the very goals of database reverse engineering.
Cultural aspects of data semantics expression
1. Small personal application
Mainly non-professional developers. Intuitive, bottom-up, incremental development.
Weak culture in DB.
Data semantics: in the UI, in application code
2. Database (record-oriented) data-intensive processing
Professional developers. Disciplined, top-down development.
Strong culture in DB.
Data semantics: in the DB schema (including additional constraints).
3. OO data-intensive processing
Professional developers. OO minded. Disciplined, top-down development.
Weak culture in DB.
Data semantics: in the class schema (through O/RM middleware).
4. Big data
(Semi-)Professional developers. Low complexity applications.
RDB discarded as old-style (however NewSQL DBMS are lurking!)
Data semantics: simple, loose (few constraints); metadata in data
Evolution of data semantics expression
[Figure: quality of DS representation over time, with semantics located in the programs (prog) or in the database (DB)]
1950 - 1975: file-oriented processing
  Semantics in record schema and application code
1968 - 1990: hierarchical/network database processing
  Semantics in DB schema
1980 - ?: relational database processing
  Semantics in DB schema
1990 - 2000: object-oriented DB processing
  Semantics in DB schema and application code (methods)
2000 - ?: object-relational DB processing
  Semantics in DB schema
2000 - ?: O/RM processing
  Semantics in class schema
2005 - ?: NoSQL DB processing
  Semantics in data and in application code
2011 - ?: NewSQL DB processing
  Semantics in DB schema
Some conclusions
Quite often, developers see the database as a mere repository for the data used and
created by programs:
• "the database offers persistence services for the business logic layer"
• "the database is an implementation of the program classes"
So, the database is directly dependent on the current state of program architecture.
This view entails many problems when long term maintenance and evolution are
concerned. When the program changes, the database schema often must be modified
accordingly, even if its semantics does not change.
This delights researchers in system evolution but leaves practitioners less
enthusiastic.
The view of the database as a model of the application domain ensures a great stability
of business systems.
Is the database culture still alive among today's developers?
Thanks
Abstract of the lecture
The role of databases may sometimes appear controversial: for a significant part of the software engineering community they are mere basic services (the transparent "persistence layer"), while for the database community they are the central component of business applications. In this lecture, we examine the evolution of the database/program balance both in time (from the early sixties to a foreseeable future) and in space (technologies, communities) from the data semantics point of view. In particular we analyze and compare how and where data semantics has been located and implemented in each of these contexts. Current development practices tend to migrate semantics from the database (as was usual in the eighties and nineties) to the application logic (e.g., O/RM, NoSQL DB managers), a trend that may be seen as a regression reminiscent of the infancy of business application development, when files were dedicated to one application.
Finally, the lecture defines how data semantics can be recovered in these scenarios.