ISTAT SDMX Framework
Submitted by
Francesco Rizzo, Laura Vignola, Dario Camol, Mauro Bianchi
National Institute of Statistics (ISTAT)
Rome, Italy
1. THE REASON WHY ISTAT IS STARTING TO USE SDMX
2. SDMX ISTAT FRAMEWORK
2.1 Short-term time series database (ConIstat)
2.1.1 Check and Loader software modules
2.1.2 ConIstat Web Navigator software module
2.2. Changes and new modules
2.2.1 Database changes
2.2.2 Extensions in the Check and Loader modules
2.2.3 The web service module
2.2.4 SDMX Web Navigator
2.2.5 Manager and web navigator reference metadata module
4.1 A GENERALIZED SDMX QUERY PARSER MODULE (BETA VERSION)
1. The reason why Istat is starting to use SDMX
Istat is following the evolution of the SDMX initiative with interest, along two lines:
1. tactically, through participation in the EUROSTAT SODI initiative;
2. strategically, through a working group whose main aim is to analyze and verify
the use of SDMX in the internal architecture of the Istat Information System.
The current Istat Information System is supported by a distributed architecture. Several production
Directorates operate through local sub-systems that independently cover the full life cycle of
statistical data, from data collection to data dissemination. As a result, disseminated data and metadata
currently reside in different databases maintained by different production Directorates.
To access these sub-systems, users must use navigation interfaces that are completely different from
each other.
Unifying data search under a single interface has therefore become a pressing requirement.
To achieve this objective, Istat is developing an INTEGRATED OUTPUT MANAGEMENT
SYSTEM. This is an Information System oriented towards the integration of part of the life cycle of
the Institute's statistical data, with particular emphasis on data dissemination.
The experience that we are acquiring by participating in the SODI project will allow us to support the
strategic interest in SDMX within Istat.
To support this objective, we are developing a framework consisting of various mutually compatible
SDMX software modules. The framework can be used as a whole, from the reporting phase to the
dissemination phase, or its modules can be used separately and integrated into an existing
Information System.
In the future it might be possible to distribute the framework under a Public Licence.
This document describes the architecture and the modules of the SDMX ISTAT Framework version
1.0.
2. SDMX Istat framework
Istat is taking part in the SODI project with an internal working group made up of one analyst and
four programmers. The project is divided into the following phases:
• study of SDMX;
• reporting analysis – to find out where data is stored, particularly STS and ESA data. We
found that most of this data is stored in the ConIstat database (short-term time-series database,
available at the URL: http://con.istat.it);
• analysis and design – to determine which existing modules must be extended and which new
modules must be built. In particular, the system must perform the following functions:
o extend data reporting in order to integrate the existing database with other data required by the
SODI project;
o map the existing data structures to the DSDs defined by Eurostat;
o provide a web service capable of accepting an SDMX Query and responding with an
SDMX Compact message;
o provide an RSS file that announces when new data is loaded or updated and
gives the URL of the files containing the SDMX Queries that
retrieve the new or updated data;
o provide a web client program that connects to the web service. This client can
query the database using the statistical concepts defined in the DSDs. As well as
displaying the results of the search, in XML and HTML format, it can display
the SDMX Query that will be sent to the web service.
The following diagram shows the existing system (ConIstat) and new modules added to satisfy the
SODI project.
[Diagram: architecture of the framework. Legend: ConIstat modules, new modules, modules not yet completed.
Labels shown: Check; Loader; file types .dat, .ist and gesmes; time-series database; ConIstat web navigator (con.istat.it); STS and ESA mapping; SDMX data web service (sodi.istat.it/sodiWs/service1.asmx); SDMX web navigator (sodi.istat.it); RSS builder & query builder (sodi.istat.it/RSS/rss.xml, sodi.istat.it/query/*.xml); reference metadata manager, reference metadata database, reference metadata web service, reference metadata web navigator and Word extractor; reference metadata RSS and reference metadata SDMX Query.]
2.1 Short-term time series database (ConIstat)
The Istat short-term time series database stores 16,000 time series. The following statistical subject-matter
domain list shows how these time series are organized:
• Price
o Consumer price
o Producer price
• Services
o Retail trade sales
o Tourism
o Other services
• Employment, wages and other labour indicators
o Large firms labour indicators
o Large firms labour indicators (seasonally adjusted data)
o Large firms labour indicators (data adjusted for calendar effects)
o Quarterly indicators on employment, earnings and social security contribution
• Construction
o Production in construction
• Industry
o Turnover
o New orders
o Industrial production
o Orders stock
• Foreign trade
o Imports
o Balances
o Exports
o Exports in Italian regions
• Labour force
o Labour force
o Employment
o Unemployment
o Seasonally adjusted data
o Rates
o Non labour force
o Population
• Quarterly National Accounts
o Income statement of jobs and resources fob-fob
o Final consumption of households
o Costs and margins
o Gross fixed capital formation
o Imports and exports fob-fob
o Labour units and income
o Value added at basic price
o Value added at produced price
o Contribution to percent change in GDP
The ConIstat system consists of the time series database and a set of software tools that cover the
whole data flow, from reporting to dissemination.
2.1.1 Check and Loader software modules
Production Directorates deliver data updates in a fixed-format record file (.dat). The following is an
example of such a file for the Production Price:
ppa19grCB    200611  119,9
ppa19grD     200611  114,8
ppa19grDA    200611  111,8
ppa19grDG    200611  114,5
ppa19grE     200611  149,3
ppa19grCB14  200611  119,9
[Diagram: Check and Loader software modules before the extension. Elements shown: fixed-format record file, Check, intermediate fixed-format record file, Loader, time series database, dissemination fixed-format record file.]
The person responsible for sending data on behalf of a production Directorate uses the Check
module to prepare a “checked file” that is free from certain types of errors. The “.dat” files are generally
prepared automatically by the production Directorate sub-systems, but occasionally they are
prepared manually, so a check process is necessary.
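The kind of validation performed by the Check module can be sketched as follows. This is an illustration only: the source does not document the actual .dat column layout or the checks applied, so the whitespace-separated three-field layout (series code, YYYYMM period, value with a decimal comma) and the function names `check_record` and `check_file` are assumptions.

```python
from decimal import Decimal, InvalidOperation


def check_record(line):
    """Validate one record and return (series_code, period, value).

    Assumed layout (not documented in the source): three whitespace-separated
    fields, with a YYYYMM period and a decimal comma in the value.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    code, period, raw_value = fields
    if len(period) != 6 or not period.isdigit() or not 1 <= int(period[4:]) <= 12:
        raise ValueError(f"bad period: {period!r}")
    try:
        value = Decimal(raw_value.replace(",", "."))
    except InvalidOperation:
        raise ValueError(f"bad value: {raw_value!r}") from None
    return code, period, value


def check_file(lines):
    """Split the input into valid records and (line_number, message) errors."""
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(check_record(line))
        except ValueError as exc:
            errors.append((number, str(exc)))
    return records, errors
```

A “checked file” in this sketch would simply be the list of valid records; the error list tells the sender which lines to correct.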
This “checked file” is sent to a central unit, which loads it into the “time series database” through the Loader
module.
Besides loading data into the database, the Loader module prepares the “Dissemination fixed-format record files”
(one for each sub-domain involved in the update).
The following is an example of a dissemination fixed-format record file for the Production Price:
          0040   0050   0060   C      CA
1991/1    80,2   77,5   80,7   82,6   90,2
1991/2    80,2   77,8   80,9   82,9   90,5
……        ……     ……     ……     ……     ……
1991/8    80,1   79     82,2   83     86,8
The “Dissemination fixed-format record files” were a specific requirement of some users who need to load
the updated data of the Istat time-series database into their own databases.
2.1.2 ConIstat Web Navigator software module
This module is a web application that lets users search for data using complex queries. A user
can work simultaneously with time series from different statistical domains and with different frequencies.
The following example shows an extraction of a monthly and a quarterly time series from different statistical
domains:
eif13.C     Industry; turnover; national turnover index; Mining and quarrying
kflfr.IFOL  Continuous Labour Force Survey; Labour Force; by Geographical Area and Sex; Italy Female
Period    eif13.C
2006/1    284,1
2006/2    85,7
2006/3    331,8
2006/4    282,9
2006/5    235,4
2006/6    181,0
2006/7    177,6
2006/8    150,9
2006/9    170,3
2006/10   196,5
2006/11   251,3
2006/12   237,1
2007/1    277,0
kflfr.IFOL (quarterly values, in order): 9.923  9.962  9.795  10.006
2.2. Changes and new modules
As decided in the design phase, some of the existing modules were extended, others were created,
and the existing data were mapped to the DSDs defined by Eurostat.
2.2.1 Database changes
The need to describe the time series in the ConIstat database using the DSDs defined by Eurostat
required some changes to the database schema.
The ConIstat database schema consists of the following main tables:
• METADATA, which stores a set of structural metadata;
• DATA, which stores the observations for each time series;
• a group of lookup tables, which store code lists.
In the METADATA table each row represents a time series and each column represents a statistical concept
(dimension or attribute).
In particular, the fields Domain and SubDomain play the same role as the second and third levels
of the “statistical subject-matter domain” list described in the “SDMX content-oriented guide”. The
fields Category, Type, Vs, ClassificationCode, Freq, and Um refer to the statistical concepts
(dimensions and attributes) that distinguish one time series from another.
[Diagram: simplified ConIstat database schema, with primary keys marked.]
Lookup tables: DOMAIN, SUBDOMAIN, CATEGORY, CODE_LIST
METADATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Freq, Um, Start-Time_Period, End-Time_Period
DATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Time_Period, Value
The following changes were made to the ConIstat database schema:
• two tables were created: STS_METADATA, used to describe STS indicators, and
ESA_METADATA, used to describe ESA1 and ESA2 indicators.
These tables inherit their base structure from METADATA. A group of columns is then added to
each table, whose number depends on the statistical concepts used to define the DSDs.
In the future it will be necessary to add other tables to store time series from other domains;
• some lookup tables were created: CONCEPTS, CODE_LIST and DATAFLOWS;
• a table named KEY_FAMILIES was created, which stores the features of each DSD. From this table
it is possible to find out which table stores the structural metadata for each domain.
[Diagram: schematic view of the database after the changes, with primary keys marked. Legend: already existing, after changes.]
Lookup tables: Old_Code_List, New_Code_List, Concepts, DataFlows
METADATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Freq, Um, Start-Time_Period, End-Time_Period
DATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Time_Period, Value
Columns added in STS_METADATA and ESA_METADATA (in the order shown in the diagram): fk_Dataflow, fk_Freq, fk_Adjustment, fk_sts_Indicator, fk_sts_Activity, fk_Base_Year, fk_Dataflow, fk_Time_Periodicity, fk_Prices, fk_Field1, fk_Field2, fk_Field3, fk_Field4, fk_Unit, Fk_Unit_Multiplier, fk_Adjustment_Type, fk_Adjustment_Method, fk_Unit, fk_Unit_Mult, fk_Decimals, fk_Collection, fk_Availability, fk_TimeFormat
KEY FAMILIES: KeyFamilyName, ConceptName, ConceptType, AttachmentLevel, FieldName, TableName, Fk_Dataflow
The following example shows a record in the METADATA table and the corresponding mapping
to a DSD:
Domain: e – Industry
SubDomain: if – Turnover
Category: 13 – National Turnover Index
ClassificationCode: 0090
Type: g – Neither seasonally nor working day adjusted
Um: pe – Index number (base 2000)
Mapping:
Dataflow: STSIND_TURN_M
Frequency: M
Adjustment: N
sts_Indicator: TOVD
sts_Activity: NS0090
sts_Base_Year: 2000
Time_Format: P1M
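As a rough illustration of this mapping, the sketch below translates a ConIstat record into DSD concept values through lookup tables. In the framework the mapping is stored in the STS_METADATA and ESA_METADATA tables; the dictionaries here merely stand in for them, contain only the values of the example above, and, like the names `ConIstatSeries` and `to_dsd_key`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConIstatSeries:
    """Identifying fields of one METADATA row (simplified)."""
    domain: str
    sub_domain: str
    category: str
    classification_code: str
    type_code: str
    um: str


# Hypothetical lookup tables, filled only with the values from the example.
DATAFLOW = {("e", "if", "13"): ("STSIND_TURN_M", "TOVD", "M")}
ADJUSTMENT = {"g": "N"}
ACTIVITY = {"0090": "NS0090"}
BASE_YEAR = {"pe": "2000"}
TIME_FORMAT = {"M": "P1M"}


def to_dsd_key(series):
    """Return the DSD concept values for a ConIstat series (KeyError if unmapped)."""
    dataflow, indicator, freq = DATAFLOW[
        (series.domain, series.sub_domain, series.category)
    ]
    return {
        "Dataflow": dataflow,
        "Frequency": freq,
        "Adjustment": ADJUSTMENT[series.type_code],
        "sts_Indicator": indicator,
        "sts_Activity": ACTIVITY[series.classification_code],
        "sts_Base_Year": BASE_YEAR[series.um],
        "Time_Format": TIME_FORMAT[freq],
    }
```

Calling `to_dsd_key(ConIstatSeries("e", "if", "13", "0090", "g", "pe"))` reproduces the mapping listed in the example.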
2.2.2 Extensions to the Check and Loader modules
Check and Loader modules were extended to include a new data reporting function and a new
dissemination function.
[Diagram: Check and Loader software modules after the extensions. Elements shown: fixed-format record file, GESMES formatted file, Check, intermediate fixed-format record file, Loader, time series database, dissemination fixed-format record file, SODI RSS, SDMX Query, SDMX Compact.]
The new data reporting function (in beta testing) accepts data organized in GESMES
format. It will make it possible to collect data that is not yet in the database but is already available in
the production Directorates.
The new dissemination functions were developed to satisfy the SODI requirements:
• publish an RSS file that announces when new data is loaded or updated and gives the
URL of the files containing the SDMX Queries that retrieve the new or updated
data;
• publish one or more SDMX Query file(s);
• publish one or more SDMX Compact file(s) (optional).
The SDMX Compact files are provided in case the response time for extracting an
entire dataflow online is too long. In that case, instead of querying the database, the web service
returns the already prepared SDMX Compact file directly. Consequently, it is not possible
to filter a specific extract: the entire dataset must be retrieved.
2.2.3 The web service module
The core of the SDMX Istat Framework is the “SDMX data web service” module. It supports
the Pull exchange method for requesting data.
The web service is located at the following URL: http://sodi.istat.it/sodiWS/service1.asmx.
Client software can request data from the “SDMX data web service” by sending an SDMX Query.
The client receives data in SDMX Compact format.
The “SDMX data web service” implements the functions described in the following diagram:
[Diagram: SDMX Query, XML validate, query parser, SQL query builder, Database, SDMX Compact builder, SDMX Compact.]
1. reads the SDMX Query stream;
2. verifies the XML format against an XML schema;
3. parses the SDMX Query and decomposes it into elementary parts;
4. constructs a SQL query with the same meaning as the SDMX Query;
5. executes the SQL query;
6. reads the response of the database and creates an SDMX Compact stream;
7. sends the SDMX Compact stream to the client.
At the moment the “SDMX data web service” accepts only SDMX Queries that contain just the
“DataWhere” section, and it supports queries on an entire Dataflow or on its subsets.
The following SDMX Query extracts the entire Dataflow “neither
seasonally nor working day adjusted – monthly – Industrial Production”:
<Query>
  <query:DataWhere>
    <query:And>
      <query:DataProvider>ISTAT</query:DataProvider>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Dimension name="FREQUENCY">M</query:Dimension>
      <query:Dimension name="INDICATOR">PROD</query:Dimension>
      <query:Dimension name="ADJUSTMENT">N</query:Dimension>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
    </query:And>
  </query:DataWhere>
</Query>
The following SDMX Query extracts a subset (only the NS0040,
N100CA and N11100 activities) of “neither seasonally nor working day adjusted, seasonally adjusted and
working day adjusted – monthly – Industrial Production”:
<Query>
  <query:DataWhere>
    <query:And>
      <query:DataProvider>ISTAT</query:DataProvider>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Dimension name="FREQUENCY">M</query:Dimension>
      <query:Dimension name="INDICATOR">PROD</query:Dimension>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
      <query:Or>
        <query:Dimension name="STS_ACTIVITY">NS0040</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N100CA</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N11100</query:Dimension>
      </query:Or>
    </query:And>
  </query:DataWhere>
</Query>
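To make the parsing and SQL-building steps more concrete, here is a minimal sketch of how a DataWhere section like the ones above could be turned into a parameterized SQL condition. It is not the framework's actual parser. The column names (taken directly from the Dimension names, plus DATAFLOW, DATAPROVIDER and TIME_PERIOD) are hypothetical, and because the examples in the text omit namespace declarations, the sample declares a placeholder namespace `urn:example:query`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tag names."""
    return tag.rsplit("}", 1)[-1]


def node_to_sql(node, params):
    """Translate one DataWhere node into SQL, appending bind values to params."""
    name = local_name(node.tag)
    if name in ("And", "Or"):
        parts = [node_to_sql(child, params) for child in node]
        return "(" + f" {name.upper()} ".join(parts) + ")"
    if name == "Time":
        bounds = {local_name(child.tag): child.text for child in node}
        params.extend([bounds["StartTime"], bounds["EndTime"]])
        return "(TIME_PERIOD >= ? AND TIME_PERIOD <= ?)"
    if name == "Dimension":
        column = node.get("name")
        # Column names cannot be bound as parameters, so validate them.
        if not column or not column.isidentifier():
            raise ValueError(f"unsafe dimension name: {column!r}")
        params.append(node.text)
        return f"{column} = ?"
    if name in ("Dataflow", "DataProvider"):
        params.append(node.text)
        return f"{name.upper()} = ?"
    raise ValueError(f"unsupported node: {name}")


def data_where_to_sql(xml_text):
    """Return (condition, params) for the DataWhere section of an SDMX Query."""
    root = ET.fromstring(xml_text)
    data_where = next(el for el in root.iter() if local_name(el.tag) == "DataWhere")
    params = []
    conditions = [node_to_sql(child, params) for child in data_where]
    return " AND ".join(conditions), params


# Placeholder namespace: the real SDMX namespace URIs are not given in the text.
SAMPLE = """<Query xmlns:query="urn:example:query">
  <query:DataWhere>
    <query:And>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
      <query:Or>
        <query:Dimension name="STS_ACTIVITY">NS0040</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N100CA</query:Dimension>
      </query:Or>
    </query:And>
  </query:DataWhere>
</Query>"""
```

Values are passed as bind parameters rather than concatenated into the SQL text, which keeps the generated query safe regardless of what the client sends.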
We are working on a new version of the “query parser” sub-module. It will be able to interpret an
SDMX Query that refers to different Dataflows or to subsets of different Dataflows.
During the design and prototyping phases we paid attention to the response time of the “SDMX
data web service” when a client submits a query involving a large amount of data.
The query algorithm was optimized, and we obtained a response time under 20 seconds in the following
test: we requested from the “SDMX data web service” an entire Dataflow containing 685 time series, each 120
months long, for a total of 82,200 observations.
2.2.4 SDMX Web Navigator
This module is a web application that acts as a client of the “SDMX data web service”. It
was originally created to test the web service through a graphical interface. We then
turned it into a full web navigator by adding data presentation functions.
With this module it is possible to:
• query the database using the DSDs as analysis dimensions;
• build SDMX Queries using a graphical interface;
• test SDMX Queries.
The following image shows the interface used to query the database:
[Image: query interface of the SDMX Web Navigator]
From this graphical interface a user can build an SDMX Query by choosing the Dataflow that contains the
data of interest and then setting filters on the Dimensions and the time period.
The user can then choose whether to display or save the SDMX Query, or to send it to the web service and
display or save the resulting SDMX Compact file.
The following image shows the “query tester”, a graphical user interface that lets users write an
SDMX Query by hand and send it to the web service:
[Image: query tester]
2.2.5 Manager and web navigator reference metadata module
The SODI project deals with reference metadata exchange as well as with data and structural metadata
exchange. For this purpose we are developing a software module (in beta version), as part of the
SDMX Istat Framework, that automates the production and dissemination of this type of
metadata.
The production Directorates send MS-Word files containing reference metadata to Eurostat and the IMF
independently, using templates.
The idea was not to change current working practices, but to provide software tools that help
improve them.
The following diagrams show the process of producing reference metadata before and after the
introduction of this module:
[Diagram: reference metadata production before using the Framework. Elements shown: production Directorates, Eurostat template, IMF template, HTML, Istat web site, Eurostat, IMF.]
[Diagram: reference metadata production after using the Framework. Elements shown: production Directorates, reference metadata manager, database, XML, XSLT, reference metadata web navigator, reference metadata web service, Word extractor, HTML, SDMX, Eurostat, IMF.]
During the analysis phase we compared the information requested by Eurostat and the IMF through their
“Word” templates. We compiled a combined list of the information required in both cases and
designed a database schema able to store all of it.
We then designed the following sub-modules:
• reference metadata manager (beta testing)
• reference metadata web navigator (beta testing)
• Word extractor (beta testing)
• SDMX reference metadata web service (under construction)
The person responsible for filling in the templates can now use the “reference metadata manager”
sub-module. Through a graphical web interface, this sub-module stores all the
necessary information regarding reference metadata for a particular dataset.
The information flow in this sub-module is organized by means of documents.
A document contains all the reference metadata for a particular dataset in a particular time period
and can be in one of the following states:
• on working – the document is not complete;
• posted – the document is complete and ready to be sent to Eurostat and/or the IMF.
For a dataset, it is possible to have different documents produced at different times (for example one for
each year or one for each quarter).
This sub-module allows the user to:
• start a new session from scratch or recall an already “posted” document;
• enter reference metadata by filling in the text fields linked to each level;
• interrupt the session and resume it at a later time;
• “post” (make definitive) the document.
The “Word extractor” sub-module exports the information stored in the database to
Word format files that conform to the templates. The production Directorates can therefore also use this
module to produce the files that they must send to Eurostat and the IMF.
4.1 A generalized SDMX Query parser module (beta version)
As specified in paragraph 2.2.3, the “SDMX data web service” accepts SDMX Queries that
contain only the DataWhere section and supports queries on an entire Dataflow or on its subsets.
This module still accepts only queries with a DataWhere section, but it also accepts queries
that span different Dataflows and subsets of different Dataflows. For example, it makes it possible to
extract the following time series at the same time:
• Industrial production – Seasonally adjusted and not working day adjusted – Consumer goods:
DataSetID="STSIND_PROD_M"
FREQ="M"
ADJUSTMENT="S"
STS_INDICATOR="PROD"
STS_ACTIVITY="NS0080"
STS_BASE_YEAR="2000"
and
• Turnover, domestic market (non-deflated) – Neither seasonally nor working day adjusted – Manufacture of rubber products:
DataSetID="STSIND_TURN_M"
FREQ="M"
ADJUSTMENT="N"
STS_INDICATOR="TOVD"
STS_ACTIVITY="N12510"
STS_BASE_YEAR="2000"
The DataWhere section can contain the following types of nodes:
• Time
• And
• Or
• Dimension
• Attribute
• Dataflow
The Dimension, Attribute and Dataflow nodes are “simple nodes”: they contain
only a value.
The Time, And and Or nodes are “complex nodes”: they can contain other
complex or simple nodes. In particular, a Time node can contain only two child nodes, “StartTime”
and “EndTime”, while And and Or nodes can contain all types of nodes.
The idea is to transform a complex SDMX Query into a simpler one by applying some rules of
Boolean logic:
• the following hierarchical expression:
Node1 And Node2 And (Node3 And Node4) And (Node5 And Node6)
can be transformed into:
Node1 And Node2 And Node3 And Node4 And Node5 And Node6
• the following hierarchical expression:
Node1 Or Node2 Or (Node3 Or Node4) Or (Node5 Or Node6)
can be transformed into:
Node1 Or Node2 Or Node3 Or Node4 Or Node5 Or Node6
• the following hierarchical expression:
Node1 And Node2 And [(Node3 Or Node4) And (Node5 Or Node6)]
can be transformed, using the Cartesian product, into:
(Node1 And Node2 And Node3 And Node5) Or (Node1 And Node2 And
Node4 And Node5) Or (Node1 And Node2 And Node3 And Node6) Or
(Node1 And Node2 And Node4 And Node6)
• the following hierarchical expression:
<Time>
  <StartTime>Date 1</StartTime>
  <EndTime>Date 2</EndTime>
</Time>
can be transformed into:
<And>
  <StartTime>Date 1</StartTime>
  <EndTime>Date 2</EndTime>
</And>
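The rewriting rules above amount to putting the expression into an "Or of Ands" form. The sketch below shows one way to do this on a small tree representation; it is an illustration of the Boolean rules, not the module's actual code, and the tuple-based node format and the name `to_dnf` are assumptions. A Time node is assumed to have already been rewritten as an And of its StartTime and EndTime leaves, as in the last rule.

```python
from itertools import product

# A node is either a leaf (a string) or a tuple (operator, children),
# where operator is "And" or "Or" and children is a list of nodes.


def to_dnf(node):
    """Return the expression as a list of conjunctions (lists of leaves).

    The outer list is an Or; each inner list is an And.
    """
    if isinstance(node, str):
        return [[node]]
    operator, children = node
    child_forms = [to_dnf(child) for child in children]
    if operator == "Or":
        # Nested Ors flatten: concatenate the alternatives of every child.
        return [conjunction for form in child_forms for conjunction in form]
    if operator == "And":
        # And distributes over Or: take the Cartesian product of the
        # children's alternatives and merge each combination into one And.
        return [sum(combination, []) for combination in product(*child_forms)]
    raise ValueError(f"unknown operator: {operator!r}")
```

For the third rule, `to_dnf(("And", ["Node1", "Node2", ("And", [("Or", ["Node3", "Node4"]), ("Or", ["Node5", "Node6"])])]))` yields the same four conjunctions as the text, although not necessarily in the same order.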
The SDMX Query is processed in the following steps:
• the XML stream is rewritten using the Boolean logic rules, and an attribute that
acts as a “unique key” is added to each node. The purpose of the unique key is to establish the hierarchy
between “parent nodes” and “child nodes”;
• the XML stream is converted into an in-memory tabular data structure whose columns represent all types
of nodes (Time, Dimension, Attribute, Dataflow) except And and Or;
• the in-memory tabular data structure is reorganized so that all columns are in an And relation and
all rows are in an Or relation;
• the in-memory tabular structure is converted into a SQL query.
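The last step can be sketched as follows, assuming the tabular structure is a list of rows in which each row maps a column name to a value. The representation, the column names, the special StartTime/EndTime handling and the function name `rows_to_where` are all hypothetical; the source does not describe the module's internal data structure in more detail.

```python
# Hypothetical mapping of the time bounds to a column and a comparison operator.
OPERATORS = {
    "StartTime": ("TIME_PERIOD", ">="),
    "EndTime": ("TIME_PERIOD", "<="),
}


def rows_to_where(rows):
    """Build a parameterized WHERE condition from the in-memory table.

    Cells of the same row are joined with AND; rows are joined with OR.
    Returns (condition, params).
    """
    clauses, params = [], []
    for row in rows:
        terms = []
        for name, value in row.items():
            column, operator = OPERATORS.get(name, (name, "="))
            # Column names cannot be bound as parameters, so validate them.
            if not column.isidentifier():
                raise ValueError(f"unsafe column name: {column!r}")
            terms.append(f"{column} {operator} ?")
            params.append(value)
        if not terms:
            raise ValueError("empty row")
        clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses), params
```

With one row per requested series, for example one row for STSIND_PROD_M / NS0080 and one for STSIND_TURN_M / N12510, the result is a single condition that selects both series in one SQL query.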