
ISTAT SDMX Framework

Submitted by Francesco Rizzo, Laura Vignola, Dario Camol, Mauro Bianchi
National Institute of Statistics (ISTAT), Rome, Italy

1. The reason why Istat is starting to use SDMX
2. SDMX Istat framework
2.1 Short-term time series database (ConIstat)
2.1.1 Check and Loader software modules
2.1.2 ConIstat Web Navigator software module
2.2 Changes and new modules
2.2.1 Database changes
2.2.2 Extensions to the Check and Loader modules
2.2.3 The web service module
2.2.4 SDMX Web Navigator
2.2.5 Manager and web navigator reference metadata module
4.1 A generalized SDMX Query parser module (beta version)

1. The reason why Istat is starting to use SDMX

Istat is following the evolution of the SDMX initiative with interest, along two lines:
1. tactically, through participation in the EUROSTAT SODI initiative;
2. strategically, through a working group whose main aim is to analyze and verify the use of SDMX in the internal architecture of the Istat Information System.

The current Istat Information System is supported by a distributed architecture. Several production Directorates operate through local sub-systems that independently cover the full life cycle of statistical data, from data collection to data dissemination. As a result, the data and metadata for dissemination are currently held in different databases maintained by different production Directorates. To access these sub-systems, users must use navigation interfaces that are completely different from each other. Unifying data search behind a single interface has become a major requirement. To achieve this objective, Istat is developing an INTEGRATED OUTPUT MANAGEMENT SYSTEM: an information system oriented towards integrating part of the life cycle of the Institute's statistical data, with particular emphasis on data dissemination.
The experience we are acquiring through participation in the SODI project will allow us to support Istat's strategic interest in SDMX. To facilitate this objective, we are developing a framework consisting of various compatible SDMX software modules. The framework can be used as a whole, from the reporting phase to the dissemination phase, or its modules can be used separately and integrated into an existing information system. In the future it might be possible to distribute the framework under a public licence. This document describes the architecture and the modules of the SDMX ISTAT Framework version 1.0.

2. SDMX Istat framework

Istat is taking part in the SODI project with an internal working group made up of one analyst and four programmers. The project is divided into the following phases:
• study of SDMX;
• reporting analysis: to find out where data is stored, particularly STS and ESA data. We discovered that most of this data is stored in the ConIstat database (short-term time-series database, visible at the URL http://con.istat.it);
• analysis and design: to describe which existing modules must be extended and which new modules must be built. In particular, the system must perform the following functions:
  o extend data reporting in order to integrate the existing database with other data required by the SODI project;
  o map the already existing data structures to the DSDs defined by Eurostat;
  o provide a web service capable of accepting an SDMX Query and responding with an SDMX Compact message;
  o provide an RSS file that announces when new data is loaded or updated and gives the URL of the files containing the SDMX Queries that locate the new or updated data;
  o provide a web client program that can connect to the web service. This client is able to query the database using the statistical concepts defined in the DSDs.
    As well as displaying the search results in XML and HTML format, it can display the SDMX Query that will be sent to the web service.

The following diagram shows the existing system (ConIstat) and the new modules added to satisfy the SODI project.

[Diagram. Components shown: reference metadata manager, reference metadata RSS, reference metadata web navigator, reference metadata SDMX Query, reference metadata web service, reference metadata database, Word extractor, SDMX web navigator (sodi.istat.it), STS and ESA mapping, SDMX data web service (sodi.istat.it/sodiWs/service1.asmx), time-series database, ConIstat web navigator (con.istat.it), RSS builder & query builder (sodi.istat.it/query/*.xml, sodi.istat.it/RSS/rss.xml), Check, Loader, and the input formats .ist, gesmes and .dat. Legend: ConIstat modules, new modules, modules not yet completed.]

2.1 Short-term time series database (ConIstat)

The Istat short-term time series database stores 16,000 time series. The following statistical subject-matter domain list shows how these time series are organized:
• Price
  o Consumer price
  o Producer price
• Services
  o Retail trade sales
  o Tourism
  o Other services
• Employment, wages and other labour indicators
  o Large firms labour indicators
  o Large firms labour indicators (seasonally adjusted data)
  o Large firms labour indicators (data adjusted for calendar effects)
  o Quarterly indicators on employment, earnings and social security contribution
• Construction
  o Production in construction
• Industry
  o Turnover
  o New orders
  o Industrial production
  o Orders stock
• Foreign trade
  o Imports
  o Balances
  o Exports
  o Exports in Italian regions
• Labour force
  o Labour force
  o Employment
  o Unemployment
  o Seasonally adjusted data
  o Rates
  o Non labour force
  o Population
• Quarterly National Accounts
  o Income statement of jobs and resources fob-fob
  o Final consumption of households
  o Costs and margins
  o Gross fixed capital formation
  o Imports and exports fob-fob
  o Labour units and income
  o Value added at basic price
  o Value added at produced price
  o Contribution to percent change in GDP

The ConIstat system consists of the time series database and a set of software tools that cover the whole data flow from reporting to dissemination.

2.1.1 Check and Loader software modules

Production Directorates deliver data updates in a fixed-format record file (.dat). The following example shows such a file for producer prices:

ppa19grCB      200611   119,9
ppa19grD       200611   114,8
ppa19grDA      200611   111,8
ppa19grDG      200611   114,5
ppa19grE       200611   149,3
ppa19grCB14    200611   119,9

[Diagram: Check and Loader software modules before the extension. Components shown: fixed-format record file, Check, intermediate fixed-format record file, Loader, time series database, dissemination fixed-format record file.]

The person responsible for sending data for a particular production Directorate uses the Check module to prepare a “checked file” free from certain types of error. The “.dat” files are generally prepared automatically by the production Directorate sub-systems, but occasionally they are prepared manually, so a check step is necessary. The “checked file” is sent to a central unit, which loads it into the “time series database” through the Loader module. Besides loading data into the database, the Loader module prepares a “dissemination fixed-format record file” (one for each sub-domain involved in the update). The following example shows a dissemination fixed-format record file for producer prices:

        1991/1   1991/2   ……   1991/8
0040    80,2     80,2     ……   80,1
0050    77,5     77,8     …….  79
0060    80,7     80,9     ……   82,2
C       82,6     82,9     ……   83
CA      90,2     90,5     ……   86,8

The “dissemination fixed-format record files” were a specific requirement of some users who need to load the data of the Istat time-series database into their own databases.

2.1.2 ConIstat Web Navigator software module

This module is a web application that allows users to search for data using complex queries. A user can work simultaneously with time series from different statistical domains and with different frequencies.
The following example shows an extraction of a monthly and a quarterly time series coming from different statistical domains:

eif13.C       Industry; turnover; national turnover index; Mining and quarrying
kflfr.IFOL    Continuous Labour Force Survey; Labour Force; by Geographical Area and Sex; Italy Female

Period     eif13.C
2006/1     284,1
2006/2     85,7
2006/3     331,8
2006/4     282,9
2006/5     235,4
2006/6     181,0
2006/7     177,6
2006/8     150,9
2006/9     170,3
2006/10    196,5
2006/11    251,3
2006/12    237,1
2007/1     277,0

kflfr.IFOL (quarterly values, in order): 9.923, 9.962, 9.795, 10.006

2.2 Changes and new modules

As decided in the design phase, some of the existing modules were extended, others were newly created, and the already existing data was mapped to the DSDs defined by Eurostat.

2.2.1 Database changes

Describing the time series in the ConIstat database with the DSDs defined by Eurostat required some changes to the database schema. The ConIstat database schema consists of the following main tables:
• METADATA, which stores a set of structural metadata;
• DATA, which stores the observations for each time series;
• a group of lookup tables that store code lists.

In the METADATA table each row describes a time series and each column a statistical concept (dimension or attribute). In particular, the fields Domain and SubDomain play the same role as the second and third levels of the “statistical subject-matter domain” list described in the “SDMX content-oriented guide”. The fields Category, Type, Vs, ClassificationCode, Freq and Um refer to the statistical concepts (dimensions and attributes) that distinguish one time series from another.
The following simplified diagram shows the ConIstat database schema:

[Diagram: ConIstat database schema, with primary keys marked.
Lookup tables: DOMAIN, SUBDOMAIN, CATEGORY, CODE_LIST
METADATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Freq, Um, Start-Time_Period, End-Time_Period
DATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Time_Period, Value]

The changes made to the ConIstat database schema are:
• creating two tables: STS_METADATA, used to describe STS indicators, and ESA_METADATA, used to describe ESA1 and ESA2 indicators. These tables inherit their base structure from METADATA. Each table then adds a group of columns whose number depends on the statistical concepts used to define the DSDs. In the future it will be necessary to add further tables to store time series from other domains;
• creating some lookup tables: CONCEPTS, CODE_LIST and DATAFLOWS;
• creating a table named KEY_FAMILIES that stores the features of each DSD. From this table it is possible to find out which table stores the structural metadata for each domain.
The following schematic diagram shows the database after the changes:

[Diagram: database schema after the changes, with primary keys marked. Legend: already existing, after changes.
METADATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Freq, Um, Start-Time_Period, End-Time_Period
DATA: Domain, SubDomain, Category, Type, Vs, ClassificationCod, Time_Period, Value
Lookup tables: Old_Code_List, New_Code_List, Concepts, DataFlows
STS_METADATA and ESA_METADATA, with the foreign-key columns (in the order they appear): fk_Dataflow, fk_Freq, fk_Adjustment, fk_sts_Indicator, fk_sts_Activity, fk_Base_Year; fk_Dataflow, fk_Time_Periodicity, fk_Prices, fk_Field1, fk_Field2, fk_Field3, fk_Field4, fk_Unit, Fk_Unit_Multiplier, fk_Adjustment_Type, fk_Adjustment_Method; fk_Unit, fk_Unit_Mult, fk_Decimals, fk_Collection, fk_Availability, fk_TimeFormat
KEY_FAMILIES: KeyFamilyName, ConceptName, ConceptType, AttachmentLevel, FieldName, TableName, Fk_Dataflow]

The following example shows a record in the METADATA table and the corresponding mapping to a DSD:

Domain: e - Industry
SubDomain: if - Turnover
Category: 13 - National Turnover Index
ClassificationCode: 0090
Type: g - Neither seasonally nor working day adjusted
Um: pe - Index number (base 2000)

Mapping:
Dataflow: STSIND_TURN_M
Frequency: M
Adjustment: N
sts_Indicator: TOVD
sts_Activity: NS0090
sts_Base_Year: 2000
Time_Format: P1M

2.2.2 Extensions to the Check and Loader modules

The Check and Loader modules were extended with a new data reporting function and new dissemination functions.

[Diagram: Check and Loader software modules after the extensions. Components shown: fixed-format record file, GESMES formatted file, Check, intermediate fixed-format record file, Loader, time series database, dissemination fixed-format record file, SODI RSS, SDMX Query, SDMX Compact.]

The new data reporting function (in beta testing) collects data organized in GESMES format. It will make it possible to collect data that is not yet in the database but is already available in the production Directorates.
The new dissemination functions were developed to satisfy the SODI requirements:
• to publish an RSS file that announces when new data is loaded or updated and gives the URL of the files containing the SDMX Queries that locate the new or updated data;
• to publish one or more SDMX Query files;
• to publish one or more SDMX Compact files (optional).

The SDMX Compact files are provided in case the response time for extracting an entire dataflow online is too long. In that case, instead of querying the database, the web service returns the already prepared SDMX Compact file directly. It is then not possible to filter a specific data extract: the entire dataset must be extracted.

2.2.3 The web service module

The core of the SDMX Istat Framework is the “SDMX data web service” module. It supports the pull exchange method for requesting data. The web service is located at the following URL: http://sodi.istat.it/sodiWS/service1.asmx. Client software requests data from the “SDMX data web service” by sending an SDMX Query and receives the data in SDMX Compact format. The “SDMX data web service” implements the functions described in the following diagram:

[Diagram. Components shown: SDMX Query, XML validate, query parser, SQL query builder, database, SDMX Compact builder, SDMX Compact.]

The web service:
1. reads the SDMX Query stream;
2. validates the XML against an XML schema;
3. parses the SDMX Query and decomposes it into elementary parts;
4. constructs a SQL query with the same meaning as the SDMX Query;
5. executes the SQL query;
6. reads the response of the database and creates an SDMX Compact stream;
7. sends the SDMX Compact stream to the client.

At the moment the “SDMX data web service” accepts only SDMX Queries that contain just the “DataWhere” section, and it allows queries on an entire Dataflow or on its subsets.
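Steps 3 and 4 of the list above (parsing the SDMX Query and building an equivalent SQL query) can be sketched as follows for the single-Dataflow case. This is a minimal illustration and not the Istat implementation: the function names, the `COLUMNS` mapping from concept names to database columns, the placeholder namespace URI and the assumption that `Time_Period` values compare as strings are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from DSD concept names to database columns;
# the column names are borrowed from the schema diagram for illustration only.
COLUMNS = {
    "FREQUENCY": "fk_Freq",
    "INDICATOR": "fk_sts_Indicator",
    "ADJUSTMENT": "fk_Adjustment",
    "STS_ACTIVITY": "fk_sts_Activity",
}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def to_sql(node, params):
    """Translate one DataWhere node into a SQL fragment, collecting values in params."""
    name = local(node.tag)
    if name in ("And", "Or"):
        parts = [part for part in (to_sql(child, params) for child in node) if part]
        return "(" + f" {name.upper()} ".join(parts) + ")"
    if name == "Dimension":
        params.append(node.text.strip())
        return f"{COLUMNS[node.get('name')]} = ?"
    if name == "Dataflow":
        params.append(node.text.strip())
        return "fk_Dataflow = ?"
    if name == "Time":
        bounds = {local(child.tag): child.text.strip() for child in node}
        params.extend([bounds["StartTime"], bounds["EndTime"]])
        return "Time_Period BETWEEN ? AND ?"
    return ""  # DataProvider and any other node are ignored in this sketch


def build_where(query_xml):
    """Return (where_clause, params) for the DataWhere section of an SDMX Query."""
    root = ET.fromstring(query_xml)
    data_where = next(el for el in root.iter() if local(el.tag) == "DataWhere")
    params = []
    clauses = [to_sql(child, params) for child in data_where]
    return " AND ".join(c for c in clauses if c), params


# The namespace URI below is a placeholder, not the official SDMX one.
EXAMPLE = """
<Query xmlns:query="urn:example:sdmx-query">
  <query:DataWhere>
    <query:And>
      <query:DataProvider>ISTAT</query:DataProvider>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Dimension name="FREQUENCY">M</query:Dimension>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
      <query:Or>
        <query:Dimension name="STS_ACTIVITY">NS0040</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N100CA</query:Dimension>
      </query:Or>
    </query:And>
  </query:DataWhere>
</Query>
"""

where, params = build_where(EXAMPLE)
print(where)
print(params)
```

Values are passed as query parameters rather than pasted into the SQL text, so the content of a client's query cannot alter the statement itself.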
The following example shows an SDMX Query that extracts the entire Dataflow of “neither seasonally nor working day adjusted - monthly - Industrial Production”:

<Query>
  <query:DataWhere>
    <query:And>
      <query:DataProvider>ISTAT</query:DataProvider>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Dimension name="FREQUENCY">M</query:Dimension>
      <query:Dimension name="INDICATOR">PROD</query:Dimension>
      <query:Dimension name="ADJUSTMENT">N</query:Dimension>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
    </query:And>
  </query:DataWhere>
</Query>

The following example shows an SDMX Query that extracts a subset (only the NS0040, N100CA and N11100 activities) of “neither seasonally nor working day adjusted, seasonally adjusted and working day adjusted - monthly - Industrial Production”:

<Query>
  <query:DataWhere>
    <query:And>
      <query:DataProvider>ISTAT</query:DataProvider>
      <query:Dataflow>STSIND_PROD_M</query:Dataflow>
      <query:Dimension name="FREQUENCY">M</query:Dimension>
      <query:Dimension name="INDICATOR">PROD</query:Dimension>
      <query:Time>
        <query:StartTime>2006-01</query:StartTime>
        <query:EndTime>2007-01</query:EndTime>
      </query:Time>
      <query:Or>
        <query:Dimension name="STS_ACTIVITY">NS0040</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N100CA</query:Dimension>
        <query:Dimension name="STS_ACTIVITY">N11100</query:Dimension>
      </query:Or>
    </query:And>
  </query:DataWhere>
</Query>

We are working on a new version of the “query parser” sub-module. It will be able to interpret an SDMX Query that refers to different Dataflows or to subsets of different Dataflows.

During the design and prototyping phases we took into consideration the response time of the “SDMX data web service” when a client submits a query that involves a large amount of data.
The query algorithm was optimized, and we obtained a response time under 20 seconds in the following test: we requested from the “SDMX data web service” an entire Dataflow containing 685 time series, each 120 months long, for a total of 82,200 observations (685 × 120).

2.2.4 SDMX Web Navigator

This module is a web application that acts as a client of the “SDMX data web service”. It was originally created to test the web service through a graphical interface. We then turned it into a real web navigator by adding data presentation functions. In general, this module makes it possible to:
• query the database using the DSDs as analysis dimensions;
• build SDMX Queries using a graphical interface;
• test SDMX Queries.

The following image shows the interface used to query the database:

[Image: SDMX Web Navigator query interface]

From this graphical interface a user can build an SDMX Query by choosing the Dataflow that contains the data of interest and then setting filters on the Dimensions and the time period. The user can then choose either to view or save the SDMX Query, or to send the query to the web service and view or save the resulting SDMX Compact file. The following image shows the “query tester”, a graphical user interface that allows a user to write an SDMX Query by hand and send it to the web service:

[Image: query tester interface]

2.2.5 Manager and web navigator reference metadata module

The SODI project deals with reference metadata exchange as well as with data and structural metadata exchange. For this purpose we are developing a software module (in beta version), as part of the SDMX Istat Framework, that automates the production and dissemination of this type of metadata. The production Directorates independently send Eurostat and the IMF MS-Word files containing reference metadata, prepared from templates. The idea was not to change current working practices, but to provide software tools that help improve them.
The following diagrams show the process of producing reference metadata before and after the introduction of this module:

[Diagram: reference metadata production before using the Framework. Components shown: production Directorates, Eurostat template, IMF template, HTML, Istat web site, Eurostat, IMF.]

[Diagram: reference metadata production after using the Framework. Components shown: production Directorates, reference metadata manager, database, XML, XSLT, reference metadata web navigator, reference metadata web service, Word extractor, HTML, SDMX, Eurostat, IMF.]

During the analysis phase we compared the information requested by Eurostat and the IMF through their Word templates. We compiled a single list of information covering the requests of both, and we designed a database schema able to store all of it. We then designed the following sub-modules:
• reference metadata manager (beta testing)
• reference metadata web navigator (beta testing)
• Word extractor (beta testing)
• SDMX reference metadata web service (under construction)

The person responsible for filling in the templates can now use the “reference metadata manager” sub-module. Through a graphical web interface, this sub-module stores all the necessary reference metadata for a particular dataset. The information flow in this sub-module is organized by means of documents. A document contains all the reference metadata for a particular dataset in a particular time period and can be in one of the following states:
• on working: the document is not complete;
• posted: the document is complete and ready to be sent to Eurostat and/or the IMF.

For a dataset, it is possible to have different documents made at different times (for example one for each year or one for each quarter).
This sub-module allows the user to:
• start a new session from scratch or recall an already “posted” document;
• insert reference metadata by filling in the text fields linked to each level;
• interrupt the session and continue it at a later time;
• “post” the document (make it definitive).

The “Word extractor” sub-module exports the information stored in the database to Word files that conform to the templates. The production Directorates can therefore also use this module to produce the files that they must send to Eurostat and the IMF.

4.1 A generalized SDMX Query parser module (beta version)

As specified in paragraph 2.2.3, the “SDMX data web service” accepts only SDMX Queries that contain just the DataWhere section, and it permits queries on an entire Dataflow or on its subsets. The generalized module still accepts only queries with a DataWhere section, but it also accepts queries on different Dataflows and on subsets of different Dataflows. For example, it can extract the following time series in a single query:
• Industrial production - Seasonally adjusted and not working day adjusted - Consumer goods:
  DataSetID="STSIND_PROD_M" FREQ="M" ADJUSTMENT="S" STS_INDICATOR="PROD" STS_ACTIVITY="NS0080" STS_BASE_YEAR="2000"
and
• Turnover, domestic market (non-deflated) - Neither seasonally nor working day adjusted - Manufacture of rubber products:
  DataSetID="STSIND_TURN_M" FREQ="M" ADJUSTMENT="N" STS_INDICATOR="TOVD" STS_ACTIVITY="N12510" STS_BASE_YEAR="2000"

The DataWhere section can contain the following types of nodes:
• Time
• And
• Or
• Dimension
• Attribute
• Dataflow

The nodes Dimension, Attribute and Dataflow are “simple nodes”: they contain only a value. The nodes Time, And and Or are “complex nodes”: they can contain other complex or simple nodes.
In particular, a Time node can contain only two child nodes, “StartTime” and “EndTime”, while And and Or nodes can contain all types of nodes.

The idea is to transform a complex SDMX Query into a simpler one by applying some rules of Boolean logic:
• the hierarchical expression
    Node1 And Node2 And (Node3 And Node4) And (Node5 And Node6)
  can be transformed into
    Node1 And Node2 And Node3 And Node4 And Node5 And Node6
• the hierarchical expression
    Node1 Or Node2 Or (Node3 Or Node4) Or (Node5 Or Node6)
  can be transformed into
    Node1 Or Node2 Or Node3 Or Node4 Or Node5 Or Node6
• the hierarchical expression
    Node1 And Node2 And [(Node3 Or Node4) And (Node5 Or Node6)]
  can be transformed, using the Cartesian product, into
    (Node1 And Node2 And Node3 And Node5) Or
    (Node1 And Node2 And Node4 And Node5) Or
    (Node1 And Node2 And Node3 And Node6) Or
    (Node1 And Node2 And Node4 And Node6)
• the hierarchical expression
    <Time>
      <StartTime>Date 1</StartTime>
      <EndTime>Date 2</EndTime>
    </Time>
  can be transformed into
    <And>
      <StartTime>Date 1</StartTime>
      <EndTime>Date 2</EndTime>
    </And>

The module processes the SDMX Query in the following steps:
• it rewrites the XML stream using the Boolean logic rules and adds to each node an attribute that acts as a “unique key”. The purpose of the unique key is to record the hierarchy between “parent nodes” and “child nodes”;
• it converts the XML stream into an in-memory tabular data structure whose columns represent all types of nodes (Time, Dimension, Attribute, Dataflow) except And and Or;
• it re-organizes the in-memory tabular data structure so that all columns are in an And relation and all rows are in an Or relation;
• it converts the in-memory tabular structure into a SQL query.
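The rewriting rules and processing steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's actual code: the query is represented as nested tuples instead of an XML stream, the “unique key” bookkeeping is omitted, and the function names and the SQL column names (`Dataflow`, `Time_Period`, and one column per concept name) are hypothetical. The result of `to_dnf` is a list of rows in which all rows are in an Or relation and all entries within a row are in an And relation.

```python
from itertools import product


def to_dnf(node):
    """Flatten a nested And/Or/Time expression into OR-ed rows of AND-ed leaves."""
    kind = node[0]
    if kind == "Or":
        # Or: the rows of all children are simply collected together.
        return [row for child in node[1] for row in to_dnf(child)]
    if kind == "And":
        # And: Cartesian product of the children's rows, each combination merged.
        child_rows = [to_dnf(child) for child in node[1]]
        return [sum(combo, []) for combo in product(*child_rows)]
    if kind == "Time":
        # Time(start, end) behaves like StartTime And EndTime.
        return [[("StartTime", node[1]), ("EndTime", node[2])]]
    return [[node]]  # simple node: Dimension, Attribute or Dataflow


def rows_to_where(rows):
    """Convert the rows into a parameterized SQL WHERE clause (hypothetical columns)."""
    groups, params = [], []
    for row in rows:
        terms = []
        for leaf in row:
            if leaf[0] == "StartTime":
                terms.append("Time_Period >= ?")
            elif leaf[0] == "EndTime":
                terms.append("Time_Period <= ?")
            elif leaf[0] == "Dataflow":
                terms.append("Dataflow = ?")
            else:  # ("Dimension" or "Attribute", concept name, value)
                terms.append(f"{leaf[1]} = ?")
            params.append(leaf[-1])
        groups.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(groups), params


def dim(name, value):
    return ("Dimension", name, value)


# Same shape as: Node1 And Node2 And [(Node3 Or Node4) And (Node5 Or Node6)]
query = ("And", [
    ("Dataflow", "STSIND_PROD_M"),
    dim("FREQ", "M"),
    ("And", [
        ("Or", [dim("ADJUSTMENT", "N"), dim("ADJUSTMENT", "S")]),
        ("Or", [dim("STS_ACTIVITY", "NS0040"), dim("STS_ACTIVITY", "NS0080")]),
    ]),
])

rows = to_dnf(query)
sql, params = rows_to_where(rows)
print(len(rows))
print(sql)
```

The example query yields four rows, one for each combination of the two ADJUSTMENT values and the two STS_ACTIVITY values. The combinations are the same as those produced by the Cartesian product rule in the text, although they may be listed in a different order.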
