Not our data, but we use it in research
Wietse Dol, Wageningen Economic Research ([email protected])
6 February 2017: 10.45 - 11.30, Forum C0106
Wietse Dol
 PhD in Econometrics
 10 years at the University of Groningen (econometrics, sampling theory)
 24 years at Wageningen Economic Research (LEI), in many different departments
 Data and models, i.e. use/reuse and quality: troubleshooter + statistical methods + ICT + user interfacing
 Not an IT specialist but a researcher (I build tools because I use them myself and like to share them with others)
 Many model projects and user interfaces for models (not only WUR/LEI)
 Since 2006: data and data quality ≡ MetaBase
Wageningen Economic Research
 Part of Wageningen University & Research center (WUR)
 Part of the Social Sciences Group within WUR
 We are the research part of WUR/SSG in The Hague (advising the Ministry of Economic Affairs)
 Consultancy (applied research): ministries, EU, local government, industry, …
 Collecting data (farm data: FADN), building models, and providing agricultural content specialists
University vs. Research center
 University: teaching, publications, new theory and technology
 Research center:
● applied work/consultancy
● reusing things from the past (e.g. yearly publications)
● sharing knowledge (how to become a content specialist)/teaching for small groups
● working in (inter)national groups with many different disciplines
Research centers have experience in data management.
Primary vs. Secondary research data
Research data: collected, observed, or created for the purpose of analysis, to produce and validate original research results.
 Primary data: data you collect yourself, targeted to answer/validate your questions.
 Secondary data: not yours, e.g. from a website, FAO,...
• There is a growing need for secondary data (primary data is expensive and takes a lot of time to collect).
• The quality of the data matters.
• Meta-information, checking, and versioning are crucial.
Production data

Product   Country   Year   Production
Tomato    NL        2005   325
Wheat     BE        1999   100
Sugar     FR        2003   450
Meta-information: source, version, dimension, definitions, etc. Without proper meta-information you use the wrong data:
 Is FR with or without the DOM (overseas departments)?
 Is the production in tons or in euros?
 Does the year start on 1 January and end on 31 December?
 What is the definition of Tomato?
 Who owns the data? Which version is it? Under what conditions may it be used? (A sketch of recording this follows below.)
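A minimal sketch in R of recording such meta-information so that it travels with the data; the values and field names below are invented for illustration, not taken from MetaBase:

  # Attach meta-information to the production table as attributes.
  production <- data.frame(
    Product    = c("Tomato", "Wheat", "Sugar"),
    Country    = c("NL", "BE", "FR"),
    Year       = c(2005, 1999, 2003),
    Production = c(325, 100, 450)
  )
  attr(production, "source")  <- "Eurostat (invented example)"
  attr(production, "version") <- "2017-02-06"
  attr(production, "unit")    <- "1000 tonnes"  # tons vs. euros made explicit
  attr(production, "notes")   <- "FR excludes DOM; year runs 1 Jan - 31 Dec"

  # Anyone reusing the data can inspect the meta-information:
  attributes(production)[c("source", "version", "unit", "notes")]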
Lifecycle Model of data
http://www.dcc.ac.uk/resources/curation-lifecycle-model
Data
 Use data
 How to get the data, filter it, and store it
 Inspection and quality checks on the data
 How to make it available for others
 What scientific actions are done on the data
 Curate, preserve, version, … (the Lifecycle Model; a sketch of these steps follows below)
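A minimal sketch of these steps in R; the file names and checks are assumptions for illustration, not the MetaBase implementation:

  # Get -> filter -> check -> store (illustrative names throughout).
  raw <- read.csv("production_raw.csv")            # get the data

  dat <- subset(raw, Year >= 2000)                 # filter it

  stopifnot(!any(is.na(dat$Production)),           # quality checks:
            all(dat$Production >= 0))              # no gaps, no negatives

  write.csv(dat, "production_v1.csv", row.names = FALSE)  # store a version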
Don’t do it alone, do it as a GROUP and
communicate
Types of databases according to MetaBase
 Statistical database
 Scientific database
 Meta-database
Statistical database: secondary data
Data(bases) provided by international organizations like the EU, FAO, OECD, and the World Bank are in general statistical databases:
● Good web interfaces for downloading data
● Data are stored as they are received
● Data are consistent within their own domain
● No aggregations are made when underlying data are missing
● Little attention is paid to data checking
● No versioning system (yet the data change)
Scientific versus Statistical database
 Problems with a statistical database:
● Different definitions of territories and commodities
● Typing errors
● Missing data
● Breaks in series/new definitions
 Scientific database:
● The problems above are solved
● Transparency (original data sources and underlying assumptions are kept)
● Versioning of the data
● Essential for modeling and research
Structural design of a scientific database
 Key words for the structural design (HarDFACTS project, IPTS 2007, carried out by vTI/LEI):
● Transparent
● Harmonised
● Complete
● Consistent
HarDFACTS: Harmonised Database for Agricultural Commodity Time Series
=> The amount of effort/cost scares institutes, but it is often a “hidden” cost.
Transparent
 Original data from the statistical database are stored
 Complete and consistent data are stored
 Original and completed data can be compared
 Calculation procedures are stored and can be repeated (scripting language)
Harmonised
The definition used here is to bring the different international databases together in one framework and to link the data through a unique coding system (keywords are classifications, tree structures, and super-classifications).
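A hedged sketch in R of such a coding link; the codes below are invented, not the MetaBase classification:

  # Link two source classifications to one unique code via a concordance table.
  concordance <- data.frame(
    source      = c("FAO", "Eurostat"),
    source_code = c("0544", "T1000"),
    unified     = c("TOMATO", "TOMATO")
  )

  fao_data <- data.frame(source_code = "0544", value = 325)
  merge(fao_data, concordance[concordance$source == "FAO", ],
        by = "source_code")  # attaches the harmonised code "TOMATO"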
Complete
The definition used in MetaBase is that an econometric procedure is proposed to complete the new (time) series in the database (especially needed for models).
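The slides do not specify the econometric procedure; as a minimal stand-in, linear interpolation in R shows the idea of completing a series (numbers invented):

  # Complete a time series with gaps; approx() does linear interpolation here,
  # standing in for whatever econometric procedure MetaBase proposes.
  years <- 2000:2005
  prod  <- c(100, 105, NA, 118, NA, 130)   # two missing observations

  filled <- approx(x = years[!is.na(prod)],
                   y = prod[!is.na(prod)],
                   xout = years)$y
  filled   # 100.0 105.0 111.5 118.0 124.0 130.0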
Consistent
The definition used here is that the interrelationships of the data in the database hold over classifications (time, territories, and variables).
Example: the sum of all areas used for different crops should be <= the total area.
Indicators: use two or more datasets to calculate a new one that can be compared.
Consistency example (Eurostat data): we have two datasets:
1. Export volume (tons) of slaughter cows
2. Number of exported slaughter cows
What can we expect from 1 divided by 2?
Export volume (tons) of slaughter cows / number of exported slaughter cows = average weight of an exported slaughter cow.
This should be reasonably similar across countries…
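A worked version of that check in R; the figures are invented, not actual Eurostat data:

  # Indicator check: average weight per exported slaughter cow, per country.
  export <- data.frame(
    country = c("NL", "DE", "FR"),
    tons    = c(3000, 5200, 45000),   # 1. export volume (tons)
    heads   = c(5000, 8000, 9000)     # 2. number of exported cows
  )
  export$avg_weight_kg <- export$tons * 1000 / export$heads
  export
  # NL: 600 kg and DE: 650 kg are plausible; FR: 5000 kg per cow is not,
  # so the FR figures need checking (unit error, typing error, ...).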
Versioning of your research
The main reason for versioning: reproducibility.
 The software you use changes: software versions
 The data change/are updated/corrected: data versions
 You discover errors in your research process or improve the procedure: model versions
Best advice: do not use a spreadsheet but a system with a scripting language (SQL, R, GAMS, …), and store the data in a database (with a good data model). This documents how the original data were transformed into the data of your research.
Store data and scripts in a version control system, such as SVN (e.g. TortoiseSVN, http://tortoisesvn.net/) or Git (e.g. GitHub, https://github.com/).
Do it as a group and (re)use others’ results.
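A sketch of that scripted approach in R (paths and names are assumptions). The script itself, kept in version control next to the data, documents the transformation:

  # transform_production.R -- rerunning this reproduces the research data
  # from the original download.
  raw <- read.csv("data/original/production_2017-02-06.csv")

  clean <- subset(raw, !is.na(Production))      # drop missing observations
  clean$Production <- clean$Production / 1000   # tonnes -> 1000 tonnes

  write.csv(clean, "data/derived/production_clean.csv", row.names = FALSE)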
Versioning 2
 Try to separate the Model (script) from the Data
 Make generic scripts when possible (re-use)
 Store scripts and data in separate SVN repositories
 Add meta-information to your data as well as to your scripts, e.g. register the versions of the software you use (see the sketch after this list)
 Test whether your data and code also run on other computers
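In R, registering the software versions can be as simple as saving sessionInfo() next to the results (a minimal sketch):

  # Store the R version and loaded packages with every data version,
  # so a run can be reproduced later or on another computer.
  writeLines(capture.output(sessionInfo()), "session_info.txt")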
Example: Outlier testing in MetaBase
(Chart: land under permanent crops in Spain, Eurostat data.)
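The actual MetaBase test is not shown in the slides; as a hedged stand-in, a simple median/MAD rule in R illustrates the idea on a series like this one (values invented):

  # Flag values more than 3 robust standard deviations from the median.
  area <- c(4.81, 4.79, 4.83, 4.80, 9.99, 4.82, 4.78)  # invented series

  z <- abs(area - median(area)) / mad(area)
  area[z > 3]   # 9.99 is flagged for inspection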
Versioning 3
 Versioning looks time-consuming, but when you make mistakes it is easy to go back to an old situation. It is also a good first step towards sharing data, and it works very well in groups.
 It is easy to see the differences between versions.
 Versioning makes it possible to reproduce research, even five years later.
 Frequency of versioning: some make a version every day. Practical advice: make a version when you have a publication.
MetaBase: data management for data
MetaBase:
1. Many different data sources (e.g. FAO, Eurostat), all in the same user interface (SDMX, NetCDF)
2. Find data alternatives using meta-information
3. Search the data content (e.g. oilseed)
4. All content is easily available in research software
5. Recodings, aggregations, and concordances are all implemented in GAMS
6. Statistical methods in GAMS and R
7. Versioning of Eurostat, FAO, CBS, and World Bank data (every day)
8. Example: http://www.agrimatie.nl/
Population size (and prediction)
 FAO has a nice dataset on Population sizes
Trends
 Power Pivot/Power Query/Power View in Excel & SharePoint (MS data reporting and analysis tools)
 MS SQL Server and R
 R and Shiny: building user interfaces and doing stats/graphs (see the sketch after this list)
 MS Power BI (get data from anywhere, add relations, and make dashboards)
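A minimal sketch of the R-and-Shiny route mentioned above; the dataset and labels are invented:

  # Minimal Shiny app: pick a country, plot its (invented) production series.
  library(shiny)

  dat <- data.frame(
    country = rep(c("NL", "BE"), each = 3),
    year    = rep(2001:2003, 2),
    value   = c(10, 12, 11, 7, 8, 9)
  )

  ui <- fluidPage(
    selectInput("country", "Country", unique(dat$country)),
    plotOutput("series")
  )

  server <- function(input, output) {
    output$series <- renderPlot({
      d <- dat[dat$country == input$country, ]
      plot(d$year, d$value, type = "b", xlab = "Year", ylab = "Production")
    })
  }

  shinyApp(ui, server)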
Always play with your data, and communicate and share data knowledge.
Wishes, problems, requests: [email protected]