Data Structures for Time Series from Ocean Moorings:
A Proposal for NVODS Aggregation Servers
Version 1.0, February 2004
1. Introduction
Data from moored oceanographic instrumentation has been made available for access
over the Internet by a small number of providers using OPeNDAP (DODS) servers. This
data usually consists of time series of such quantities as current velocity, water
temperature and salinity, bottom pressure, travel time from inverted echo sounders, and
could include surface meteorological data. Future developments in instrumentation may
allow some of the chemical and biological properties of the ocean to be measured by
unattended moorings deployed for periods of order months. Examples might be oxygen,
nutrients, chlorophyll-A, and transmissivity. One of the goals of NVODS is for the user
to search multiple OPeNDAP sites in order to retrieve data specific to a given problem or
analysis. This means that the user would be unaware of the locations of the data and the
specifics of its organization and formatting.
In keeping with NVODS philosophy, individual providers select data structures, variable
and metadata naming conventions according to their local requirements and make them
available over the web. The OPeNDAP protocols map these structures to NVODS data
structures (sequence, grid, nested structures), but keep the variable and metadata names
of the original data files. The majority of ocean time series providers use netcdf
(http://www.unidata.ucar.edu/packages/netcdf/) files to store their data and use DODS
netcdf servers to make the files available. Netcdf is designed to store arrays (termed
Grids in NVODS), variables and their associated attributes (e.g. units, long_name, etc.) in
self-describing structures. It can accommodate multiple data arrays, but they usually
have the same dimensions. Netcdf does not easily accommodate sequences of complex
data structures. For example, elements of netcdf arrays are of one type (float, double,
integer, etc.) and, if they hold data, have the same units. A programming analogy is that
netcdf is closer to a language that primarily manipulates arrays, like Fortran, than one that
uses structures extensively, such as C. Because netcdf is extensively used for storing
time series data, there are a number of efforts to standardize names for dimensions and
attributes (e.g. COARDS; NetCDF Climate and Forecast (CF) Metadata Conventions:
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/), and suggest required names and units
for metadata associated with ocean datasets (e.g. MBARI, Coriolis Project). Netcdf files
are the basis of, or can be input into, some widely used analysis packages (e.g. PMEL’s
EPIC (http://www.epic.noaa.gov/epic/), MATLAB).
Despite these standardization efforts, a user attempting to use time series datasets from
multiple NVODS sites for input into an analysis would have two main difficulties:
1) Finding the locations (URL’s of the files) of the data would be time consuming
and difficult because of the lack of easily searchable databases of site contents
(metadata).
2) Dimension, variable, attribute and data structure conventions would not be
consistent across sites even when filtered through OPeNDAP. At present, the user
needs to adapt his or her API for each dataset retrieved.
These two points are related because consistent names and databases of metadata will
make searching for datasets across multiple sites easier, as well as simplifying their use.
This report is concerned mostly with proposing a structure for aggregating time series
datasets, obtained by an aggregation server from individual NVODS sites, which would
be returned to the user. Searching for, and requesting, datasets are closely related
activities, and it is suggested that constructing a metadata database for locating datasets,
consistent with the aggregation data structure and metadata returned to the user, is
an important consideration. Such a metadata database would probably be implemented
as relational database tables because of their widespread use for organizing large
amounts of complex information, and the availability of software to do searches over the
web. A data structure is independent of its naming conventions; however, because
standard names for attributes, as well as decisions about which attributes should be present in
a dataset, are important to the user, this report makes some suggestions on these
topics.
The author of this report initially developed a small number of strawman proposals which
were submitted to an ad-hoc group (see Appendix A), mainly drawn from people directly
involved in distributing ocean data to users. A two-day workshop was held in
Charleston, SC (January 13 and 14, 2004), hosted by NOAA’s Coastal Services Center,
in which the proposals and related issues were discussed and modified. This report
reflects the conclusions of the workshop and has been reviewed by workshop members.
The author, however, is responsible for final content of the report.
The goals of the workshop, established by the participants, were:
1) The data model will be used by an aggregation server to format output for use by
a technical person or scientist to produce products.
2) The data model will express all the relevant information regarding a time series.
3) The aggregation server will convert provided data to the data model. Valid time
series data values will not be modified except for scaling.
4) Try to define entities and terminology for single and multiple time series.
Thus, the primary focus is on the user’s requirements, which can be different from those
of the data provider. The latter is often concerned with issues such as instrument
performance, QA/QC, calibration equations, details of mooring designs, data
management of real-time measurements, etc. These may be of peripheral concern to the
analyst-user who would often like to easily determine if the retrieved time series could be
input into a complex multi time-series analysis method. Thus, knowing that a time series
is “clean” (i.e. equally spaced, no gaps, faulty values removed and interpolated, etc.) is
often more useful than details of the processing required to get from the original data
extracted from the moored instrument to the “clean” series. It is expected that the user
can always return to the provider’s site if questions arise on the data processing or
instrument calibration, for example.
If the data structure proposal is implemented in ocean time series aggregation servers,
then the ideal outcome is that such servers become the first choice for searching and
retrieval of this type of data. Thus, standards imposed by aggregation data structures may
have a chance to become established and preferred, which will simplify life for the user. Except
for specialist use, the original providers become “hidden” from the user, but they gain in
that their data may be more widely used with little change in their present practices.
This report is organized as follows: The basic considerations that were used in
constructing the proposals are given in Chapter 2. The characteristics of ocean time
series data, restrictions on aggregated data, the present state of OPenDAP providers, and
naming conventions are discussed. In Chapter 3, the proposal is given and discussed,
including suggestions for metadata that should be present in the aggregation data
structure. Chapter 4 summarizes the report and makes recommendations.
2. Background
2.1. Ocean Mooring Based Measurements
The primary focus of this report is the organization of datasets that result from
deployments of moored instruments in the ocean. In this context, moorings are
considered to be any platform that has nominal fixed location (latitude, longitude) in the
ocean. Moorings can be configured with sub-surface or surface flotation (buoys), or be a
fixed platform on the bottom. Surface buoys may be equipped with meteorological
instrumentation (for wind velocity, air temperature, barometric pressure, etc.). Examples
of a conventional sub-surface taut-line deepwater mooring, and a set of three shelf
moorings (bottom tripod, thermistor chain with surface flotation, and a subsurface
mooring) designed to make measurements of the water column at one site, are given in
Figures 1 and 2, respectively. As indicated in these figures, moorings can carry a
variety of instruments, including some that acoustically profile through all or part
of the water column. Examples of the latter include the Acoustic Doppler Current
Profiler (ADCP) and the Inverted Echo Sounder (IES).
A mooring site may be occupied for periods of several weeks to many years. During this
time a mooring may be retrieved and redeployed (i.e. serviced) many times, and it is
often the case that a re-deployment is not at exactly the same position as the previous
deployment(s). This may cause the nominal depths of the instruments to change slightly.
Instrumentation may also change between deployments for various practical reasons such
as replacement of a failed sensor. Servicing of the moorings creates gaps in the
measurements at a site of anywhere between a few hours and a few days. Even though a
mooring is anchored to the bottom, the upper flotation can be displaced horizontally
(watch circle) and, for sub-surface moorings, vertically (draw down), by current flows
and winds. Therefore, location of an instrument (latitude, longitude, depth) can change
with time, though deviations from the nominal location are expected to be small. In some
cases these deviations are important to the analyst; however, it is more usual to consider
the nominal position (particularly depth) an adequate measure of the instrument’s
location. For example, a user will often accept a few meters of uncertainty in the depth of
an instrument, caused by changes in water depth between deployments, in return for a
longer time series to input into an analysis. Institutions handle the identification of
multiple deployments of moorings at a site in different ways. Some assign separate ID’s
for each deployment of the mooring (and instruments). Others assign ID’s to the site and
instrument locations and indicate the deployment through other attributes of the time
series. The latter makes the concatenation of time series from instruments, over a number
of deployments, more straightforward in that new ID’s do not need to be created for the
concatenated data.
Moored instruments are generally designed to operate unattended for long periods (~
months to years), and record their data internally. Some moorings have the capability of
transmitting the data from the instruments to shore in real-time using satellite, VHF radio
or telecommunications networks. The latter are the basis of the Coastal Ocean Observing
Programs around the country. The majority of instruments record data at equal time
increments (e.g. at 15-minute intervals). However, a few use variable or adaptive time
sampling schemes. For example, a wave-tide gauge may measure bottom pressure
averaged over a few minutes, every 30 minutes, and every three hours collect a burst of
rapidly sampled 1 Hz data. The former would be used for tidal analysis of sea level, and
the latter for surface waves. In an adaptive sampling scheme, the instrument may only
measure a parameter when certain thresholds are exceeded. For example, in a sediment
transport experiment, wave bottom pressure and currents may be only rapidly sampled
when the significant wave height exceeds a given value. Therefore, data model structures
must be able to accommodate data taken at varying as well as constant time intervals.
Profiling instruments, of which the most common are ADCP’s, measure parameters (e.g.
current velocity) at multiple depths remote from the device. Thus, the depth of the
instrument is not the depth of the measurements, but the parameters in each depth bin are
recorded at the same time intervals. ADCP’s can be directed up or down and thus data
from bins at increasing distance from the head can be at decreasing or increasing depths,
respectively. This can affect the arrangement of the 2-D depth-time array in the time
series files of ADCP data, and again, different institutions have different conventions.
Some arrange by bin number irrespective of the direction of the head, others arrange by
depth of measurement values; positive direction may be up or down.
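
As an illustration, the following sketch (in Python, with illustrative names; this
document does not prescribe an implementation) reconciles these conventions by sorting
bins so that depth increases along the array, whatever the head direction in the
original file:

def sort_bins_by_depth(depths, rows):
    # rows holds one time series per bin, aligned with the depths list
    order = sorted(range(len(depths)), key=lambda k: depths[k])
    return [depths[k] for k in order], [rows[k] for k in order]

# e.g. an upward-looking ADCP stored with the deepest bin first:
depths, rows = sort_bins_by_depth([55.0, 45.0, 35.0],
                                  [[0.10, 0.12], [0.20, 0.21], [0.30, 0.28]])
# depths is now [35.0, 45.0, 55.0], with rows reordered to match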
Most moored instrumentation measures a limited number of parameters. Examples
include temperature, conductivity, current velocity, pressure, meteorological variables,
and acoustic variables. In some cases derived variables are reported because they are
more useful. For example, a moored CTD instrument measures temperature and
conductivity, but usually the temperature and conductivity will be used to give salinity
and sigma-t. The travel times from an IES may be converted to dynamic height and
temperature/salinity profiles through the use of a complex model (GEM technology).
Time series data can also undergo various degrees of processing, from the values given
by the instrument to gap-filled, filtered or averaged forms. It is not uncommon for sites to
supply several versions of the data from a single instrument. For example, a time series
may be given as the original 10-minute data and in hour-averaged form, usually in
separate files.
Because multiple devices are deployed on a mooring, it cannot be assumed that all
instruments are recording at the same time intervals or are synchronized in time. The
instruments may not have been started at the same instant and it is possible that one
instrument may sample at 20-minute intervals and another at 15 or 30 minutes. The latter
may arise because of limitations in storage or battery capacity of a particular instrument.
2.2.
DODS Moored Data Sites
At present, the majority of sites that offer time series data through DODS servers use
netcdf files as their underlying storage format. Netcdf also seems to be the preferred
format for new initiatives involving moored data such as the Coriolis Project based at
IFREMER. Use of netcdf does not necessarily imply that this is the working format of
the group supplying the data. In some cases, netcdf is used as an exchange format
because of its common use and the development of some standards for dimensions,
variable names and attributes. The netcdf DODS server software is also straightforward
to implement. With the recent availability of the relational database DODS server, use
of relational tables to store data and/or metadata may become another approach to
making data available through OPeNDAP protocols. We are aware of one case where this
is being done (SCWRP), but this site is not accessible by the general public. Therefore,
the fundamental storage unit for time series at most sites is the array (grid in DODS
nomenclature) rather than nested structures of sequences. As part of this project, a survey
was made of the characteristics of moored time series data from sites listed in the DODS
catalog (http://www.unidata.ucar.edu/packages/DODS/). The results are presented in
Table 1. This table is not meant to be a comprehensive survey of all DODS moored data
sites, but rather an indication of the variety of content, conventions adopted, and formats
used by data providers, even though netcdf files are used by all for data storage.
The first and second columns identify the institution providing the data and the file
format used by the DODS servers. Note that three of these sites are associated with real-time Ocean Observing Systems (OOS) and the rest serve archive data. Filenames are
usually organized by instrument in that data in a particular file originated from a single
instrument’s data logger. There may be more than one type of data (e.g. current velocity
and temperature from a current meter) but the time (and depth if an ADCP) dimensions
are equal. Meteorological buoys usually keep data from all their sensors in a single file
since they usually have a common data logger. The Coriolis Project proposes to place all
instruments on a mooring in a single file organized by depth. Thus, variables are
arranged into 2-D (depth, time) arrays. This works best if the instruments on the mooring
are all similar (the Coriolis project is primarily concerned with moored CTD type
instruments), and data sampling across the instruments is synchronized.
Table 1: Characteristics of Moored Data Served by Institutions using NVODS
(December, 2003)

Part 1: Format, file organization, and time conventions

Institution | Format | Content Organized by | Time Format
USGS-WHOI | Netcdf | Instrument | EPIC time,time2
MBARI | Netcdf | Instrument | COARDS (seconds since 1970-01-01 00:00:00)
WHOI | Netcdf | Instrument | EPIC time,time2
TAMU | Netcdf (translation of ASCII files) | Instrument | yy,mm,dd,hr,min,sec arrays
SeaCOOS | Netcdf | Instrument/Buoy | COARDS (sec since 1995-01-01 00:00:00); Date_time string arrays referenced to COARDS time also provided
NC-COOS | Netcdf | Instrument | COARDS (days since 0000-1-1 00:00:00, the MATLAB datenum); yy,mm,dd,hr,min,sec arrays referenced to COARDS time also provided
GoMOOS | Netcdf | Instrument | COARDS (days since -4713-01-01 00:00:00, a conversion of EPIC time 0); mm,dd,hr,min,sec arrays referenced to COARDS time also provided
SAIC | Netcdf (translation of database files) | Instrument | COARDS (minutes since yyyy-mm-dd hh:mn:ss, where yyyy-mm-dd hh:mn:ss is the start time of the series)
NEFSC-NOAA | Netcdf | Instrument (max 2 variables/file) | EPIC time,time2
IFREMER-OTS (proposed) | Netcdf | Mooring | COARDS (days since 1950-01-01 00:00:00)
Part 2: Variable conventions and data types

Institution | Names | CF Standard Names | Units | Scalars | Vectors | ADCP’s | Met Buoys
USGS-WHOI | Some COARDS | No (EPIC codes) | COARDS | Yes | E,N,w | Depth array (deepest bin 1st) | No
MBARI | Not standard | No | COARDS | Yes | E,N | Depth array | No
WHOI | No | No | COARDS | Yes | E,N,w | Depth array | No
TAMU | COARDS | No | COARDS | Yes | Spd,Dir & E,N | No | No
SeaCOOS | Some COARDS | No | Some COARDS | No | No | No | Yes/NDBC
NC-COOS | Some COARDS | No | Some COARDS | Yes | E,N,w | Depth (z) array | Yes
GoMOOS | COARDS | Yes | COARDS | Yes | E,N & Spd,Dir | Depth array | Yes
SAIC | COARDS | No | COARDS | Yes | E,N & rotated orthogonal axes | Depth array (shallowest bin 1st) | NDBC & C-MAN
NEFSC-NOAA | No | No | COARDS | Yes | E,N | No | No
IFREMER-OTS (proposed) | No | No (EPIC codes) | COARDS | Yes | No | No | No
This organization into (depth, time) arrays is probably too restrictive if more general
time series were to be combined into a single file. It could result in non-uniform spacing
of data points in the common time dimension and, thus, an excessive number of missing
data flags.
There are basically two conventions for specifying the time of the data. The older is the
PMEL/EPIC (http://www.epic.noaa.gov/epic/) specification for time as two integers. The
first (time) is the astronomer’s true Julian Day referenced to 00 hours GMT, and the
second (time2) is the number of milliseconds from midnight (00 hours GMT). It is noted
that using two variables to specify time has some limitations in the DODS Grid format
for Arrays. Grid maps the array indices to the dimension values if the variable has the
same name as the dimension, as specified by COARDS. Thus, salinity(time) maps
time(time) but not time2(time), so not all the time information is included in the mapping.
The netcdf COARDS time convention uses a single variable, but the reference (t=0) date
and the units are arbitrary. Almost everybody conforms to the Unidata UDUNITS
specifications for dimension and variable units. The reference date conforms to FGDC
specifications for date strings and can contain time zone information. If no time zone is
given, GMT is assumed. Thus, a number of different reference dates and time units are
used across the COARDS sites. Some sites pick a fixed reference date (e.g. the year
-4713, which corresponds to true Julian Day zero), while others use the start of the time
series, which changes with the file. TAMU uses separate arrays (variables) for the year,
month, day, hour, minute and second, and this kind of specification is provided as
secondary information at some sites. Time conversions from one system to another will
be an important function for aggregation servers.
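
As an illustration of such conversions, the following Python sketch re-references
COARDS-style times and converts the EPIC (time, time2) pair to calendar dates. The
function names, the unit table, and the assumed Julian Day of the 1970 epoch are
illustrative and would need to be verified against each provider’s convention:

from datetime import datetime, timedelta

_SECONDS_PER_UNIT = {"sec": 1.0, "seconds": 1.0, "minutes": 60.0,
                     "hours": 3600.0, "days": 86400.0}

def coards_to_datetime(value, units):
    # units is e.g. "days since 1950-01-01 00:00:00" (GMT assumed if no zone given).
    # Note: references such as "days since 0000-1-1" (MATLAB datenum) or
    # "-4713-01-01" (Julian day zero) fall outside datetime's range and would
    # need special-case handling not shown here.
    unit, _, reference = units.partition(" since ")
    ref = datetime.strptime(reference.strip(), "%Y-%m-%d %H:%M:%S")
    return ref + timedelta(seconds=value * _SECONDS_PER_UNIT[unit.strip()])

def epic_to_datetime(time, time2, epoch_jd=2440588):
    # EPIC: time = true Julian Day referenced to 00 hours GMT, time2 = msec
    # from midnight; epoch_jd is the assumed integer Julian Day of
    # 1970-01-01 00:00 GMT.
    return datetime(1970, 1, 1) + timedelta(days=time - epoch_jd,
                                            milliseconds=time2)

# e.g. a SeaCOOS-style value, 3.0e6 seconds after the 1995 reference:
dt = coards_to_datetime(3.0e6, "sec since 1995-01-01 00:00:00")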
The columns under “Variable Conventions” attempt to survey which netcdf conventions
sites implement. Variable and dimension names are usually unique to a site’s files.
However, dimension (time, depth, latitude, longitude) and equivalent independent
variable names often conform to COARDS recommendations, and netcdf standard
attributes such as long_name, short_name, units, and _FillValue are nearly always
supplied for dependent variables. COARDS has been extended by the Climate and Forecast
(CF) metadata conventions (http://www.cgd.ucar.edu/cms/eaton/cf-metadata/), which are
designed to standardize netcdf files for the atmosphere, surface and ocean, with
model-generated data particularly in mind. CF does not place any restriction on variable names,
but rather standardizes and/or recommends use of attributes of the variables. It has some
useful recommendations for the specification of attributes for coordinate variables and data
flags. However, perhaps the most useful recommendation is that all variables be
identified by a standard_name attribute, which has a precise (string) value. An example
is “sea_water_temperature”. Note that standard_name does not override long_name,
which is still used to describe the variable.
There are some limitations to the standard names as defined by CF, and one view,
expressed at the workshop, was that they were not very ocean-measurement friendly. For
example, both “sea_water_temperature” and “sea_surface_temperature” are standard
names, whereas oceanographers usually consider “sea_surface_temperature” to be
“sea_water_temperature” at depth 0, and a file with temperatures at a number of depths,
including the surface, would not make a distinction. Similarly, the standard names
defined at present make no provision for current (or wind) vector components that are not on east
and north orthogonal axes. Ocean chemistry measurements are also not accommodated.
However, these limitations could probably be fixed by input to CF by the
ocean-measurement community. Since CF metadata standards are relatively recent and
evolving, most existing sites have not implemented these conventions.
In the PMEL/EPIC (http://www.epic.noaa.gov/epic/) data analysis system, variables are
assigned integer codes (e.g. depth has an epic_code of 3). A few sites provide these in
their netcdf files and this could be another way of precisely identifying the meaning of
variable names, even though they are not intuitive to the casual user.
Under the “Units” column, COARDS means that the site specifies units using the
conventions established by the Unidata UDUNITS software package. The units
attribute should be specified for all dimensional variables, as required by COARDS and
CF.
The types of moored data provided by DODS sites are scalars (e.g. temperature,
pressure, etc.), 2- and 3-D vectors (currents), profiles (ADCP currents), and
meteorological (winds, air temperature, humidity, etc.). Table 1 shows the data types
available by site. Current vectors may be in component form (East & North or axes
rotated to the direction of the isobaths), as speed and direction, or both. It is noted that
direction usually has a different meaning for currents than for winds, and thus vector
components may be less ambiguous. Some sites provide the vertical velocity component
(w) for ADCP data. All of the ADCP data are provided as (depth, time) arrays, and these
are the main occurrences of arrays with more than the single time dimension. Similar to
time, the depth coordinate variable has more than one convention. Using COARDS/CF,
the (required) attribute positive defines the direction of the depth axis, where 0 is the
ocean surface (or, more strictly, the datum for the bathymetry). Most oceanographic
providers use the value “down” for positive, and this is often implied if the attribute is not
given. Therefore, with this convention, the height of a wind sensor on a meteorological
buoy is negative. In another convention, the instrument depth is given as a positive
height above the seabed (e.g. for bottom tripod systems). An aggregation server must be
able to take these coordinate differences into account.
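
A minimal sketch of such a normalization, assuming the CF positive attribute and the
height-above-seabed convention described above (the function and argument names are
illustrative):

def depth_positive_down(values, positive="down", above_seabed=False,
                        water_depth=None):
    # returns depths in meters, positive downward from the surface datum
    if above_seabed:
        # values are instrument heights above the bottom (e.g. bottom tripods)
        if water_depth is None:
            raise ValueError("water_depth is required for height-above-seabed data")
        return [water_depth - v for v in values]
    if positive == "up":
        # e.g. a wind sensor at +4 m becomes -4 m in the positive-down convention
        return [-v for v in values]
    return list(values)  # already positive down, the common oceanographic default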
Where meteorological data are provided, some are from the institution’s own buoys, but
others provide data obtained from government sources (e.g. from NDBC buoys and
C-MAN stations). In a similar manner, but not noted in the table, a few sites also provide
sea-level data from NOS tide gauges. The availability of government data from
non-government sites points to particular time series (either in part or complete) being
available from more than one source. Even though the origins of the data are the same,
the time series data from multiple sites may differ in subtle ways. For example, an NDBC
wind record may have been gap-filled and filtered at one site, but not at another. This
would indicate that processing/QC flags and source information could be important
parameters of the selection mechanism for aggregation of datasets. This will be
discussed more in Section 3.2.
2.3. User Requirements
The development of aggregation servers along with search and retrieval software is
directed at the user. There will be different categories of users ranging from those
requiring simple displays of data on a map to those requiring inputs into complex
analyses. The point of view of this document is that if the data structures for aggregation
servers can fulfill the needs of the working scientist, then the needs of other users can be
accommodated. This does not mean that user interfaces may not differ for different user
groups, but that the information, returned as a part of the aggregation data structure,
should be complete enough for such multiple interface developments, if this information
is designed to meet the needs of the most sophisticated user (see goals 1 and 2, Chapter
1).
Beyond simple statistics, the majority of time series analysis techniques require that the
data be equally spaced in time and have no gaps or missing values. Examples are
filtering, spectra, correlations, and principal component analysis (or EOF’s). The
no-missing-value requirement may be relaxed for some methods, such as the least-squares
fits used in tidal analysis. The data structures, however, should be able to accommodate
all types of time series, but it would be useful to the analyst if there were attributes of the
time series that indicated its state (e.g. that the series is equally spaced and has no gaps).
This could eliminate the need for many processing steps, such as applying time checks,
searching for missing values, and interpolating gaps.
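
A sketch of the check such attributes would make unnecessary; here the user scans the
series directly for unequal spacing and fill values (the fill value and the tolerance are
illustrative):

def is_clean(times, values, fill_value=-9999, tol=1e-6):
    # True if the times are equally spaced and no value equals the fill value
    steps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    equally_spaced = all(abs(s - steps[0]) <= tol for s in steps)
    return equally_spaced and all(v != fill_value for v in values)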
Finding and selecting time series datasets are also crucial parts of using the NVODS
system. The user will need to perform sophisticated, complex searches of metadata in
order to focus an analysis on appropriate time series. This topic is not strictly within the
purview of this report. However, by considering the types of query that an aggregation
server may be required to service, and the possible methods that may be used to locate
data, some of the requirements for multiple time series data structures are clarified. In
particular, a close relationship of the data structure to relational database tables, which
could be used to store metadata, could be advantageous to implementations of
aggregation servers.
An example of a query that a search engine / aggregation server would need to service is:
Find all the current records that are below 1000-m depth, have durations longer
than 6 months, were located in the Gulf of Mexico, and overlap the period
January 1, 1990 to December 31, 1993.
Note that this query would potentially retrieve datasets from many different moorings,
and therefore, an aggregation data structure should not be restricted to records from a
single mooring. This query may be further restricted by requirements on data
organization. For example:
Restrict returns to equally spaced data with time steps less than or equal to 1
hour, and that use East and North coordinate axes.
This could reduce the number of individual time series selected; however, there may be
duplicates if the data were stored in more than one file on the provider site (i.e. different
versions or processing levels) or if more than one provider site has copies of the dataset.
Therefore, it is likely that the returns would require order criteria. For example:
Order by increasing depth, global data quality, source institution, and instrument
manufacturer.
These types of queries could be satisfied if all the metadata of the time series datasets
of the providers were catalogued in relational tables. The queries above could be
formulated as SQL statements, of some complexity, depending on how the tables are
organized and the information included in the table columns.
would also allow the user to refine the search by just interacting with the relational tables
and not doing time consuming queries of individual provider sites. However, if data
returns were further restricted by the values in the time series, just searching metadata
tables would not be sufficient. For example:
Restrict returned data to values where the current speed is greater than 60 cm/s.
If the above data value criteria were allowed, then the aggregation server may have to
further filter the time series values before sending the data to the user. The returned data
would no longer be equally spaced because possibly only a few (or no) data values in
each selected time series may satisfy the criteria.
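
As an indication of the complexity involved, the metadata part of the example query
might look like the following SQL (shown here as a Python string; the table and column
names are hypothetical, anticipating tables like those sketched in Section 3.3, and the
latitude/longitude box is a crude stand-in for the Gulf of Mexico):

METADATA_QUERY = """
SELECT ts.id, ts.start_time, ts.stop_time, loc.instrument_depth
FROM time_series AS ts
JOIN location AS loc ON loc.location_id = ts.location_id
JOIN platform AS p   ON p.platform_id   = loc.platform_id
WHERE ts.standard_name IN ('eastward_sea_water_velocity',
                           'northward_sea_water_velocity')
  AND loc.instrument_depth > 1000.0
  AND julianday(ts.stop_time) - julianday(ts.start_time) > 182.0  -- ~6 months
  AND p.latitude  BETWEEN 18.0 AND 31.0   -- crude Gulf of Mexico box
  AND p.longitude BETWEEN -98.0 AND -80.0
  AND ts.stop_time >= '1990-01-01' AND ts.start_time <= '1993-12-31'
  AND ts.time_step IS NOT NULL AND ts.time_step <= 60.0  -- minutes
ORDER BY loc.instrument_depth, ts.id
"""
# julianday() is SQLite's date arithmetic; other database engines differ.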
A possible organization of such a search and retrieval system is sketched in Figure 3.
The search engine has been separated from the aggregation server because they are
logically distinct operations. The idea is that the search engine would periodically poll
the individual provider sites and populate a metadata database that describes each time
series on the provider site along with its location. It is expected that this database would
be in the form of relational tables. Thus, after initialization, the provider sites would only
provide metadata if the site is updated. Providing the metadata to the database will be a
complex task because each individual site will have different conventions and translation
tables will be needed to interpret attributes of the local time series in terms of the
relational table column names. However, this need only be done once as long as the local
conventions of the time series files remain unchanged. Potentially many millions of time
series could be catalogued this way. The advantages of the use of a relational database to
organize the metadata for time series from multiple sites are as follows:
1) Searches are local and efficient and can use standard SQL and already
developed query software. Iterative queries to refine the selection only
involve interaction between the user and the database. Searching metadata
tables should be a lot more efficient than searching through the contents of
millions of files scattered over a few hundred provider sites.
2) The relational table structures impose discipline on the metadata and can
enforce rules.
3) The metadata of requested records could be directly supplied to the
aggregation engine, thus bypassing the individual sites. The individual
sites could then just supply the time series arrays to the aggregation server.
The overhead of translating the time series metadata every time a time
series record is delivered would be saved.
4) A lot of authoritative support information, such as instrument descriptions
or filter characteristics, could be stored in the database under well-known
keys (ID’s). This could encourage a central repository for this type of
information, which the provider sites could import and incorporate in their
files.
5) Any work generating user-friendly interfaces would only have to be
done for the query engine sites and not the individual provider sites.
Therefore, the concept is that locations of selected time series, along with appropriate
standardized metadata, are passed to the aggregation server, which then assembles the
metadata and retrieves the time series from the provider sites. The aggregation server
applies any needed or requested conversions and scaling, and uses the multiple time
series data structure to return data to the user.
3. Data Structure Proposal
3.1. Restrictions
A number of restrictions are proposed for aggregate time series structures. Some have
practical consequences in that they will simplify the data stream, and others are more
related to the workshop’s philosophy for aggregation structures. They are:
1) Whether a single or multiple time series datasets are returned to the user,
the resulting data stream will be a single entity (i.e. a DODS file).
This adds complexity to the data structure, but the alternative is to return each time series
dataset as a separate file. Such a set of files may be large, and thus, there is more of a
burden on the user to sort through the results. If the results are in one compact structure,
it should be easier to select or discard individual series. On the other hand, user API’s
may need to be modified to accept more complex data streams.
2) If a data structure contains more than one time series variable (e.g.
temperature and salinity), the measurements must be co-located in space
and time.
The original proposal was that only one type of variable (scalar or vector) would be
returned per request, the argument being that if the user wanted another variable, a
second request could be made. However, the workshop thought that this was too
restrictive and that the user should be able to request more than one variable at a time. If the
restriction is made that all the variables requested are from the same locations (i.e.
instruments; there may be many instruments) and have the same time sampling, then a
logical structure containing more than one variable can be constructed. Thus, if the
input file contained velocity components and temperature from the same current meter,
and all the time series arrays had the same dimension size, then they could be retrieved at
the same time. If the temperature array was sampled at 20-minute intervals and the current
vector components at 10-minute intervals (say), then one of the arrays (scalar or vector) would be
excluded from the results. Which one would depend on how the query was constructed.
3) Except for scaling required to make units consistent, the aggregation
server would not alter any time series values received from provider sites.
This implies that the aggregation server performs no time or depth averaging or
interpolation. This is essentially a data integrity constraint. Therefore, using a fixed
(depth, time) array, organized by mooring, for the aggregation structure is too restrictive
given the nature of moored time series measurements that might have to be
accommodated (see Section 2.1).
4) The aggregation data structure should be adaptable to other types of
oceanographic data (e.g. cast (CTD) or Lagrangian float data).
Thus, any adopted data structure should be flexible and with only small adjustments be
useable for a wide range of space and time variable data.
3.2. Aggregation Data Structure Using Arrays
The netcdf model will be used to describe array structures for the reasons given above,
even though the DODS grid structure is independent of the file format used for inputs.
Thus, the aggregation data structure will have the usual sections of dimensions, global
attributes, variables and variable attributes. Though not specified below, variables have
the usual DODS base data types of byte, int32, float64, and string. Note that arrays of
strings in netcdf have to be dimensioned with the maximum length of the string, because
strings are treated as arrays of single characters (e.g. array_of_strings(number_of_strings,
max_length_of_strings). DODS also indexes arrays with a base of 0 (similar to C). The
basic idea of the aggregation data structure is that time series data of the same type (e.g.
temperature) are packed into a single one-dimensional array, T.
If the number of time series is M, and the number of data values in each series is n_i,
where i = 0 … M-1, then the array is given as:
T\left[\,0 \ldots n_0 - 1,\; n_0 \ldots n_0 + n_1 - 1,\; n_0 + n_1 \ldots n_0 + n_1 + n_2 - 1,\; \ldots,\; \sum_{i=0}^{M-2} n_i \ \ldots\ \sum_{i=0}^{M-1} n_i - 1\,\right] \qquad (1)
To extract a single time series from this array, a start_index and number_of_points are
required. For series j:

\mathrm{start\_index} = \sum_{i=0}^{j-1} n_i\,, \qquad \mathrm{stop\_index} = \mathrm{start\_index} + n_j - 1 \qquad (2)
Note that the array T[…] has a base data type (e.g. float64) and all elements have the
same units. Therefore, the aggregation server may need to convert some of the input time
series (e.g. from Fahrenheit to Celsius), and provide uniform _FillValues.
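
A pure-Python sketch of equations (1) and (2): packing M series into the single array T
and recovering series j from start_index and npts (list operations stand in for the
DODS/netcdf arrays):

def pack(series_list):
    # concatenate the series; record where each one starts and its length
    T, start_index, npts = [], [], []
    for s in series_list:
        start_index.append(len(T))
        npts.append(len(s))
        T.extend(s)
    return T, start_index, npts

def extract(T, start_index, npts, j):
    # equation (2): series j occupies T[start_index[j] .. start_index[j]+npts[j]-1]
    return T[start_index[j] : start_index[j] + npts[j]]

T, start, n = pack([[10.1, 10.2], [8.7, 8.8, 8.9]])
assert extract(T, start, n, 1) == [8.7, 8.8, 8.9]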
As a practical matter, the aggregation server will need to use a standard set of variable
names (e.g. T for temperature). The COARDS/CF standards do not have any restrictions
on variable names, but rather define their attributes. Variable names differ for each
provider institution for the same physical quantity; however, the standard_name CF
attribute could provide an identification mechanism, as discussed above. The variable
names adopted by the aggregation server should probably be already in fairly general use.
The Coriolis Project has proposed a standardized variable naming system for seawater
temperature (TEMP), salinity (PSAL), etc.
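
A sketch of how the standard_name attribute could drive that assignment; the mapping
below is illustrative only (T, U and V are the names used in the CDL example that
follows; TEMP and PSAL are the Coriolis proposals):

AGGREGATION_NAME = {
    "sea_water_temperature":        "T",   # Coriolis would use "TEMP"
    "sea_water_salinity":           "S",   # Coriolis: "PSAL"
    "eastward_sea_water_velocity":  "U",
    "northward_sea_water_velocity": "V",
}

def server_variable_name(attributes):
    # map a provider variable to the aggregation name via its standard_name;
    # returns None when no standard_name (or an unknown one) is supplied
    return AGGREGATION_NAME.get(attributes.get("standard_name"))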
A basic aggregation data structure is given in CDL format {1} (see notes below).
dimensions:
    series = M ;                  // {2} number of time series periods
                                  // corresponding to separate instruments.
    time = UNLIMITED ;

// Global Attributes {3}
// Information on aggregation data processing; standard COARDS attributes.
    :title = "NVODS Aggregation Server for Time Series" ;
    :references = ;               // URL's of documentation
    :source = "ocean observations from moorings" ;

variables:
// Reduction in rank possible if arrays are constant {4}

// Series Locations and Descriptions {5}
    series(series) ;              // time series ID's (constructed by the
                                  // aggregation server?).
    series:coordinates = "depth latitude longitude" ;  // CF coordinates attribute {6}
    latitude(series) ;            // mooring latitude {7}
    longitude(series) ;           // mooring longitude
    depth(series) ;               // measurement depth
    depth:positive = "down" ;     // CF/COARDS vertical coordinate attribute.
    mooring_ID(series) ;          // defined by provider
    water_depth(series) ;         // water depth of mooring
    instrument_ID(series) ;       // community definition {8}
    instrument_depth(series) ;    // optional: used if any of the instrument depths
                                  // differ from the measurement depths (e.g. if an ADCP).
    site_URL(series) ;            // where time series originated

// Time series parameters {9}
    start_time(series) = "yyyy-mm-dd hh:mm:ss" ;  // FGDC/COARDS date;
                                  // corresponds to start_index.
    time_step(series) ;           // only defined for equally spaced data
    time_step:units = "minutes" ; // defined by user {10}
    npts(series) = n_j ;          // number of data points in series j.
    start_index(series) ;         // = n_0 + n_1 + ... + n_(j-1), as defined in (2).

// Time series processing flags {11}
    equally_spaced(series) = "y or n" ;
    no_fill_values(series) = "y or n" ;
    filters_applied(series) ;     // may be replaced by a set of community-defined
                                  // integer flags; community definitions.

// Aggregated variables (names assigned and data appropriately scaled by the server)
    T(time) ;                     // temperature T[...] defined by (1)
    T:units = "degrees_C" ;       // defined by user
    T:_FillValue = -9999 ;        // defined by user
    T:standard_name = "sea_water_temperature" ;  // CF required attribute

// Examples of additional co-located aggregated dependent variables
    U(time) ;                     // East component of current U[...]
    U:units = "cm/s" ;            // defined by user
    U:standard_name = "eastward_sea_water_velocity" ;   // CF attribute
    V(time) ;                     // North component of current V[...]
    V:units = "cm/s" ;            // defined by user
    V:standard_name = "northward_sea_water_velocity" ;  // CF attribute

// Independent variables (form defined by user) {12}
    time(time) ;                  // time of all data points in T[...], etc.
    time:units = "minutes since 1950-01-01 00:00:00" ;  // CF/COARDS

// Data QA/QC flags (optional) {13}
    T_qc(time) ;                  // temperature QC flags
    T_qc:flag_values = (0, 1) ;   // community defined (CF attribute)
    T_qc:flag_meanings = "good_data corrected_or_interpolated_data" ;
Notes:
1) Data types (e.g. int, float64, etc.) have been left out of this description.
Values given to attributes are only for illustration.
2) This section will require character array dimensions for any
“arrays_of_strings” in the file. All arrays with the time dimension have
exactly the same number of points.
3) Information is supplied in the Global Attributes that apply to the whole of
the aggregated data. The examples are a minimal list.
4) Where the elements of an array have constant values (e.g. latitude and
longitude, if the file contains data from a single mooring), the arrays can
be reduced in rank to become scalars. For conciseness, CF/COARDS
attributes such as standard_name, long_name, _FillValue, units,
valid_range, etc. have been omitted for most variables. Where these
attributes are present, they are included for emphasis because they are
considered essential. This is again a minimal list of variables describing
the location (x, y, z) and instrument characteristics of each time series in
the aggregation.
5) CF recommended attribute to associate a single spatial dimension (time
series number) with the independent named position coordinates.
6) The spatial coordinates (latitude, longitude and depth) are here considered
time independent. If any of these coordinates were time dependent, as
discussed in Section 2.1, then they would be defined in the Aggregated
Independent Variables section, and nominal values used in this section.
Thus, if depth is time dependent, then it is replaced here by:
nominal_depth(series) ;
and
depth(time) ;
is added to the Aggregated Independent Variables.
8) “Community definition” is meant to suggest that if there were a sanctioned
list of metadata associated with a unique ID for a variable, then a
potentially large amount of information can be referenced compactly and
unambiguously. For example, if instrument characteristics were cataloged
with standard metadata, including precision and accuracy, then a single ID
could reference all this information that would be helpful for the user’s
evaluation of each time series. A suitable attribute would indicate how the
ID’s should be used, e.g.
instrument_ID:conventions = “Oceanographic Instrument List v-1.0”
Such conventions must be a community effort that involves instrument
manufacturers.
9) “User defined” implies that he or she requested the units and sometimes
the format of the variables that are to be returned by the aggregation
server.
10) If a time series has equally spaced data points, then the time of any data
point can be found knowing the start_time of the series, the time_step and
the data point index (relative to start_index for aggregated series). If all
series in the aggregation have equally spaced points, then an independent
time variable is not strictly needed, however, it probably should still be
provided because it may be expected by some processing API’s.
11) A set of conventions, developed by the community, would be useful to
indicate the processing level of each time series. This is a separate issue
from defining data value flags. Defining whether a series is equally spaced
and has no missing values would be very useful to the user. Similarly,
some standard way of indicating whether filters or time averaging have
been applied helps determine how the data can be used.
12) The user sets characteristics of the independent variables. Thus, for time,
it might be requested that it be in the form of year, month, day, hour,
minute and second, returned either as a string or as separate integer
variables. If COARDS times are requested, the user sets the reference
date and the units.
13) Most institutions have some kind of integer flags defined to indicate data
quality of measurements. A very simple scheme of two flags (0 and 1) is
illustrated here. A more elaborate scheme is proposed for use by the
Coriolis project that uses flags (0 – 9). Again, community sanctioned
definitions would be useful. Many providers may only use a subset and
the aggregation server may need to translate meanings. The recommended
CF attributes flag_values and flag_meanings can be used to define flags.
Note that QC flags are not always available and in some cases would be
meaningless (e.g. for filtered time series).
The proposed data structure has the following virtues:
1) The array storage is compact in that there is a minimum of wasted space
that could be caused by using 2 or 3-D arrays to contain the multiple time
series. DODS structures were examined as a possible storage mechanism,
but arrays of structures have to have the same dimensions and this would
be violated by multiple time series of differing numbers of points.
2) If series = M = 1, the structure reduces to something very similar to
many netcdf time series files for point-measurement instruments.
Instruments (e.g. ADCP’s) that generate (depth, time) 2-D arrays are not
as elegantly accommodated. However, reconstructing a 2-D from a 1-D
array using start_index and npts is a straightforward operation if npts is
constant (e.g. the FORTRAN 90 reshape function or equivalent; see the
sketch after this list).
3) By placing the metadata in arrays, indexed by the time series number
(series), organization and use of this data is simplified and the data
structure avoids proliferation of variable names. The latter would occur if
each time series were given a separate structure. Populating these arrays
from queries of relational tables of time series metadata should be
reasonably straightforward.
4) The data structure can easily be adapted to different data types. For
example, making the variables latitude, longitude, and depth aggregated
time arrays could accommodate Lagrangian data from multiple RAFOS
floats. Making depth the aggregating dimension could accommodate
CTD cast data from hydrographic cruises. Thus, in the above structure:
dimensions:
    depth = UNLIMITED ;           // change time to depth
variables:
    series(series) ;
    series:coordinates = "time latitude longitude" ;  // change coordinate dependence
    time(series) ;                // add time of cast

// Aggregated variables
    T(depth) ;                    // temperature array for multiple casts
    S(depth) ;                    // salinity array for multiple casts
    depth(depth) ;                // depths of T,S measurements
Other metadata variables would probably need to be defined for this case.
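
As noted in virtue 2 above, a 2-D ADCP array can be rebuilt from the packed 1-D
array; a sketch using numpy in place of the FORTRAN 90 reshape (the grouping of one
series per depth bin is an assumption about how the server would return ADCP data):

import numpy as np

def rebuild_adcp(T, start_index, npts, bin_series, n_bins):
    # bin_series lists the series numbers of the n_bins depth bins, which
    # must all have the same npts for the reshape to be meaningful
    assert len(bin_series) == n_bins
    assert len({npts[j] for j in bin_series}) == 1
    rows = [T[start_index[j] : start_index[j] + npts[j]] for j in bin_series]
    return np.asarray(rows)  # shape (n_bins, npts), i.e. a (depth, time) array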
3.3. Relational Tables
Relational databases are used by many institutions to manage large quantities of data.
For example, WOCE uses a relational database to catalog and search its data
inventories. The use of relational databases to catalog time series metadata was discussed
in Section 2.3 from the point of view of searching and retrieval of data. The structure of
the tables would serve to supply information for the variables in the aggregation data
structure discussed above. Therefore, every time series array that is accessible by an
aggregation server would have an entry in the metadata tables. Design of the tables is
beyond the scope of this document, but some ideas on the connections between the tables
and the aggregation data structure are presented here for future reference.
time_series                       // table name
    id                            // a unique identifier of form:
                                  // provider_URL/filename/variable_name
    standard_name                 // identifies the variable and measurement type
    long_name                     // description of variable
    ancillary_variables           // identifies names of QC flag arrays (e.g. T_qc)
                                  // in the same file
    _FillValue                    // missing value indicator
    units                         // measurement units
    start_time
    stop_time
    npts                          // number of data points
    time_step                     // not defined if not equally spaced
    data_quality                  // flags for equally spaced and no missing values
    filter_code                   // codes (Foreign Key) for processing levels:
                                  // link to a support table
    depths                        // measurement depths of the array (for ADCP's)
    location_id                   // Foreign Key to identify entry in location table

location                          // table name
    location_id                   // unique Key
    platform_id                   // Foreign Key to identify platform type and
                                  // position (platform table)
    instrument_id                 // Foreign Key to identify instrument
                                  // characteristics (instrument table)
    serial                        // instrument serial number
    instrument_depth              // depth of instrument

platform                          // table name
    platform_id                   // unique Key
    latitude
    longitude
    water_depth
    code                          // code describing type of mooring
                                  // (e.g. subsurface, bottom tripod, etc.)
    owner                         // information on the institution using or
                                  // deploying the mooring

instrument                        // table name
    instrument_id                 // unique Key
    description                   // e.g. current meter, ADCP, etc. (standard codes)
    sensors                       // e.g. velocity components, temperature (standard codes)
    accuracy                      // accuracy of sensors
    precision                     // precision of sensors
    manufacturer
    comments
In this set of tables, the id in the time_series table is used to locate the array of
measurement data inside a file on a provider’s DODS site. The use of codes to identify
mooring types and instrument sensors implies that support tables, describing these, will
be needed. These subsidiary tables are not given here so as to keep the table structures
reasonably simple. There is a one-to-many relationship between the keys. Thus, many
time series id’s will have the same location_id, and many location_id’s will have the
same platform_id. Even this fairly straightforward table structure, which corresponds
quite closely to the aggregation data structure variables and attributes, will require
complex translation mechanisms for the contents of each provider’s site before it could
be used to populate this database.
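
A sketch of this key structure as SQLite DDL, executed from Python; the column lists
are abbreviated from the tables above, and all types are illustrative:

import sqlite3

DDL = """
CREATE TABLE platform   (platform_id INTEGER PRIMARY KEY,
                         latitude REAL, longitude REAL, water_depth REAL,
                         code TEXT, owner TEXT);
CREATE TABLE instrument (instrument_id INTEGER PRIMARY KEY, description TEXT,
                         sensors TEXT, accuracy TEXT, "precision" TEXT,
                         manufacturer TEXT, comments TEXT);
CREATE TABLE location   (location_id INTEGER PRIMARY KEY,
                         platform_id   INTEGER REFERENCES platform(platform_id),
                         instrument_id INTEGER REFERENCES instrument(instrument_id),
                         serial TEXT, instrument_depth REAL);
CREATE TABLE time_series (id TEXT PRIMARY KEY,  -- provider_URL/filename/variable_name
                          standard_name TEXT, units TEXT,
                          start_time TEXT, stop_time TEXT,
                          npts INTEGER, time_step REAL,
                          location_id INTEGER REFERENCES location(location_id));
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)  # many time_series rows may share one location_id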
4. Summary and Recommendations
A flexible data structure has been proposed for aggregated moored time series data. It is
designed to accommodate multiple time series arrays of varying characteristics in a single
structure (file) that employs packed 1-D arrays or grids. This document outlines the
reasons behind the design of this data structure, and should be used as guidance for more
detailed design studies. In order to make the structure reasonably compatible with
existing NVODS moored data provider sites, netcdf conventions have been adopted, and
the structure is designed to use simple arrays or grids that are consistent with netcdf files.
Adapting user API’s that already use time series in netcdf or DODS grid format should
be straightforward, requiring only the extraction of segments of 1-D arrays.
Community conventions established by COARDS/CF for netcdf attributes should be
adopted for the metadata included in the aggregation data structure. Similarly,
community initiatives for marine metadata and standard variable names provide
guidance for the variables and their attributes that may be included in the data structure.
However, only metadata that is directly useful to the end user-analyst should be required.
Thus, metadata for the location of a time series are required, but details of instrument
calibration may not be. The variables and attributes included in the data structure design
(Section 3.2) are considered a minimal required list.
Specific recommendations for implementing the data structure are:
1) Use community established conventions wherever possible.
2) Encourage CF to adopt more standard_names that relate to ocean
measurements.
3) Encourage the development of new conventions for describing entities in
the data structure. This document has identified the following topics that
would benefit from community wide standards:
 Standardized instrument descriptions, including sensor accuracy
and precision.
 Standardized measurement QC flags.
 Standardized descriptions of the levels of processing for time
series. This includes specifying equal/non-equal time spacing,
existence of gaps, and filter codes and descriptions.
4) Implement relational database tables that parallel the structure, for use in
cataloging, searching, and standardizing variable names for time series
metadata.
Appendix A
Workshop Attendees