An Architecture for Real-Time
Warehousing of Scientific Data
Ramon Lawrence and Anton Kruger
IIHR, University of Iowa
[email protected]
http://www.cs.uiowa.edu/~rlawrenc/
http://www.iihr.uiowa.edu/~hml/projects/nexrad-itr
Overview
Our goal is to build a general archival architecture for storing
and querying massive amounts of scientific data.
This presentation will discuss our current architecture and how it
is being used in a national project to archive weather radar data
in the United States.
The architecture achieves four basic design goals:
1) scalable - can handle terabyte-scale data sets
2) extensible - types of data and metadata stored can change
3) inexpensive - uses cheap hardware and open-source software
4) usable - researchers can interact with the system in a variety
of intuitive ways
Ramon Lawrence An Architecture for Real-Time
Warehousing of Scientific Data
The University of Iowa.
Copyright© 2005
Page 2
Motivation
The size of scientific data sets in many domains is increasing
dramatically. This is placing a burden on IT infrastructure for
storing, processing, and querying the data effectively.
As sensor networks are deployed, this will get even worse.
Although data warehousing techniques are well known, managing
data sets of this scale remains an impediment to research.
One of the most basic challenges is finding data relevant to the
research (the data finding problem). To avoid browsing a large
data set, suitable metadata describing the data must be
generated, stored, and queryable by the researcher.
Desirable Architecture Properties
Our architecture is designed with four key properties:
1) scalable - The system can accommodate more data simply
by adding low-cost PCs. Data files are transparently allocated
and replicated across nodes without custom hardware/software.
2) extensible - The types of metadata generated and stored
may change over time as the research evolves.
3) inexpensive - Low-cost hardware and open-source software
are used.
4) usable - Researchers can interact with the data archive in a
variety of ways, including directly through C code, web forms, or
web services.
Archive Architecture Overview
Architecture Components
The components:
Extractor - is the only component specific to the data set. It is
the code module for computing desired metadata statistics on
the data. The output is a standard XML schema defined by the
Loader.
Loader - is the module responsible for storing metadata in the
database and using rules to place data files on retrieval
servers. This component is not data set specific. Different and
evolving metadata is supported by a general database schema.
Metadata archive - is a relational database that stores the
metadata and pointers to the data. SQL queries are built using
the various front-end tools (C code, web interface, etc.) to query
metadata to find data with specific properties and file locations.
Retrieval server - is any machine capable of running an HTTP
server and acting as a data file store.
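The flow between these components can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the XML element names, the two-table layout, the hash-based placement rule, and the example.org server names are all hypothetical, and SQLite stands in for the relational metadata archive.

```python
import sqlite3
import xml.etree.ElementTree as ET
import zlib

# Hypothetical retrieval servers; any machine running an HTTP server would do.
SERVERS = ["http://store1.example.org", "http://store2.example.org"]

def extract(file_name, stats):
    """Extractor: wrap data-set-specific statistics in a generic XML record."""
    root = ET.Element("datafile", name=file_name)
    for attr, value in stats.items():
        ET.SubElement(root, "attribute", name=attr).text = str(value)
    return ET.tostring(root, encoding="unicode")

def place(file_name):
    """Placement rule (assumed): choose a server from a stable hash of the name."""
    return SERVERS[zlib.crc32(file_name.encode()) % len(SERVERS)]

def make_archive():
    """Metadata archive: an in-memory relational database (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE datafile(id INTEGER PRIMARY KEY, name TEXT, url TEXT)")
    conn.execute(
        "CREATE TABLE metadata(dataFileId INTEGER, attributeName TEXT, attributeValue REAL)"
    )
    return conn

def load(conn, xml_record):
    """Loader: store metadata rows and a pointer (URL) to the data file."""
    root = ET.fromstring(xml_record)
    name = root.get("name")
    url = f"{place(name)}/{name}"
    cur = conn.execute("INSERT INTO datafile(name, url) VALUES (?, ?)", (name, url))
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(cur.lastrowid, a.get("name"), float(a.text)) for a in root.findall("attribute")],
    )
    return url

def find_urls(conn, attr, minimum):
    """Front-end query: URLs of files whose attribute exceeds a threshold."""
    rows = conn.execute(
        "SELECT d.url FROM datafile d JOIN metadata m ON m.dataFileId = d.id "
        "WHERE m.attributeName = ? AND m.attributeValue > ? ORDER BY d.id",
        (attr, minimum),
    )
    return [r[0] for r in rows]
```

Only extract() would need to change for a different data set; load() and find_urls() never inspect attribute names, which is the property the Loader description relies on.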
Case Study: Archiving NEXRAD Data
Our goal is to provide the community with access to the vast
archives and real-time data collected by the NEXRAD system.
There are over 150 NEXt generation RADars (NEXRAD) that
collect real-time precipitation data across the United States.
The system has been operational for about 10 years, and the amount of
collected data is continually expanding.
How a radar works:
A radar emits a coherent train of
microwave pulses and processes
reflected pulses.
Each processed pulse corresponds
to a bin. There are multiple bins in
a ray (beam). Rotating the radar
360º is a sweep. After a sweep the
radar elevation angle is increased,
and another sweep performed. All
sweeps together form a volume.
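The bin / ray / sweep / volume hierarchy maps naturally onto nested records. The sketch below is only an illustration of that nesting: the class and field names are assumptions, not the NEXRAD Level II file layout, and the angles used in the toy builder are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Ray:
    """One beam: a list of bins, one value per processed pulse."""
    azimuth_deg: float
    bins: List[float] = field(default_factory=list)

@dataclass
class Sweep:
    """One full 360-degree rotation at a fixed elevation angle."""
    elevation_deg: float
    rays: List[Ray] = field(default_factory=list)

@dataclass
class Volume:
    """All sweeps of one scan, taken at increasing elevation angles."""
    sweeps: List[Sweep] = field(default_factory=list)

    def bin_count(self) -> int:
        return sum(len(ray.bins) for sweep in self.sweeps for ray in sweep.rays)

def toy_volume(n_sweeps: int, n_rays: int, n_bins: int) -> Volume:
    """Build an empty volume with made-up angles, for illustration only."""
    return Volume([
        Sweep(
            elevation_deg=0.5 + s,
            rays=[Ray(azimuth_deg=360.0 * r / n_rays, bins=[0.0] * n_bins)
                  for r in range(n_rays)],
        )
        for s in range(n_sweeps)
    ])
```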
Usefulness of NEXRAD Data
Although the NEXRAD system was designed for severe
weather forecasting, data collected has been used in many
areas including:
flood prediction
bird and insect migration
rainfall estimation
The value of this data has been noted by an NRC report, which
labeled it a “critical resource.”
Enhancing Access to NEXRAD Data—A Critical National Resource.
National Academy Press, Washington D.C. ISBN 0-309-06636-0, 1999
Archiving NEXRAD Data
Despite its value, the archival system for NEXRAD data is
unsatisfactory. The National Climatic Data Center (NCDC)
maintains a tape archive of the RAW data, but provides few
tools for finding relevant data and processing it for research.
Some real-time data is distributed by the University Corporation for
Atmospheric Research (UCAR) using their Unidata Internet
Data Distribution (IDD) system. However, this still requires
that users be able to:
extract and process a RAW data stream in real-time
archive it appropriately
generate metadata and indexes for retrieving it when required
filter the data set to reduce the amount of space required
develop custom tools for analysis and processing
Data Size Challenges
Individual NEXRAD Level II scans are not large (300-1000 KB).
However, archiving 150 radars that produce 10 scans per hour
results in an archive rate of 36,000 scans/day = 17 GB/day.
Although the cost of storage has decreased dramatically (1 TB
for under $10,000), this still requires a hardware investment.
A major challenge: how do you find the data files of interest?
Answer: Queryable metadata that allows you to ask for files
with certain properties without browsing the entire collection.
One problem: The metadata can be huge as well, making it
inefficient to search. Even worse, scientific metadata tends to
change as research evolves.
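The archive rate quoted on this slide can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1,000,000 KB); the radar count, scan rate, and scan sizes are the figures given above.

```python
RADARS = 150
SCANS_PER_HOUR = 10
HOURS_PER_DAY = 24

# 150 radars x 10 scans/hour x 24 hours
scans_per_day = RADARS * SCANS_PER_HOUR * HOURS_PER_DAY

def gb_per_day(scan_kb):
    """Daily archive volume in GB for a given scan size in KB."""
    return scans_per_day * scan_kb / 1_000_000

low = gb_per_day(300)    # every scan at the small end of the range
high = gb_per_day(1000)  # every scan at the large end of the range

# Average scan size implied by the quoted 17 GB/day
implied_avg_kb = 17 * 1_000_000 / scans_per_day
```

With these inputs the daily volume falls between 10.8 and 36 GB, and the quoted 17 GB/day corresponds to an average scan of roughly 470 KB, which is consistent with the stated 300-1000 KB range.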
User/Client’s View
“Find all the 2002 storms over the Ralston Creek watershed
with mean areal precipitation greater than X mm, and with a
spatial extent of more than Z km², with a duration of less than
N hours. I want the data in GeoTIFF.”
[Figure: the User/Client (program or library) sends a "Query Metadata"
request to the Metadata Archive, receives URIs ("Get URIs"), and then
fetches the files ("Get data") over HTTP from the Distributed Data
Archive (NCDC, Iowa, etc.).]
Current Status and Future Work
We have implemented a prototype version of the architecture
that is currently archiving 30 radars in real-time. Some basic
statistics are being generated and can be used to retrieve data
files of interest. Accessible at:
http://nexrad.cs.uiowa.edu
Immediate plans:
Generate standardized metadata for use by hydrologists.
Link NEXRAD data to basin information so that rainfall
estimation and flood prediction can be performed.
This research is supported by NSF ITR Grant ATM 0427422: “A
Comprehensive Framework for Use of NEXRAD Data in
Hydrometeorology and Hydrology”.
NEXRAD Project Participants
The University of Iowa (Lead)
W.F. Krajewski (PI)
A.A. Bradley, A. Kruger, R. Lawrence
Princeton University
J.A. Smith (PI)
M. Steiner, M.L. Baeck
National Climatic Data Center
S.A. Delgreco (PI)
S. Ansari
UCAR/Unidata Program Center
M. K. Ramamurthy (PI)
W.J. Weber
Thank You!
Extra Slides...
NEXRAD Data Management Challenges
Storing NEXRAD Level II data results in many interesting
database challenges:
Data size - A historical archive of NEXRAD data consumes
many terabytes of space.
Flexibility/Variability - Unlike commercial warehouses, the
types of data and metadata that should be stored in the
warehouse are not well understood and evolve over time.
Real-Time response - The data should be loaded and
queryable in real-time as it is received from the radars.
Scientific Workflow - It is desirable to capture and share
sequences of calculations on the raw data (scientific workflows)
and develop tools that seamlessly interact with the archive.
Flexibility Challenges
Ideally, the system should allow arbitrary metadata to be
associated with NEXRAD files that can easily be added,
updated, and queried.
Unfortunately, relational databases do not handle variable
information well. Although there are some known schema designs
that can handle variability, they are inefficient for large data sets.
Good news: This is not unique to hydrology. Researchers in
other domains are building grids to share data/metadata and
face the same challenges (e.g. GriPhyN - physics grid).
Bad news: Representing and querying variable data (especially
within a relational database) is an active research problem.
Flexibility Example
One way to represent variable metadata on a datafile in a
relational database is to have a single table:
metadata(dataFileId, attributeName, attributeValue)
Example:
dataFileId  attributeName         attributeValue
1           ArealCoverage         10
1           MaximumReflectivity   30
1           MinimumReflectivity   -5
2           ArealCoverage         20
2           PercentGroundClutter  15
3           AverageReflectivity   15
Data file 1 has three attributes: ArealCoverage, MaximumReflectivity,
MinimumReflectivity. Data file 2 has two attributes, and file 3 has only one.
Note that this schema allows any (variable) number of attributes per file.
A challenge:
How would you return all files that have
ArealCoverage > 5 and MaximumReflectivity > 20?
Answer: Join two copies of table metadata together.
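The self-join can be tried directly on the example rows. This sketch uses an in-memory SQLite database as a stand-in for the metadata archive; storing attributeValue as REAL is an assumption, since the slide does not give column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata(dataFileId INTEGER, attributeName TEXT, attributeValue REAL)"
)
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    (1, "ArealCoverage", 10),
    (1, "MaximumReflectivity", 30),
    (1, "MinimumReflectivity", -5),
    (2, "ArealCoverage", 20),
    (2, "PercentGroundClutter", 15),
    (3, "AverageReflectivity", 15),
])

# One copy of the table (a) tests ArealCoverage, the other (b) tests
# MaximumReflectivity; the join keeps files that satisfy both.
QUERY = """
SELECT a.dataFileId
FROM metadata AS a
JOIN metadata AS b ON a.dataFileId = b.dataFileId
WHERE a.attributeName = 'ArealCoverage' AND a.attributeValue > 5
  AND b.attributeName = 'MaximumReflectivity' AND b.attributeValue > 20
ORDER BY a.dataFileId
"""
matching = [row[0] for row in conn.execute(QUERY)]
print(matching)  # [1]
```

Only file 1 qualifies: file 2 has a large enough ArealCoverage but no MaximumReflectivity row at all. Each additional attribute in the condition adds another copy of the table to the join.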
Scientific Workflow
A workflow is a sequence of steps that is performed on data.
Workflows have received considerable attention where
documents must be routed between individuals.
Think of a funding proposal being internally routed through your university.
A scientific workflow is a sequence of steps performed on
scientific data. Each step uses as input the output of the
previous step. An example workflow in hydrology:
retrieve the raw data files of interest
remove ground clutter and Anomalous Propagation (AP)
calculate estimated rainfall
map calculations to a basin
Our goal is to support such workflows.
How to represent and store intermediary products?
How to make the tools/algorithms interoperable?
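The chaining of steps can be sketched as an ordered list of functions, each fed the output of the previous one. Everything here is a toy stand-in: the step bodies, the sample values, and the rainfall formula are placeholders rather than hydrological algorithms, and keeping every intermediate product in a list is just one naive answer to the storage question above.

```python
def run_workflow(steps, data):
    """Apply each step to the previous step's output; keep all intermediates."""
    products = [data]
    for step in steps:
        products.append(step(products[-1]))
    return products

def retrieve(file_ids):
    # Placeholder: pretend each file holds three reflectivity values.
    return {fid: [5.0, -3.0, 20.0] for fid in file_ids}

def remove_clutter(scans):
    # Placeholder filter: drop negative values (not a real clutter/AP algorithm).
    return {fid: [v for v in values if v >= 0] for fid, values in scans.items()}

def estimate_rainfall(scans):
    # Placeholder conversion (not a real reflectivity-to-rainfall relation).
    return {fid: sum(values) / 10 for fid, values in scans.items()}

def map_to_basin(rain_by_file):
    # Placeholder mapping: total everything into one hypothetical basin.
    return {"basin": "example", "rainfall": sum(rain_by_file.values())}

STEPS = [retrieve, remove_clutter, estimate_rainfall, map_to_basin]
products = run_workflow(STEPS, [1, 2])
final = products[-1]
```

Because every step takes and returns plain Python values, steps can be reordered or replaced independently, which is one simple way to think about the interoperability question.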
A Watershed or Basin
A watershed is an area of land that drains water, sediment and dissolved
materials to a common receiving body or outlet.
NRC Quote on NEXRAD Data Archiving
“[t]he limited use of ground-based radar
rainfall data outside of the operational
environment is partially attributed to the
lack of research-quality data products and
partially to poor archiving practices.”
NRC Report, 2002
Metadata
Basic
“Find all the 2002 storms over the Ralston Creek
watershed with mean areal precipitation greater than X
mm, and with a spatial extent of more than Z km², with a
duration of less than N hours. I want the data in GeoTIFF.”
Derived/Complex
“Find all the 2002 storms over the Ralston Creek
watershed with mean areal precipitation greater than X
mm, and with a spatial extent of more than Z km², with a
duration of less than N hours. I want the data in GeoTIFF.”
CUAHSI
Consortium of Universities for the Advancement of
Hydrologic Sciences (CUAHSI)