Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods

David Gallaher(1), Qin Lv(2), Glenn Grant(1), Garrett Campbell(1)
(1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA
(2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA

The National Snow and Ice Data Center
Mission: To monitor the climate data in Earth’s icy regions, and to analyze and distribute it worldwide 24x7. Focus is mainly NASA satellite data.
• Manages and distributes scientific data
• Supports data users
• Performs scientific research
• Creates tools for data access
• Educates the public about the cryosphere
Affiliations and Sponsorship:
• University of Colorado at Boulder
• Cooperative Institute for Research in Environmental Sciences
• World Data Center for Glaciology (since 1976)

Data Rods - Project Basis
The “Data Rods” project proposes to create a prototype of a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.

Objective: Remote Sensing Data Analysis
The Problem:
• Data sets are becoming too large to move over the internet
• Need for basic Boolean logic for time-series anomaly detection
• Data downloads for long time-series analysis are especially cumbersome

Analysis Challenges
• A wide variety of data formats
• Ever-increasing data set sizes
• Myriad analysis and visualization requirements
• There will be uses and analyses of the data that cannot be anticipated (data discovery is not enough)
• Lack of direct access to the data (e.g., albedo > 15%)
• Our current directory trees impede data access (we really need to consider a database)

“Big Data” Considerations
Search, order and transmission of data is ending.
• We must develop systems where the data stay fixed and analyses are rendered against them
• Rapid, scalable data access across time and space
• Direct query of the data, not just the metadata (we need more than what, where, when)
• Web-based spatio-temporal analysis and visualization

Database Choice
• Fast and efficient storage, query and retrieval of entire data sets, not just the metadata
• Ability to store colossal amounts of small files
• Relational databases can't handle it. The tables grow too big. (Object-relational is no better)
• Hadoop excels at unstructured data but, due to its batch-oriented nature, it is inefficient for real-time analytics as well as intra-data analysis
• A “pure-object” database was seen as the best choice

The Data Rods Project
The “Data Rods” project has created a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets.
We’ll cover the following:
• Database design
• Development status
• Basic analysis examples and performance
• Planned analysis and potential applications

Database Design
• Gridded data is key
• For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used
• Common resolutions between data sets (1 km, 5 km, etc.) and point data

[Figure: The nesting relationship of differing resolutions in EASE-Grid]

[Figure: Data Rods Concept]

Database Systems Development
[Diagram labels: Object Database Design; Passive Microwave; Visual Infrared; Active Microwave; Radar; Other; EASE-Grid Processing; Pixel Grid Sampling; Data Rod Objects; Cryospheric Change Analysis; Basic Data Management (query & index); Object Interface; Pattern Search (input pattern or trend); Object Database Loading; Automated Pattern Discovery (Anomaly Detection, Trend Detection, Cycle Detection); Data Rod Updating; User Interface]

Pure-Object Database
• Object persistence/instantiation is directly to/from the database: no Java Spring or Hibernate needed
• Not object-relational (pure-object examples include Versant, ObjectDB, db4o, Objectivity)
• Not as limited by size
• Fast interactions across databases
• Simple, efficient schema
Next: schema design

Object Database Schema
• Each image pixel is an object
• Data rods are time-series collections of pixels
• Each data rod can be analyzed independently
• Adjacency analysis by row/col or lat/lon

Database Creation
• Standardized grid dimensions
• Visualize as layers of imagery through time (days to decades)
• Gridded data sets
• Lends itself well to time-series analysis
[Figure: gridded data sets stacked through time; axis labels: Time, Longitude]

Status – Database Administration
• 5 AVHRR databases, each with 5 years of imagery (<100 GB each, administratively easier)
• Surface mask databases for the northern hemisphere at 5 km and 25 km
• SSM/I database, 25 years of daily 25 km data at all frequencies and polarizations
• Selected MODIS database at 250 m resolution
• ~600 GB total
• No upper limit to database size except disk space

AVHRR Database Creation
• Initial demonstration region is Greenland
• 25 years of daily multi-spectral AVHRR data at 5 km resolution
• 9000+ images
• 2 billion+ pixel objects total
• Each pixel object is independently accessible for query

Database Flexibility
• Data can be spread across many databases
• Transparent queries across databases
• Methods (routines) can be attached to the data rods to add functionality such as statistical analysis
• Data fusion: analyses may span multiple data types, resolutions, time spans
• Data Rods supports NetCDF output

Simple AVHRR Object Database Time Test
• Built a database using AVHRR 5 km data from 1995-1999
• 2 visible channels, 3 IR channels, 3 references plus albedo, skin temperature and cloud mask
• Database includes location class, time stamp class and metadata
• 213,000 data rods covering 5 years over Greenland
• 1 data rod contains 1825 pixels
• Pixels = 388,725,000, each with 11 variables/pixel
• Variables = 4.2 billion coded short integer values

Example Analysis Using Object Databases
• All queries run on a single processor, single thread
• Example #1: Queries and plots on single database
• Example #2: Queries and plots on multiple databases
• Example #3: Advanced spatiotemporal analysis
• We will move to multi-thread, multiprocessor once we have the design finalized (this is a research project)

Using Single AVHRR Object Database Time Test
• Single processor under load
• 5-year plots returned in 2-10 seconds
• Cached data plots returned in ½ second
• Images in 10 seconds

Multi Data Rod Selection
• Seven locations selected across 5 years simultaneously
• Brightness temperature and albedo output selected
• Again, cached queries are much faster

Example Analysis of Greenland & 5 Databases
Using five 5-year rods and statistics (1 min, or 5 s cached)
AVHRR albedo statistics, May average, 1981–2005:

Site                  Mean    Std. dev.
Camp Century          0.801   0.077
Summit Station        0.819   0.069
Swiss Camp            0.817   0.070
GISP Ice Core Camp    0.802   0.071

Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.

Temporal Analysis of Single Rods
• Descriptive statistical functions
• Spatiotemporal data selection
• Filtering by value
• Anomaly detection
Also:
• Image generation
• Inter-database data fusion

Broad Spatiotemporal Analysis (This took some time)
• Statistical analysis repeated at every grid cell.
• Intersection of surface mask database and AVHRR database: only pixels on the ice sheet were processed.
• Bad data filtered out.
• Multivariate: cloud mask used to exclude cloudy pixels from albedo averages.
• All 2 billion objects queried and analyzed

Analysis Example: Sea Ice Temporal Query
• We would like to remove clouds from the image (clouds move faster than ice, so find minimum albedo for open water)
• Moving 8-day window (t1 to t8) through datarod
• Minimum albedo in temporal window
Pseudocode example query:
    datarod = database.getDatarod(row, col)    # datarod: time series of pixels
    albedo = datarod.getMinAlbedo(t, t+7)

Analysis Result: Sea Ice Detection
Technique for removing clouds from the image
[Figure: one of the original images beside the composite image created from Data Rods’ time series (lowest AVHRR albedo over an 8-day period)]
Remaining objective: exclude lingering clouds

Analysis Potential: Rapid Data Fusion
• Loss of AMSR-E decreases sea ice detection capability
• Data Rods AVHRR/SSM/I product fusion may fill the gap
• Can be validated against AMSR-E sea ice record.
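
The 8-day minimum-albedo query from the sea ice example above can be sketched in runnable Python. This is only an illustration under stated assumptions: the source does not show the project's object database API, so the `Pixel` and `DataRod` classes, the `get_min_albedo` and `moving_min_albedo` methods, the in-memory list storage, and the synthetic albedo values are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Pixel:
    """One pixel object: one grid cell on one day (hypothetical fields)."""
    day: date
    albedo: Optional[float]  # None marks bad or missing data


@dataclass
class DataRod:
    """Time series of Pixel objects for one grid cell (hypothetical class)."""
    row: int
    col: int
    pixels: List[Pixel] = field(default_factory=list)

    def get_min_albedo(self, start: date, end: date) -> Optional[float]:
        """Minimum valid albedo for start <= day <= end, or None if none exists."""
        values = [p.albedo for p in self.pixels
                  if start <= p.day <= end and p.albedo is not None]
        return min(values) if values else None

    def moving_min_albedo(self, window_days: int = 8) -> List[Optional[float]]:
        """Slide a window of window_days through the rod, starting at each pixel."""
        span = timedelta(days=window_days - 1)
        return [self.get_min_albedo(p.day, p.day + span) for p in self.pixels]


# Synthetic rod: dark open water (low albedo) hidden by bright clouds on some days.
start = date(1995, 6, 1)
albedos = [0.55, 0.06, 0.60, 0.07, None, 0.58, 0.06, 0.62, 0.65, 0.07]
rod = DataRod(
    row=10,
    col=20,
    pixels=[Pixel(start + timedelta(days=i), a) for i, a in enumerate(albedos)],
)

# Equivalent of datarod.getMinAlbedo(t, t+7) in the pseudocode.
window_min = rod.get_min_albedo(start, start + timedelta(days=7))

# Cloud-reduced series: the darkest valid value seen in each 8-day window.
composite = rod.moving_min_albedo()
```

Because each rod is processed independently, the same call could be repeated for every (row, col) cell to build a composite image; how the actual system distributes that work is not described in the source.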
[Figure: AVHRR 8-day (high-resolution sea ice detection, still some clouds) + SSM/I (cloud free with good sea ice detection, but low resolution) = fused product (high-res sea ice extent, no clouds)]

Performing this lake detection analysis conventionally took 6 months (downloading, gridding and image analysis). With Data Rods, the analysis was done in 2 days (single thread, single processor).

What’s Next: Ongoing Efforts
• Newest version of ODB software has multi-threaded capability, to take advantage of multiprocessor machines and reduce query times
• Investigating Data Rod performance on the Janus supercomputer with Pan-Arctic extent
• User interface to Data Rod database

Creating 1000s of Databases for Use with Massive Parallel Machines
• Each database is small enough to be held in memory for each CPU (uses MPI calls)
• Each database covers 5° x 5° x 25 years of Data Rods
• Each database is capped (fixed for minimal changes)
• Changes are added to the present-year database for each 5° x 5° area

Creating 1000s of Databases for Use with Massive Parallel Machines (continued)
• With this database it should be possible to perform analysis at Internet speeds
• Multi-sensor analysis is relatively simple
• We are starting the database loading now
• 100 TB database testing will occur over the summer

Summary
• We can now perform high-speed time-series analysis on the server side without downloads
• Scalable, massive remote sensing databases
• Accelerated analysis compared to traditional “search, order and transmission” methods
• Interactions across data sets: data fusion
• Developing UI and additional analysis tools
• Allow users interactive access to the data

NSIDC Data Rods Project
Thank You
The Data Rods project is funded by the National Science Foundation through grant ARC 0941442.
Interested in testing Data Rods? Please contact us at: [email protected]