Grid-enabled Collaborative Research Applications
Internet2 Member Meeting, Spring 2003

Sara J. Graves
Director, Information Technology and Systems Center
University Professor, Computer Science Department
University of Alabama in Huntsville
Director, Information Technology Research Center
National Space Science and Technology Center
256-824-6064
[email protected]
http://www.itsc.uah.edu

“…drowning in data but starving for knowledge”
• Data glut affects business, medicine, military, science
• How do we leverage data to make BETTER decisions???
[Figure labels: Information; User Community]

Collaborative Research Applications
• Enabling Technologies for Collaborative Research
  – Grid-Enabled Data Mining Services
  – Interchange Technology Mark-ups
  – Collaboration Tools
• Collaborative Research Applications on the Grid
  – TeraGrid Expeditions
  – Linked Environments for Atmospheric Discovery
  – Propulsion Research: Rocket Engine Advancement Program 2

Data Mining
• Automated discovery of patterns and anomalies from vast observational data sets
• Derived knowledge for decision making, predictions and disaster response
• ADaM – Algorithm Development and Mining System
http://datamining.itsc.uah.edu

Mining Environment: When, Where, Who and Why?
WHEN
• Real Time
• On-Ingest
• On-Demand
• Repeatedly
WHERE
• User Workstation
• Data Mining Center
• GRID
WHO
• End Users
• Domain Experts
• Mining Experts
WHY
• Event
• Relationship
• Association
• Corroboration
• Collaboration

Iterative Nature of the Data Mining Process
[Figure labels: Knowledge; Evaluation and Presentation; Discovery; Mining; Cleaning and Integration; Preprocessing; Data; Selection and Transformation]

ADaM Engine Architecture
Processing stages: Input, Preprocessing, Analysis, Output
Data products: Data, Translated Data, Preprocessed Data, Patterns/Models, Results
• Input
  – HDF, HDF-EOS, GIF, PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII, Grass Vectors (ASCII Text), Intergraph Raster, Others...
• Preprocessing
  – Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
  – Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
  – Image Processing: Cropping, Inversion, Thresholding
  – Others...
• Analysis
  – Clustering: K Means, Isodata, Maximum
  – Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
  – Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
  – Genetic Algorithms
  – Neural Networks
  – Others...
• Output
  – GIF Images, HDF-EOS, HDF Raster Images, HDF SDS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, Others...
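
The ADaM Engine Architecture stages (Input, Preprocessing, Analysis, Output) suggest a chain of interchangeable modules. The following is only a minimal sketch of that idea, not ADaM's actual API: the function names `read_ascii`, `select_by_value`, `histogram`, `to_report`, and `run_pipeline` are hypothetical, and the sample values are made up for illustration.

```python
from collections import Counter
from functools import reduce


def read_ascii(text):
    """Input stage (hypothetical): parse whitespace-separated numbers."""
    return [float(token) for token in text.split()]


def select_by_value(lo, hi):
    """Preprocessing stage factory: keep only values within [lo, hi]."""
    def stage(values):
        return [v for v in values if lo <= v <= hi]
    return stage


def histogram(bin_width):
    """Analysis stage factory: count values per bin, keyed by the bin's lower edge."""
    def stage(values):
        return Counter(int(v // bin_width) * bin_width for v in values)
    return stage


def to_report(counts):
    """Output stage: render one 'edge: count' line per bin, in ascending order."""
    return "\n".join(f"{edge}: {n}" for edge, n in sorted(counts.items()))


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


# Illustrative values only; 999.0 plays the role of an out-of-range fill value.
report = run_pipeline(
    "250.5 262.0 999.0 268.3 251.1",
    [read_ascii, select_by_value(200.0, 300.0), histogram(10), to_report],
)
print(report)
```

Because every stage is just a callable, a different client could swap in another input reader or analysis step without touching the rest of the chain, which is the property the multilevel design appears to aim for.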
Mining Environments
• Multilevel Mining (ADaM)
  – Complete System (Client and Engine)
  – Mining Engine (User provides its own client)
  – Application Specific Mining Systems
  – Operations Tool Kit
  – Stand Alone Mining Algorithms
  – Data Fusion
• Distributed/Federated Mining
  – Distributed services
  – Distributed data
  – Chaining using Interchange Technologies
• On-board Mining (EVE)
  – Real time and distributed mining
  – Processing environment constraints

Grid-Enabled Data Mining Services
• Distributed researchers, data sources, storage and computational resources in a secure environment
• ADaM data mining modules as Open Grid Services Architecture (OGSA) services

Data Mining / Earth Science Collaboration: Tropical Cyclone Detection
Input: Advanced Microwave Sounding Unit (AMSU-A) data, after calibration, limb correction, and conversion to brightness temperature (Tb)
Mining Plan:
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
[Figure labels: Data Archive; Mining Environment; Result; Knowledge Base; Further Analysis; Hurricane Floyd]
Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis.
http://pm-esip.msfc.nasa.gov/cyclone

Data Mining / Earth Science Collaboration: Classification Based on Texture Features
• Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
• Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
• Comparison based on
  – Accuracy of detection
  – Amount of time required to classify

Parallel Version of Cloud Extraction
• GOES images can be used to recognize cumulus cloud fields
• Cumulus clouds are small and do not show up well in 4km resolution IR channels
• Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
• Three edge detection filters are used together to detect cumulus clouds, which lends itself to implementation on a parallel cluster
[Figure: processing diagram with labels GOES Image; Sobel Horizontal Filter; Sobel Vertical Filter; Laplacian Filter; Energy Computation (four boxes); Classifier; Cloud Image. Example panels: GOES Image; Cumulus Cloud Mask]

Data Mining / Earth Science Collaboration: Detecting Signatures
• Detecting mesocyclone signatures from radar data
• Science Rationale: A mesocyclone is an indicator of tornadic activity
• Developing an algorithm based on wind velocity shear signatures
  – Improve accuracy and reduce false alarm rates

Data Mining / Space Science Collaboration: Boundary Detection and Quantification
• Analysis of polar cap auroras in large volumes of spacecraft UV images
• Scientific Rationale:
  – Indicators to predict geomagnetic storms, which can
    • Damage satellites
    • Disrupt radio connection
• Developing different mining algorithms to detect and quantify the polar cap boundary
[Figure: image panels A, B, C, D; Polar Cap Boundary]

Data Mining / BioInformatics Collaboration: Genome Patterns
Text Pattern Recognition: Used to search for text patterns in bioscience data as well as other text documents.
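
The three-filter approach on the Parallel Version of Cloud Extraction slide can be sketched as follows. This is an illustration under stated assumptions, not the ADaM implementation: the slides only name an "Energy Computation" step, so defining energy as the mean squared filter response is an assumption, and `is_cumulus` with its `cutoff` parameter is a hypothetical stand-in for the unspecified classifier. The kernels are the standard 3x3 Sobel and 4-neighbour Laplacian masks.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 edge-detection kernels.
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def apply_kernel(image, kernel):
    """Slide a 3x3 kernel over the interior pixels of a 2-D list (no padding)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(
                kernel[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
            for c in range(1, cols - 1)
        ]
        for r in range(1, rows - 1)
    ]


def energy(response):
    """Assumed energy measure: mean of the squared filter responses."""
    squares = [v * v for row in response for v in row]
    return sum(squares) / len(squares)


def texture_energies(image):
    """Run the three filters concurrently and return one energy per filter."""
    kernels = (SOBEL_HORIZONTAL, SOBEL_VERTICAL, LAPLACIAN)
    with ThreadPoolExecutor(max_workers=len(kernels)) as pool:
        responses = list(pool.map(lambda k: apply_kernel(image, k), kernels))
    return [energy(r) for r in responses]


def is_cumulus(image, cutoff):
    """Hypothetical classifier: flag the tile if any filter energy exceeds cutoff."""
    return max(texture_energies(image)) > cutoff


# A flat tile has no texture; a checkerboard tile is strongly textured.
flat = [[5] * 4 for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(texture_energies(flat), texture_energies(checker))
```

The filters are independent of one another, which is why the slide notes that the method suits a parallel cluster. The thread pool here only mirrors that structure; in pure Python it would not give a real speedup, and a process pool or cluster scheduler would be the more realistic choice.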
[Figure labels: Scientists; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Event/Relationship Search System; Knowledge base; Genome DB]

Sensor Data Characteristics
• Many different formats, types and structures
• Different states of processing (raw, calibrated, derived, modeled or interpreted)
• Enormous volumes
• Heterogeneity leads to data usability problems

Interchange Technologies: Accessing Heterogeneous Data
The Problem
[Figure labels: DATA FORMAT 1; DATA FORMAT 2; DATA FORMAT 3; FORMAT CONVERTER; READER 1; READER 2; APPLICATION]
The Solution
[Figure labels: DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3, each with an ESML FILE; ESML LIBRARY; APPLICATION]
• Earth science data comes in:
  – Different formats, types and structures
  – Different states of processing (raw, calibrated, derived, modeled or interpreted)
  – Enormous volumes
• Heterogeneity leads to data usability problems
• One approach: Standard data formats
  – Difficult to implement and enforce
  – Can’t anticipate all needs
  – Some data can’t be modeled or is lost in translation
  – The cost of converting legacy data
• A better approach: Interchange Technologies

Earth Science Markup Language
What is ESML?
• It is a specialized markup language for Earth Science metadata based on XML, NOT another data format.
• It is a machine-readable and -interpretable representation of the structure, semantics and content of any data file, regardless of data format.
• ESML description files contain external metadata that can be generated by either data producer or data consumer (at collection, data set, and/or granule level).
• ESML provides the benefits of a standard, self-describing data format (like HDF, HDF-EOS, netCDF, geoTIFF, …) without the cost of data conversion.
• ESML is the basis for core Interchange Technology that allows data/application interoperability.
• ESML complements and extends data catalogs such as FGDC and GCMD by providing the use/access information those directories lack.
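
The idea of an external description file that tells a library how to read a data file, without converting the data, can be illustrated with a small sketch. This is NOT ESML syntax or the ESML library API: the `DESCRIPTION` dictionary, its keys, the field names, and `read_records` are all hypothetical, and a real ESML description would be an XML document governed by the ESML schema.

```python
import struct

# Hypothetical stand-in for an external description file (not real ESML markup).
DESCRIPTION = {
    "byte_order": "big",
    "fields": [
        {"name": "latitude", "type": "float32"},
        {"name": "longitude", "type": "float32"},
        {"name": "brightness_temp", "type": "int16", "scale": 0.01},
    ],
}

# Map the description's type names onto struct format codes.
_TYPE_CODES = {"float32": "f", "int16": "h", "uint8": "B"}


def read_records(raw, description):
    """Decode fixed-size binary records using only the external description."""
    order = ">" if description["byte_order"] == "big" else "<"
    fmt = order + "".join(_TYPE_CODES[f["type"]] for f in description["fields"])
    size = struct.calcsize(fmt)
    records = []
    for offset in range(0, len(raw) - size + 1, size):
        values = struct.unpack_from(fmt, raw, offset)
        records.append(
            {
                field["name"]: value * field.get("scale", 1)
                for field, value in zip(description["fields"], values)
            }
        )
    return records


# Build two illustrative records in the described layout, then read them back.
raw = struct.pack(">ffh", 34.5, -86.5, 26000) + struct.pack(">ffh", 10.0, 20.0, 30000)
for record in read_records(raw, DESCRIPTION):
    print(record)
```

The application code never hard-codes the file layout; supporting a second format would mean writing a second description rather than a second reader, which is the interoperability argument made above.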
http://esml.itsc.uah.edu

Components of the ESML Interchange Technology
[Figure labels: DATA FORMAT1, DATA FORMAT2, DATA FORMAT3, OTHER FORMATS, each with an ESML FILE; ESML SCHEMA; ESML LIBRARY; ESML EDITOR; ESML DATA BROWSER; ADaM DATA MINING SYSTEM; OTHER APPLICATIONS]
ESML CONSISTS OF:
(1) MARKUPS: External description file for dataset or formats
(2) RULES FOR THE MARKUPS: Rules that govern the description of the data files
(3) MIDDLEWARE FOR AUTOMATION: Library parses and interprets the description file and figures out how to read the data

ESML in Numerical Modeling
[Figure labels: GOES Skin Temp; Insolation Products; Soundings, Others; ESML file (one per data source); Network; ESML Library; NUMERICAL WEATHER MODELS (MM5, ETA, RAMS)]
[Plot: Chn 5 Temperature (AMSU), Degree Kelvin, against Sea Surface Temperature (TMI), Degree Kelvin; label: Prediction]
Purpose:
• Use ESML to incorporate observational data into the numerical models for simulation
Scientists can:
• Select remote files across the network
• Select different observational data to increase the model prediction accuracy

Collaboration Tools
Technologies to coordinate complex projects
CAMEX-4 campaign
• Data acquisition and integration from multiple platforms and instruments for quick exploitation
• Intra-project communications before, during, and after CAMEX campaigns
• Collaborators included NASA, NOAA, USAF, and multiple universities
http://camex.msfc.nasa.gov

CAMEX-4 Distributed Mission Coordination
[Figure labels: Coordination Clearinghouse (Web-based interface, Data management, RDBMS); NASA managers review status; Experiment PI; NASA Aircraft; Forecasters; USAF Aircraft; NOAA Aircraft; Aircraft Crew: maintenance and report status; Radars; Mission Managers]

Modeling Environment for Atmospheric Discovery (MEAD): Use of the TeraGrid Infrastructure
MEAD:
• will develop/adapt a cyberinfrastructure that will enable simulation, data mining, and visualization of hurricanes and storms
• will integrate model and grid workflow management, data management, model coupling, and analysis/mining of large, ensemble datasets
Participants:
• Argonne National Lab
• Georgia Tech University
• Indiana University
• Lawrence Berkeley National Lab
• NCSA
• NOAA/FSL
• NOAA/NSSL
• Northwestern University
• Ohio State University
• Oklahoma University
• Portland State University
• Rice University
• Rutgers
• UAH
• UCAR
• University of Wisconsin
• University of Minnesota

Primary MEAD Software Components
• WRF Model (Weather Research and Forecasting)
• ROMS Model (Regional Ocean Modeling System)
• Coupled WRF/ROMS Model
• D2K (Data to Knowledge)
• ADaM (Algorithm Development and Mining System)
• Visualization Engines (NCAR Graphics, Vis5D, IDV-VisAD, HVR, VTK)
• netCDF, HDF5, ESML
• Middleware (Globus, JavaCog, GridFTP)
• Metadata Catalogue Service

Example MEAD Workflow
[Figure labels: Initial Setup (Initial Data and Parameters, for each model); Model Execution (Multiple WRF Models (Weather); Inter-model communications; Multiple ROMS Models (Ocean)); Post Run Analysis (Model Results to Data Mining (ADaM); Model Results to Visualization)]
Need the Grid to support the huge computational, data storage and post-analysis requirements

Linked Environments for Atmospheric Discovery (LEAD)
Create for the university community an integrated, scalable framework for use in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information independent of format and physical location.
Collaborators:
– University of Oklahoma
– University of Alabama in Huntsville
– UCAR/Unidata
– Indiana University
– University of Illinois/NCSA
– Millersville University
– Howard University
– Colorado State University

LEAD Architecture
[Figure: layered architecture. Labels: MyLEAD Portal; MyLEAD Virtual Environment; Personal Data Space; Interchange Technologies; Workflow Orchestration; Semantics for data and services; Application Services (Visualization tools, Models, Data Mining, Others…); Middleware (Data Management, Workflow Management, Monitoring, Resource Allocation, Scheduling, Security, Others…); Grid and Web infrastructure; Distributed Resources (national supercomputer facilities, pools of workstations, tertiary storage, clusters, scientific instruments)]

Collaborative Environment for Propulsion Research: Rocket Engine Advancement Program 2
• Consortium of propulsion research centers.
  – Auburn University
  – Purdue University
  – Pennsylvania State University
  – Tuskegee University
  – University of Alabama in Huntsville
  – University of Tennessee
  – NASA Marshall Space Flight Center
  – NASA Glenn Research Center
• Grid configuration will make distributed computational and data resources available to researchers without having to negotiate separate access to each resource.
• Linking or integration of multiple distributed experiment steps into a single investigation for more timely results and analysis.
• Will rely on the security capabilities of the Grid due to the sensitive nature of the propulsion research.
Collaborative Environment for Propulsion Research
[Figure labels: Cluster(s); Supercomputer; REAP2 Grid Portal; Test Equipment; REAP2 User Portal; Data and Results; Rocket Engine Advancement Program 2]

Evolution of Frameworks for Advanced Applications
• Changing Computational Landscape
  – GRIDS
  – Clusters
  – Web Services
  – Pervasive Computing
  – On-Board Processing
• Middleware for applications on GRID/Clusters
  – Automate parallelization of mining tasks
  – Estimate resource requirements using computational complexity of the algorithms
• Federated Model for Mining
  – Individual components that can be distributed and can execute across different platforms
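
The middleware goal of estimating resource requirements from algorithmic complexity can be sketched roughly as below. All names and numbers here are assumptions for illustration: the operation counts are the usual leading-order terms for k-means and for a square-kernel image filter, while `ops_per_second`, `n_nodes`, and `efficiency` are hypothetical parameters a scheduler would have to measure or supply.

```python
def kmeans_ops(n_points, n_clusters, n_dims, n_iters):
    """Leading-order distance operations for k-means: O(n * k * d * iterations)."""
    return n_points * n_clusters * n_dims * n_iters


def filter_ops(rows, cols, kernel_size=3):
    """Multiply-adds for a square-kernel filter: O(rows * cols * kernel_size^2)."""
    return rows * cols * kernel_size * kernel_size


def estimate_seconds(op_count, ops_per_second, n_nodes=1, efficiency=1.0):
    """Rough wall-clock estimate, assuming work divides evenly across nodes.

    efficiency (0 < efficiency <= 1) is an assumed parallel-efficiency factor.
    """
    return op_count / (ops_per_second * n_nodes * efficiency)


# Illustrative numbers only: cluster 1,000 three-band pixels into 5 classes.
ops = kmeans_ops(n_points=1000, n_clusters=5, n_dims=3, n_iters=10)
print(ops, estimate_seconds(ops, ops_per_second=1e6, n_nodes=3, efficiency=0.5))
```

An estimate of this kind is coarse, since it ignores I/O and communication costs, but it gives a scheduler a first-order basis for deciding how many nodes a mining task should request.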