Scalable Exploratory Data Mining of Distributed Geoscientific Data
Authors: E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng
by Sona Srinivasan

Outline
• Introduction
• Geoscientific Data Modeling
• Geoscientific Algebraic Operators
• Physical Data Model
• Parallel Query Execution
• Automatic Query Execution
• Heterogeneous Distributed Data Access
• Implementations and Experiences
• Conclusion
• References

Introduction
• Geoscience studies produce a tremendous amount of raw data
• Analysis involves extracting interesting geoscientific phenomena that are not observed directly in the raw datasets
• Example: cyclone tracks, the trajectories of low-pressure areas over time, can be extracted from a sea-level pressure dataset
• Both data mining in business applications and geoscientific feature extraction involve sifting through large volumes of isolated events and data to locate salient patterns
• Feature extraction is treated as a database query processing problem in order to take advantage of automatic query optimization and parallelization techniques
• Conquest: an extensible parallel geoscientific query processing system

Geoscientific Data Model
[Figure: Example Geographic Data Field]

Geoscientific Data Model
• Field: associates parameter values with cells in a multidimensional coordinate space
• Cells can be of various geometric object types
• Coordinate attributes: determine the type of the cells and the coordinate space they lie in
• Values for the cells lie in a multidimensional variable space
• Variable attributes: the type of values associated with a cell in the coordinate space
• Cell record: a cell and the variable value associated with it
• Cell coverage: the set of distinct cells in the coordinate space for which variable values are recorded

Geoscientific Algebraic Operators
• A base set of general-purpose logical field data manipulation operators. Users may introduce operators based on application-specific algorithms
• Set-Oriented Relational Operators: Selection, Projection, Cartesian Product, Union, Intersection, Set Difference, Join
• Sequence-Oriented Operators
• Grouping Operators: Nest and Unnest
• Space Conversion Operators

Physical Data Model
[Figure: Nesting of a Data Field]

Parallel Query Execution
• Parallelization techniques are used to remove bottlenecks in I/O and computation and to improve query performance
  • Pipelining Processing, or Dataflow Parallelism
  • Partitioning, or Intra-Operator Parallelism
  • Multicasting

Query Parallelization
• Window of Relevance: the maximum length of time between the arrival of an object and the time it ceases to have an effect on the execution state of the operator
  • Instantaneous
  • Known
  • Random but Bounded
  • Fixed Windows

Heterogeneous Distributed Data Access
• Only a small percentage of data is analyzed, due to unavailable storage and bandwidth and the difficulty of integrating distributed datasets
• Conquest accesses datasets both through a distributed object interface and through repository-specific scanner operators, because accessing data only through distributed objects eliminates the opportunity to use the query capabilities of the data repositories to optimize query evaluation

Implementations and Experiences
• Ported to run on IBM SP1, SP2, and Intel Paragon
• Has been used for the past five years for exploratory data analysis and data mining of spatio-temporal phenomena produced at UCLA, and also for the extraction and analysis of cyclonic activity, blocking features, and oceanic warm pools
[Figure: Number of upward wave propagation trajectories between 500 mb and 50 mb levels extracted per year]

Implementations … (Contd.)
[Figure: Number of upward wave propagation trajectories between 500 mb and 50 mb at different latitudes]

Conclusion
• Conquest: a geoscientific data model that applies distributed and parallel database query processing to handle computationally expensive data mining queries on massive datasets
• Helps analyze large volumes of data to extract the necessary information
• Query optimization emphasizes parallelization and optimal data access
• Future work: the system is currently being integrated as part of a larger environment

References
• E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng, "Scalable Exploratory Data Mining of Distributed Geoscientific Data", KDD, 1996
• E.C. Shek, E. Mesrobian, and R.R. Muntz, "On Heterogeneous Distributed Geoscientific Query Processing", Feb. 1996
• F. Fabbrocino, E.C. Shek, and R.R. Muntz, "The Design and Implementation of the Conquest Query Execution Environment", July 1997
• E. Mesrobian, et al., "Exploratory Data Mining and Analysis Using Conquest", May 1995
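
Illustrative sketch (not from the paper)
The field data model and the grouping operators summarized in this deck can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not Conquest's implementation: the names CellRecord, cell_coverage, select, nest, and unnest are hypothetical, cells are simplified to points in a (lat, lon, time) coordinate space, and the single variable attribute is a made-up sea-level pressure value.

```python
from collections import defaultdict
from typing import NamedTuple


class CellRecord(NamedTuple):
    """A cell (simplified to a point in lat/lon/time space) plus its variable value."""
    lat: float
    lon: float
    time: int
    pressure: float  # hypothetical variable attribute: sea-level pressure in mb


def cell_coverage(field):
    """Return the set of distinct cells for which a variable value is recorded."""
    return {(r.lat, r.lon, r.time) for r in field}


def select(field, predicate):
    """Selection: keep only the cell records that satisfy the predicate."""
    return [r for r in field if predicate(r)]


def nest(field, key):
    """Nest: group cell records that share the same key into sub-fields."""
    groups = defaultdict(list)
    for r in field:
        groups[key(r)].append(r)
    return dict(groups)


def unnest(nested):
    """Unnest: flatten grouped sub-fields back into a single list of records."""
    return [r for group in nested.values() for r in group]


# A tiny made-up field: two grid points observed at two time steps.
field = [
    CellRecord(30.0, 140.0, 0, 1012.0),
    CellRecord(32.5, 140.0, 0, 996.0),
    CellRecord(30.0, 140.0, 1, 1008.0),
    CellRecord(32.5, 140.0, 1, 990.0),
]

low_pressure = select(field, lambda r: r.pressure < 1000.0)
by_time = nest(field, key=lambda r: r.time)
```

Nesting by time step is the kind of grouping a tracking query would need before linking low-pressure cells across consecutive steps. The in-memory lists here stand in for what the slides describe as a physical data model with parallel execution.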