High-level Interfaces and Abstractions for Data-Driven Applications in a Grid Environment

Gagan Agrawal
Department of Computer Science and Engineering
(joint work with Liang Chen, Wei Du, Leo Glimcher, Ruoming Jin, Xiaogang Li, Swarup Sahoo, Li Weng, and Xuan Zhang)
(Funded by ACI-9733520, EIA-9703088, ACI-9982087, ACI-0130437, EIA-0203846, ACI-0234273, DoD PET program)

Data-Driven Applications
  - Becoming increasingly important
  - Can be extremely hard to develop for a grid environment
  - We need to focus on end-users who have used Matlab / SQL-like systems for data retrieval and analysis
  - Some issues to consider
      - Different data layouts and formats
      - Flexibly exploit different forms of parallelism
      - Adapting to available resources

Research Projects
  - Automatic Data Virtualization
      - XML-based high-level abstractions and use of XQuery
      - SQL-based front-end for the STORM system
  - FREERIDE (Framework for Rapid Implementation of Datamining Engines)
      - High-level specification of a parallel data mining algorithm
      - Flexibly exploit different forms of parallelism
  - GATES (Grid-based AdapTive Execution on Streams)
      - OGSA based
      - Support for processing distributed streams in a grid environment
      - Self adaptation to meet real-time constraints
  - Compiler-based front-end to DataCutter
      - Includes support for program adaptation
  - More details through four student posters this afternoon

Automatic Data Virtualization
  - Data virtualization refers to an abstract view of data for access and processing
  - Data services are methods that implement a virtual view of data
  - Our focus: using compiler techniques to automatically generate data services to support data virtualization
  - Two separate ongoing implementations
      - Using XML Schema based high-level abstractions and XQuery (ICS 2003, LCPC 2003, DBPL 2003; prior compiler work in ICS 2002, PACT 2001)
      - Supporting an SQL front-end for data subsetting operations (jointly with Saltz, Kurc, Catalyurek, et al.)

Project Overview
  [Figure: XQuery over data stored as HDF5, NetCDF, TEXT, XML, RMDB, ...]
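The idea of a data service that implements a virtual view can be illustrated with a small hand-written sketch. It is not the compiler-generated code the project describes: the field names (`time`, `soil`, `speed`), the two physical layouts, and all function names are assumptions chosen for illustration. Two services expose the same virtual records over different layouts, so the subsetting step is written once against the virtual view.

```python
import csv
import io
import struct
from typing import Dict, Iterator, List

Record = Dict[str, float]

# Hypothetical virtual schema: every record exposes these fields,
# whatever the physical layout underneath looks like.
FIELDS = ["time", "soil", "speed"]


def text_service(text: str) -> Iterator[Record]:
    """Data service for a comma-separated text layout, one record per line."""
    for row in csv.reader(io.StringIO(text)):
        yield {name: float(value) for name, value in zip(FIELDS, row)}


def binary_service(blob: bytes) -> Iterator[Record]:
    """Data service for a packed layout: three little-endian doubles per record."""
    for values in struct.iter_unpack("<3d", blob):
        yield dict(zip(FIELDS, values))


def subset(records: Iterator[Record], min_soil: float) -> List[Record]:
    """Subsetting written once against the virtual view, not a file format."""
    return [r for r in records if r["soil"] > min_soil]


# The same two records stored in two different physical layouts.
text_data = "1000,0.5,10\n1050,0.8,20\n"
binary_data = (struct.pack("<3d", 1000.0, 0.5, 10.0)
               + struct.pack("<3d", 1050.0, 0.8, 20.0))

from_text = subset(text_service(text_data), min_soil=0.7)
from_binary = subset(binary_service(binary_data), min_soil=0.7)
```

Both calls return the same single record, which is the point of the abstraction: the query does not change when the layout does. In the approach described on the slide, services of this kind would be generated automatically from a schema rather than written by hand.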
System Architecture
  [Figure: External Schema; XML Mapping Service; logical XML schema; physical XML schema; Compiler; XQuery/XPath; C++/C]

SQL-Based Front-end System Architecture

    SELECT * FROM IPARS
    WHERE RID in (0,6,26,27)
      AND TIME>1000 AND TIME<1100
      AND SOIL>0.7
      AND SPEED(OILVX, OILVY, OILVZ)<30.0;

  - Common operations: subsetting, filtering, user-defined filtering

FREERIDE Overview
  - Framework for Rapid Implementation of Datamining Engines
  - Demonstrated for a variety of standard mining algorithms
  - Targets distributed memory parallelism, shared memory parallelism, and their combination
  - Can be used as a basis for scalable grid-based data mining implementations
  - Developed on top of Active Data Repository (ADR) from Saltz's group at Maryland
  - Publications: SDM 01, 02, 03; SIGMETRICS 02; IPDPS 04; TKDE 04

Key Observation from Mining Algorithms
  - Most popular algorithms have a common canonical loop
  - Can be used as the basis for supporting a common middleware
  - Parallelism of different forms and execution on disk-resident datasets

    While ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        .......
    }

Applications of FREERIDE
  - Apriori and FP-tree based association mining
  - K-means and EM clustering
  - Nearest-neighbor search
  - RainForest-based decision tree construction
  - A new decision tree algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
  - Parallel versions cover distributed memory, shared memory, and their combination; one application is listed as shared memory only

Applying FREERIDE for Scientific Data Mining
  - Joint work with Machiraju and Parthasarathy
  - Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
  - A feature is a region of interest in a dataset
  - A suite of algorithms for extracting and tracking features
  - FREERIDE forms a basis for supporting high-level interfaces
      - Data Parallel Java: LCPC 2002, IPDPS 2003
      - Matlab / mining operators: planned in the future

GATES
  - Grid-based AdapTive Execution on Streams
  - Targets (distributed) processing of (distributed) data streams
  - Built on the OGSA model
  - Self adaptation to meet real-time constraints on processing

GATES: Motivation
  - Many applications involve high-volume data streams
      - Data from large scale experiments / simulations
      - Digitized images from a movie camera
      - Network traffic
  - Data may arise from distributed sources
  - Analysis / consumption of results may be distributed
      - Many users wanting different analyses / results
      - Insufficient compute power at one site

Self Adaptation in GATES
  - Goal: achieve the best accuracy with available resources, subject to a real-time constraint
  - GATES approach:
      - Programmer exposes certain parameters in the processing of each stage
          - Examples include: rate of sampling, size of summary structure
      - Programmer also specifies the direction of sensitivity
          - e.g. a larger summary structure means more computation / communication
      - Parameters are adjusted at runtime
          - Currently based upon size of buffers: signal the previous stage to become faster / slower if the buffer is too small / too large
          - Future possibilities: use profiling / performance models ...

Summary
  - Application development in a grid environment is hard
  - Need novel runtime techniques and middleware
  - Innovative applications of compiler technology can help
  - Equipment needs:
      - Need a controlled distributed environment
      - Need high-bandwidth connectivity, to simulate
          - High rate of data arrival
          - External clients with ability to receive data at high rates
      - Scaling work to systems with terabytes of storage
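The buffer-based adjustment described on the "Self Adaptation in GATES" slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GATES implementation or its API: the class `AdjustableParameter`, the function `adapt`, the low/high buffer marks, and the fixed step size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdjustableParameter:
    """A parameter a stage exposes to the runtime (hypothetical structure)."""
    name: str
    value: float
    minimum: float
    maximum: float
    step: float
    # Direction of sensitivity: +1 if a larger value means more
    # computation/communication, -1 if a larger value means less.
    sensitivity: int = 1

    def make_cheaper(self) -> None:
        self._move(-self.sensitivity)

    def make_costlier(self) -> None:
        self._move(self.sensitivity)

    def _move(self, direction: int) -> None:
        new_value = self.value + direction * self.step
        # Clamp so the parameter never leaves its declared range.
        self.value = min(self.maximum, max(self.minimum, new_value))


def adapt(param: AdjustableParameter, buffer_length: int,
          low_mark: int, high_mark: int) -> str:
    """Adjust the previous stage's parameter from this stage's input buffer length."""
    if buffer_length > high_mark:
        param.make_cheaper()    # buffer too large: previous stage should send less
        return "slower"
    if buffer_length < low_mark:
        param.make_costlier()   # buffer too small: previous stage may send more
        return "faster"
    return "unchanged"


# Example: the previous stage exposes its sampling rate (assumed values).
rate = AdjustableParameter("sampling_rate", value=0.5,
                           minimum=0.125, maximum=1.0, step=0.125)
signals = [adapt(rate, n, low_mark=10, high_mark=80) for n in (90, 95, 40, 5)]
```

With these assumed buffer readings the rate drops twice, holds, and then rises once. A fixed step is the simplest possible policy; the slide's "future possibilities" (profiling, performance models) would replace the thresholds and step with something better informed.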