THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne, George Mason University
[email protected], http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar, W. Griffin (UMBC), and C. Giannella (Loyola)
IVOA Interop – Baltimore – October 2008

Topics
• Distributed vs. P2P Data Mining
• Science Use Cases
• P2P Data Mining Project Plans
• Current Design & Status
• IVOA GWS Standards

Distributed Data Mining (DDM)
• DDM comes in 2 types:
  1. Distributed Mining of Data
  2. Mining of Distributed Data
• Type 1 requires sophisticated algorithms that operate with data in situ
• Type 2 takes many forms, with data being centralized (in whole or in partitions) or data remaining in place at distributed sites
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
  – C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. (2006). Distributed Data Mining for Astronomy Catalogs. Proceedings of 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International Conference on Data Mining (SDM), 2006. [http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf]
  – H. Dutta, C. Giannella, K. Borne and H. Kargupta. (2007). Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System. Proceedings of the SIAM International Conference on Data Mining, Minneapolis, USA, April 2007. [http://www.cs.umbc.edu/~hillol/PUBS/Papers/sdm07.pdf]

P2P Data Mining
• P2P Data Mining represents one possible implementation of DDM
• P2P has two types:
  – Task-parallel :: the compute processes are distributed across the nodes
  – Data-parallel :: the data are distributed across the nodes
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/ddmbib_html/DistSys.html
  – S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, S. Datta, and K. Liu. Clustering distributed data streams in peer-to-peer environments.
    Information Sciences, 176(14):1952-1985, 2006. [http://www.cs.umbc.edu/~hillol/PUBS/p2pDM.pdf]
  – K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. (2008). Distributed Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining, Volume 1, Issue 2, pp. 85-103. [http://www.cs.umbc.edu/~hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf]
  – S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. (2006). Distributed Data Mining in Peer-to-Peer Networks. (Invited submission to the IEEE Internet Computing special issue on Distributed Data Mining), Volume 10, Number 4, pp. 18-26. [http://www.cs.umbc.edu/~hillol/PUBS/P2PDM.pdf]

Why distributed data mining?
Because … Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- ...
[Image: “Just Checking”]

Some Fundamental Astronomy problems: most of these require VO-accessible distributed data
• Some key astronomy problems that can be addressed with distributed data:
  – Cross-Match objects from different catalogues
  – The distance problem (e.g., Photometric Redshift estimators)
  – Star-Galaxy Separation
  – Cosmic-Ray Detection in images
  – Supernova Detection and Classification
  – Morphological Classification (galaxies, AGN, gravitational lenses, ...)
  – Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
  – Dimension Reduction = Correlation Discovery
  – Learning Rules for improved classifiers
  – Classification of massive data streams
  – Real-time Classification of Astronomical Events
  – Clustering of massive data collections
  – Novelty, Anomaly, Outlier Detection in massive databases

Sample Astronomy Data Mining Applications: most of these require VO-accessible distributed data
– Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Supernova or Cosmic-ray hit?)
– Bayesian Network for Object Classification (star or galaxy?)
– PCA for finding Fundamental Planes of Galaxy Parameters
– PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
– Link Analysis (Association Mining) for Causal Event Detection (e.g., linking optical transients with gamma-ray events)
– Clustering analysis: Spatial, Temporal, or any scientific database parameters
– Markov models: Temporal mining, classification, and prediction from time series data

Class Discovery: feature separation and discrimination of classes across multiple databases
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• The separation of classes improves when attributes from disparate databases are chosen to be projected, as in the following star-galaxy discrimination test:
[Figure: two panels, labeled “Not good” and “Good”]

Novelty Discovery (Outlier Detection): improved discovery of rare objects across multiple databases

Correlation Discovery: Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
[Figure: % of variance captured by PC1+PC2 versus local galaxy density (low to high)]

Our Project Plans
• NASA-funded (AISR) project to implement a P2P distributed data mining system
• Provide a small number of “useful” data mining algorithms (one-to-one mapping with science use cases):
  – Clustering :: Class Discovery & Characterization
  – Outlier detection :: Novelty Discovery
  – PCA :: Correlation Discovery
• Select problems and algorithms that are decomposable: task-parallel and/or data-parallel
• Implement system within VO framework

Architecture: NASA project (back-end)
[Diagram labels: User; User Interface; Metadata with cross-matched information; Portion of metadata chosen based on user query]

IVOA GWS Standards
• GWS standards enable access to distributed data and distributed compute resources
• Nodes in P2P system individually request distributed data partitions
• Workflow is distributed across the P2P compute nodes
• P2P activities are stateful & asynchronous
• Relevant GWS activities: Security, VOSpace, Asynchronous services, Single Sign-on, Universal Worker Service (UWS), Logging

GWS functions required by P2P Data Mining Environment
• Acquiring & managing nodes and workspaces (VOSpace)
• Single sign-on to nodes (SSO)
• Distributing work and metadata to nodes (GRID)
• Cone-search and other data requests submitted from compute nodes to data repositories
  – RESTful services?
• Secure stateful asynchronous computations (UWS)
  – Communicate results between nodes, as required by some DDM algorithms
• Recording and sharing results, and demonstrating interoperable multi-database VO science (Logging)
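
Illustrative sketch (not from the slides): the data-parallel case, in which data stay on the nodes and only results are communicated between them, can be pictured with a very small example. Each node reduces its own partition to a few summary numbers, and the summaries are merged to give a global mean and variance. The names LocalSummary, summarize, merge and mean_and_variance are hypothetical, and the in-memory lists stand in for data partitions that real nodes would hold locally. This is not the project's algorithm, only a minimal instance of the idea.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Sufficient statistics a node computes over its own partition (hypothetical type)."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def summarize(values):
    """Runs locally on one node: reduce the partition to three numbers."""
    s = LocalSummary()
    for v in values:
        s.n += 1
        s.total += v
        s.total_sq += v * v
    return s


def merge(a, b):
    """Combine two summaries; only these small records travel between nodes."""
    return LocalSummary(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def mean_and_variance(s):
    """Global mean and population variance from the merged summary."""
    mean = s.total / s.n
    variance = s.total_sq / s.n - mean * mean
    return mean, variance


# Three stand-in partitions, one per node (assumed toy data).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
summaries = [summarize(p) for p in partitions]

combined = summaries[0]
for s in summaries[1:]:
    combined = merge(combined, s)

global_mean, global_var = mean_and_variance(combined)
print(global_mean, global_var)
```

Because merge is associative and commutative, the summaries can be combined in any order and along any path through the peer network. The same pattern extends to covariance matrices (sums of outer products), which is the input PCA needs. The sum-of-squares formula used here is numerically fragile for large values, so a production version would likely use a more stable update.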
