THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne
George Mason University
[email protected], http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar, W. Griffin (UMBC), and C. Giannella (Loyola)
IVOA Interop – Baltimore – October 2008

Topics
• Distributed vs. P2P Data Mining
• Science Use Cases
• P2P Data Mining Project Plans
• Current Design & Status
• IVOA GWS Standards

Distributed Data Mining (DDM)
• DDM comes in 2 types:
  1. Distributed Mining of Data
  2. Mining of Distributed Data
• Type 1 requires sophisticated algorithms that operate with data in situ
• Type 2 takes many forms, with data being centralized (in whole or in partitions) or data remaining in place at distributed sites
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
  – C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. (2006). Distributed Data Mining for Astronomy Catalogs. Proceedings of 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International Conference on Data Mining (SDM), 2006. [ http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf ]
  – H. Dutta, C. Giannella, K. Borne and H. Kargupta. (2007). Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System. Proceedings of the SIAM International Conference on Data Mining, Minneapolis, USA, April 2007. [ http://www.cs.umbc.edu/~hillol/PUBS/Papers/sdm07.pdf ]

P2P Data Mining
• P2P Data Mining represents one possible implementation of DDM
• P2P has two types:
  – Task-parallel :: the compute processes are distributed across the nodes
  – Data-parallel :: the data are distributed across the nodes
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/ddmbib_html/DistSys.html
  – S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, S. Datta, and K. Liu. Clustering distributed data streams in peer-to-peer environments.
    Information Sciences, 176(14):1952-1985, 2006. [ http://www.cs.umbc.edu/~hillol/PUBS/p2pDM.pdf ]
  – K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. (2008). Distributed Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining, Volume 1, Issue 2, pp. 85-103. [ http://www.cs.umbc.edu/~hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf ]
  – S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. (2006). Distributed Data Mining in Peer-to-Peer Networks. (Invited submission to the IEEE Internet Computing special issue on Distributed Data Mining), Volume 10, Number 4, pp. 18-26. [ http://www.cs.umbc.edu/~hillol/PUBS/P2PDM.pdf ]

Why distributed data mining?
Because … Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- ...
“Just Checking”

Some Fundamental Astronomy problems: most of these require VO-accessible distributed data
• Some key astronomy problems that can be addressed with distributed data:
  • Cross-Match objects from different catalogues
  • The distance problem (e.g., Photometric Redshift estimators)
  • Star-Galaxy Separation
  • Cosmic-Ray Detection in images
  • Supernova Detection and Classification
  • Morphological Classification (galaxies, AGN, gravitational lenses, ...)
  • Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
  • Dimension Reduction = Correlation Discovery
  • Learning Rules for improved classifiers
  • Classification of massive data streams
  • Real-time Classification of Astronomical Events
  • Clustering of massive data collections
  • Novelty, Anomaly, Outlier Detection in massive databases

Sample Astronomy Data Mining Applications: most of these require VO-accessible distributed data
– Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Supernova or Cosmic-ray hit?)
– Bayesian Network for Object Classification (star or galaxy?)
– PCA for finding Fundamental Planes of Galaxy Parameters
– PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
– Link Analysis (Association Mining) for Causal Event Detection (e.g., linking optical transients with gamma-ray events)
– Clustering analysis: Spatial, Temporal, or any scientific database parameters
– Markov models: Temporal mining, classification, and prediction from time series data

Class Discovery: feature separation and discrimination of classes across multiple databases
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• The separation of classes improves when attributes from disparate databases are chosen to be projected, as in the following star-galaxy discrimination test:
[Figure: star-galaxy discrimination test, two panels labeled “Not good” and “Good”]

Novelty Discovery (Outlier Detection): improved discovery of rare objects across multiple databases

Correlation Discovery: Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
[Figure: % of variance captured by PC1+PC2 (vertical axis) versus Local Galaxy Density, low to high (horizontal axis)]

Our Project Plans
• NASA-funded (AISR) project to implement a P2P distributed data mining system
• Provide a small number of “useful” data mining algorithms (one-to-one mapping with science use cases):
  – Clustering :: Class Discovery & Characterization
  – Outlier detection :: Novelty Discovery
  – PCA :: Correlation Discovery
• Select problems and algorithms that are decomposable: task-parallel and/or data-parallel
• Implement system within VO framework

Architecture: NASA project (back-end)
[Diagram labels: User; User Interface; Metadata with cross-matched information; Portion of metadata chosen based on user query]

IVOA GWS Standards
• GWS standards enable access to distributed data and distributed compute resources
• Nodes in P2P system individually request distributed data partitions
• Workflow is distributed across the P2P compute nodes
• P2P activities are stateful & asynchronous
• Relevant GWS activities: Security, VOSpace, Asynchronous services, Single Sign-on, Universal Worker Service (UWS), Logging

GWS functions required by P2P Data Mining Environment
• Acquiring & managing nodes and workspaces (VOSpace)
• Single sign-on to nodes (SSO)
• Distributing work and metadata to nodes (GRID)
• Cone-search and other data requests submitted from compute nodes to data repositories
  – RESTful services?
• Secure stateful asynchronous computations (UWS)
  – Communicate results between nodes, as required by some DDM algorithms
• Recording and sharing results, and demonstrating interoperable multi-database VO science (Logging)
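
Illustration (not from the slides): the need to communicate results between nodes in a data-parallel setting can be sketched with the covariance step that precedes PCA. In this minimal sketch each node summarizes its own partition, only the small summaries are exchanged, and the merged summary reproduces the covariance of the full data set. It assumes a simple pairwise merge rather than the peer-to-peer protocols in the cited papers, and every name in it (local_stats, merge_stats, covariance, top_variance_fraction, node_a, node_b) is hypothetical, as are the toy data values.

```python
# Illustrative sketch only: a data-parallel covariance merge, not the project's algorithm.
# All function names and data values below are hypothetical.


def local_stats(rows):
    """Summarize one node's partition: row count, column sums, sums of products."""
    d = len(rows[0])
    sums = [0.0] * d
    prods = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                prods[i][j] += row[i] * row[j]
    return len(rows), sums, prods


def merge_stats(a, b):
    """Combine two summaries; only these small summaries travel between nodes."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    prods = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, sums, prods


def covariance(stats):
    """Sample covariance matrix recovered from a merged summary."""
    n, sums, prods = stats
    d = len(sums)
    means = [s / n for s in sums]
    return [[(prods[i][j] - n * means[i] * means[j]) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_variance_fraction(cov, iterations=200):
    """Fraction of total variance along the first principal component.

    Uses plain power iteration from an all-ones start vector, which assumes
    that vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue / sum(cov[i][i] for i in range(d))


# Two nodes, each holding its own partition of a two-parameter catalogue (toy values).
node_a = [[1.0, 2.0], [2.0, 4.1], [3.0, 6.2]]
node_b = [[4.0, 7.9], [5.0, 10.1]]

merged = merge_stats(local_stats(node_a), local_stats(node_b))
cov = covariance(merged)
fraction = top_variance_fraction(cov)
print("merged covariance:", cov)
print("fraction of variance captured by PC1:", fraction)
```

The design point is that the summaries are additive, so nodes can merge them in any order and never need to ship raw rows; an actual P2P deployment would replace the single merge call with whatever exchange pattern the chosen DDM algorithm requires.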