Data Mining and Virtual Observatory
Yanxia Zhang
National Astronomical Observatories, CAS
Dec. 2, 2004

Outline
Why
What
How

Astronomy is Facing a Major “Data Avalanche”
Multi-terabyte sky surveys and archives (soon: multi-petabyte), billions of detected sources, hundreds of measured attributes per source …

Necessity Is the Mother of Invention
Understanding of complex astrophysical phenomena requires complex and information-rich data sets, and the tools to explore them …
… This will lead to a change in the nature of the astronomical discovery process …
… which requires a new research environment for astronomy: VO

DM: Confluence of Multiple Disciplines
[Diagram: DM at the center, surrounded by: database systems, data warehouses, OLAP; ML & AI; information science; statistics; visualization; other disciplines]

What is DM?
The search for interesting patterns in large databases that were collected for other applications, using machine learning algorithms, high-performance computers, and other methods, for science and society!

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Diagram labels: Pattern Evaluation; Data Mining; Task-relevant Data; Selection; Data Warehouse; Data Cleaning; Data Integration; Databases]

Data Mining
Increasing potential to support decisions
[Pyramid diagram labels: Knowledge Discovery; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (OLAP, MDA, Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts; Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)]
[Roles shown beside the pyramid: End User; Scientist Analyst; Data Analyst; DBA]

Architecture: Typical Data Mining System
[Diagram labels: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge base; Database or data warehouse server; Data cleaning & data integration; Filtering; Databases; Data Warehouse]

The Ratio of Every DM Step
[Bar chart; categories: Decide target, Data preparing, Data mining, Evaluation]

DM: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB systems and information repositories:
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

Data Mining Functionality
Concept description
Association
Classification and Prediction
Clustering
Time-series analysis
Other pattern-directed or statistical analysis

Taking a Broader View: The Observable Parameter Space
[Diagram axes: Flux; Non-EM …; Morphology / Surf. Br.; Time; Wavelength; Polarization; Proper motion; RA; Dec]
What is the coverage? Where are the gaps? Where do we go next?
Along each axis the measurements are characterized by the position, extent, sampling and resolution.
All astronomical measurements span some volume in this parameter space.

How and Where are Discoveries Made?
Conceptual Discoveries: e.g., Relativity, QM, Brane World, Inflation … Theoretical, may be inspired by observations
Phenomenological Discoveries: e.g., Dark Matter, QSOs, GRBs, CMBR, Extrasolar Planets, Obscured Universe … Empirical, inspire theories, can be motivated by them
[Diagram labels: New Technical Capabilities; IT/VO; Observational Discoveries; Theory; (VO)]
Phenomenological Discoveries:
Pushing along some parameter space axis: VO useful
Making new connections (e.g., multi-): VO critical!
Understanding of complex astrophysical phenomena requires complex, information-rich data (and simulations?)

Exploration of observable parameter spaces and searches for rare or new types of objects

But Sometimes You Find a Surprise…

Precision Cosmology and LSS
Better matching of theory and observations
Clustering on a clustered background
Clustering with a nontrivial topology
[Figure panels: DPOSS Clusters (Gal et al.); LSS Numerical Simulation (VIRGO)]

Exploration of the Time Domain: Optical Transients
A possible example of an “Orphan Afterglow” (GRB?) discovered in DPOSS: an 18th mag transient associated with a 24.5 mag galaxy.
At an estimated z ~ 1, the observed brightness is ~ 100 times that of a SN at the peak.
[Figure panels: DPOSS; Keck]
Or, is it something else, new?

Exploration of the Time Domain: Faint, Fast Transients (Tyson et al.)

Exploring the Low Surface Brightness (Low Contrast) Universe
Comparison between HI, Hα, and 100 μm Diffuse Emission
[Figure panels: DPOSS red image; IRAS 100 micron image (Brunner et al.)]

Background Enhancement Technique demonstrated on two known M31 dwarf spheroidals (Brunner et al.)

Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Automated Pattern Recognition?
(Every object detection algorithm has its biases and limitations)

An OLAM Architecture
[Diagram: a mining query enters and a mining result leaves through the user interface]
Layer 4, User Interface: User GUI API
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine; Data Cube API
Layer 2, MDDB: MDDB, Meta Data; Filtering & Integration; Database API; Filtering
Layer 1, Data Repository: Databases, Data Warehouse; Data cleaning, Data integration

View of Warehouses and Hierarchies
Importing data
Table browsing
Dimension creation
Dimension browsing
Cube building
Cube browsing

Selecting a Data Mining Task
Major data mining functions:
Summary (Characterization)
Association
Classification
Prediction
Clustering
Time-Series Analysis

Mining Characteristic Rules
Characterization: data generalization/summarization at high abstraction levels.
An example query: find a characteristic rule for Cities from the database ‘CITYDATA’ in relevance to location, capita_income, and the distribution of count% and amount%.
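The characterization query above can be illustrated with a minimal sketch. It assumes that each city record carries a location, a capita_income value, and an amount measure; the sample rows, the income thresholds, and the function names are hypothetical and do not come from the CITYDATA database or the system shown in the slides. The sketch generalizes capita_income to a coarse level, groups the records, and reports each group's share of the record count (count%) and of the summed measure (amount%).

```python
from collections import defaultdict

# Hypothetical rows standing in for the 'CITYDATA' database named on the slide.
CITYDATA = [
    {"city": "A", "location": "East", "capita_income": 8000, "amount": 100.0},
    {"city": "B", "location": "East", "capita_income": 9000, "amount": 100.0},
    {"city": "C", "location": "West", "capita_income": 15000, "amount": 300.0},
    {"city": "D", "location": "West", "capita_income": 25000, "amount": 500.0},
]


def income_level(value):
    """Generalize a numeric income to a coarse concept (thresholds are assumed)."""
    if value < 10000:
        return "low"
    if value < 20000:
        return "medium"
    return "high"


def characterize(rows):
    """Group rows by (location, income level) and return (count%, amount%) per group."""
    groups = defaultdict(lambda: [0, 0.0])  # key -> [record count, summed amount]
    for row in rows:
        key = (row["location"], income_level(row["capita_income"]))
        groups[key][0] += 1
        groups[key][1] += row["amount"]
    total_count = len(rows)
    total_amount = sum(row["amount"] for row in rows)
    return {
        key: (100.0 * count / total_count, 100.0 * amount / total_amount)
        for key, (count, amount) in groups.items()
    }


if __name__ == "__main__":
    for key, (count_pct, amount_pct) in sorted(characterize(CITYDATA).items()):
        print(key, f"count%={count_pct:.1f}", f"amount%={amount_pct:.1f}")
```

Replacing the exact income with a level is the "generalization" step; the two percentages are the summary that a characteristic rule would report for each generalized group.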
Browsing a Data Cube
Powerful visualization
OLAP capabilities
Interactive manipulation

Visualization of Data Dispersion: Boxplot Analysis

Mining Association Rules (Table Form)

Association Rule in Plane Form

Association Rule Graph

Mining Classification Rules

Prediction: Numerical Data

Prediction: Categorical Data

DMiner: Architecture
[Diagram labels: Graphic User Interface; Characterizer; Cluster Analyzer; Comparator; Associator; Classifier; Future Modules; Database and Cube Server; Radio DB; Infrared DB; Optical DB; … DB]

A System Prototype for MultiMedia Data Mining
Simon Fraser University
[Diagram labels: WWW; Image features; Internet Domain Hierarchy; Keywords; Pre-built Concept Hierarchies for colour, texture, format, etc.; Metadata; WordNet; Pre-processing; Pattern discoveries; Keyword Hierarchy; Data Cubes and Numeric Hierarchies; Real-time Interaction]

[Diagram labels: Media Descriptors; WWW; Discoveries; Database; Mining Engine; Data Cube; Dimensions]

WebLogMiner Architecture
The web log is filtered to generate a relational database
A data cube is generated from the database
OLAP is used to drill down and roll up in the cube
OLAM is used for mining interesting knowledge
[Diagram: 1. Data Cleaning (web log to database); 2. Data Cube Creation (database to data cube); 3. OLAP (data cube to sliced and diced cube); 4. Data Mining (cube to knowledge)]

VO: Conceptual Architecture
[Diagram labels: User; Discovery tools; Analysis tools; Gateway; Data Archives]

Conclusion
◆ Development and application of DM in astronomy;
◆ Automated DM, visualized DM and audio DM;
◆ Integrate VO and DM.
The next golden age of discovery in astronomy will come earlier!

Q&A?
Thank you!
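Supplementary sketch: the four WebLogMiner steps listed in the deck (cleaning the web log, building a data cube, OLAP roll-up and slicing, then mining) can be illustrated in a few lines. This is only a rough illustration under stated assumptions, not the WebLogMiner implementation: the log format ("date hour path status"), the sample lines, and all function names are hypothetical, and the "mining" step is reduced to finding the most requested page.

```python
from collections import Counter

# Hypothetical web-log lines in an assumed "date hour path status" format.
RAW_LOG = [
    "2004-12-01 09 /index.html 200",
    "2004-12-01 09 /data/cat.html 200",
    "2004-12-01 10 /index.html 404",
    "2004-12-02 09 /index.html 200",
    "malformed line",
]


def clean(lines):
    """Step 1: keep well-formed, successful requests as (date, hour, path) records."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[3] == "200":
            records.append((parts[0], parts[1], parts[2]))
    return records


def build_cube(records):
    """Step 2: count hits in each (date, hour, path) cell."""
    return Counter(records)


def roll_up(cube, keep):
    """Step 3a: aggregate the cube onto the dimension indices listed in `keep`."""
    rolled = Counter()
    for cell, hits in cube.items():
        rolled[tuple(cell[i] for i in keep)] += hits
    return rolled


def slice_cube(cube, dim, value):
    """Step 3b: keep only the cells whose dimension `dim` equals `value`."""
    return Counter({cell: hits for cell, hits in cube.items() if cell[dim] == value})


if __name__ == "__main__":
    cube = build_cube(clean(RAW_LOG))
    print("hits per date:", dict(roll_up(cube, keep=(0,))))
    print("cells on 2004-12-01:", dict(slice_cube(cube, dim=0, value="2004-12-01")))
    # Step 4 (a stand-in for mining): the most requested page overall.
    print("top page:", roll_up(cube, keep=(2,)).most_common(1))
```

Rolling up drops dimensions by summing over them; drilling down is the reverse, returning to a cube that keeps more dimensions.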