Data Mining in the Multidimensional Parameter Space
Yanxia Zhang
National Astronomical Observatories, CAS
Nov. 27, 2003

Outline
- Why and What
- DM Technology
- Future Directions

Why
- Necessity is the mother of invention
- Data avalanche, VO, DM & KDD
[Figure: panels labelled IRAS 25 μm, 2MASS 2 μm, DSS Optical, IRAS 100 μm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]

What
- DM (KDD):
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases
- Alternative names:
  - Data mining: a misnomer?
  - Knowledge discovery in databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.

Taxonomy of DM
In large scientific databases, DM comes in two flavors:
- Event-based mining
- Relationship-based mining

Event-Based Mining for Science
Event-based mining is based upon events or trends in data.
- Known events / known algorithms - use existing physical models (descriptive models) to locate known phenomena of interest either spatially or temporally within a large database.
- Known events / unknown algorithms - use pattern recognition and clustering properties of data to discover new observational (in our case, astrophysical) relationships among known phenomena.
- Unknown events / known algorithms - use expected physical relationships (predictive models) among observational parameters of astrophysical phenomena to predict the presence of previously unseen events within a large complex database.
- Unknown events / unknown algorithms - use thresholds or trends to identify transient or otherwise unique ("one-of-a-kind") events and therefore to discover new phenomena.

Relationship-Based Mining for Science
Relationship-based mining is based on associations.
- Spatial associations -- identify events (astronomical objects) at the same location in the sky.
- Temporal associations -- identify events occurring during the same or related periods of time.
- Coincidence associations -- use clustering techniques to identify events that are co-located within a multi-dimensional parameter space.

Science Requirements for DM
- Cross-Identification - refers to the classical problem of associating the source list in one database to the source list in another.
- Cross-Correlation - refers to the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.
- Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.
- Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.

Outline
- Why and What
- DM Technology
- Future Directions

DM: Confluence of Multiple Disciplines
[Figure: DM surrounded by contributing disciplines: database systems, data warehouses and OLAP; ML & AI; information science; statistics; visualization; other disciplines]

DM: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: KDD process with stages labelled Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

DM Functionality
- Concept description: characterization and comparison:
  - Generalize, summarize, and possibly contrast data characteristics, e.g., stars vs. galaxies.
- Association:
  - From association, correlation, to causality.
  - Finding rules like "stars → point sources".
- Classification and prediction:
  - Classify data based on the values in a classifying attribute, e.g., classify objects based on spectra, or classify galaxies and stars based on images.
  - Predict some unknown or missing attribute values based on other information.

DM Functionality (Cont.)
- Clustering:
  - Group data to form new classes, e.g., cluster spectra data to find distribution patterns.
- Time-series analysis:
  - Trend and deviation analysis: find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., variable stars.
  - Similarity-based pattern-directed analysis: find and characterize user-specified patterns in large databases.
  - Cyclicity/periodicity analysis: find segment-wise or total cycles or periodic behaviours in time-related data.
- Other pattern-directed or statistical analysis

DM: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB systems and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Challenges in DM
- Mining methodology issues
  - Mining different kinds of knowledge in databases.
  - Interactive mining of knowledge at multiple levels of abstraction.
  - Incorporation of background knowledge.
  - DM query languages and ad-hoc DM.
  - Expression and visualization of DM results.
  - Handling noise and incomplete data.
  - Pattern evaluation: the interestingness problem.
- Performance issues:
  - Efficiency and scalability of DM algorithms.
  - Parallel, distributed and incremental mining methods.

Challenges in DM (Cont.)
- Issues related to the variety of data types:
  - Handling relational and complex types of data.
  - Mining information from heterogeneous databases and global information systems.
- Issues related to applications and social impacts:
  - Application of discovered knowledge:
    - Domain-specific data mining tools
    - Intelligent query answering
    - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem.
  - Protection of data security and integrity.

Mining Association Rules
- Association rule mining:
  - Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
- Applications:
  - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
- Examples:
  - Rule form: LHS → RHS [support, confidence].
  - buys(x, diapers) → buys(x, beers) [0.5%, 60%]
  - major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]

Methods for Mining Associations
- The Apriori principle: any subset of a frequent itemset must be frequent.
  - (Agrawal & Srikant’94, Mannila, Klementen, et al’94)
- Partition technique (Savasere, Omiecinski, Navathe’95)
- Sampling technique (Toivonen’96)
- Multi-level or generalized association (Agrawal & Srikant’95, Han & Fu’95)
- Quantitative association rule mining (Srikant & Agrawal’96, Lent et al.’97, Miller’97)
- Constraint-based or query-based association (Ng, et al’98, Tsur et al’98)
- From association to correlation (Brin et al’97)

Classification
- Data categorization based on a set of training objects.
- Applications: classification of stars, galaxies, AGN, etc.
- Example: classify AGN and provide the symptoms which describe each class or subclass.
- The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
  - classification of future test data,
  - better understanding of each class, and
  - prediction of certain properties and behaviors.

Three Schemes in Classification
- Knowledge to be mined:
  - Summarization (characterization), comparison, association, classification, clustering, trend, deviation and pattern analysis, etc.
  - Mining knowledge at different abstraction levels: primitive level, high level, multiple-level, etc.
- Databases to be mined:
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, etc.
- Techniques adopted:
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Major Classification Methods
- Decision tree-based classification:
  - Training set vs. test set or cross-validation
  - Overfitting problem and tree pruning
  - Boosting techniques
- Bayesian classification:
  - Naïve Bayesian classification
  - Bayesian belief networks
  - Boosting techniques (e.g., AdaBoost)
- Neural network approach:
  - Multi-layer networks and back-propagation
- Genetic algorithms:
  - Genetic operators and fitness function selection

Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data.
- One can only predict value ranges or category distributions.
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction.
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis.

Data Clustering Analysis
- Clustering:
  - Partitioning a set of data (or objects) into a set of classes, called clusters, such that the members of each class share some interesting common properties.
- High quality clusters:
  - the intra-class similarity is high.
  - the inter-class similarity is low.
- Measuring data clustering quality
  - Distance functions

Three Categories of Clustering Techniques
- Partitioning-based: basically enumerate various partitions and then score them by some criterion. K-means, K-medoids, etc.
- Hierarchy-based: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Model-based: a model is hypothesized for each of the clusters, and the goal is to find the best fit of the data to that model. E.g., Bayesian classification (AutoClass), Cobweb.
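
The partitioning-based category can be illustrated with a minimal K-means sketch. It assumes plain Euclidean distance, a fixed number of clusters k, and made-up 2-D points standing in for objects in a parameter space. The function name kmeans, its parameters, and the sample data are illustrative assumptions, not taken from the slides.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Partition `points` (tuples of floats) into k clusters (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups in a 2-D parameter space.
sample = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(sample, k=2)
print(labels)
print(centroids)
```

The score being minimized here is the usual within-cluster sum of squared distances. K-medoids differs in that each cluster center must be one of the data points, which makes it less sensitive to outliers.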

Database Clustering Methods
- CLARANS (Ng & Han’94):
  - An extension to the k-medoid algorithm based on randomized search.
- BIRCH (Zhang et al’96):
  - CF tree (a balanced tree structure).
- DBSCAN (EKXS96):
  - Connects regions of sufficiently high density into clusters.
- STING (WYM97):
  - A hierarchical cell structure that stores statistical information.
- CLIQUE (Agrawal et al’98):
  - Clusters high-dimensional data.

Time-Series DM
- Trend and deviation analysis
  - Find trends (data evolution regularity) and deviations.
  - Regression analysis, visualization techniques.
- Subsequence analysis: similarity search
  - Subsequence matching: normalization + matching
  - Template specification: shape and macro specification.
- Sequential pattern analysis
  - Sequential association rules
- Periodicity analysis
  - Full periods vs. partial periods, cyclic association rules.

Similarity Search in DM
- Faloutsos et al. (1994):
  - Extract features from each window
  - Fourier transform & R*-tree structure.
- Agrawal et al. (1995):
  - Amplitude scaling, offset translation
  - Distance is determined from the sequence envelopes
- Agrawal et al. (1995):
  - SDL pattern language to encode queries about "shapes"
- Jagadish et al. (1997):
  - Domain-independent framework
  - "find all objects that are similar to some objects in class A and are not similar to any object in class B"

Periodic Pattern Search in Time-Related Data Sets
- Full cycle analysis:
  - Fourier transformation, other statistical analysis methods
- Fragment-wise cyclic behavior analysis:
  - Example: Jack reads the NY Times at 9:00 am every day.
  - Given (natural) periods vs. arbitrary periods.
  - A data cube and OLAP-based technique (Han, Gong and Yin’98)
- Cyclic association rules:
  - Associations which form cycles.
  - Cyclic Association Rules (B. Özden, S. Ramaswamy, A. Silberschatz, 1998)

Conclusions
- Data warehouse: an industry trend
  - A DW stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.
- OLAP provides an interactive data analysis environment
- OLAM: integration of mining with OLAP
  - Takes advantage of the data warehouse infrastructure.
  - From batch mining to interactive, multi-dimensional mining
- Many interesting research and implementation issues
- Database mining and warehouse mining are both important directions to pursue.

Outline
- Why and What
- DM Technology
- Future Directions

Future Work on DM Research
- Integration with data warehouse, OLAP, and relational technology
- Scalability: efficient algorithms, parallel/distributed and incremental mining
- Ad-hoc mining query language and its optimization
- Multiple, integrated DM functions and methods
- Mining on new kinds of data: time-series data, text, multimedia, spatial and Web
- Visual DM and knowledge visualization
- Application exploration
- Interactive, exploratory DM environment

Integration with Data Warehouse and OLAP Technology
- Data warehouse: a strong industry trend
  - Huge amounts of subject-oriented, cleansed, integrated, consolidated, time-related data are stored in data warehouses.
- OLAP provides an interactive data analysis environment
- Integrating mining with OLAP leads to multi-dimensional DM:
  - On-Line Analytical Mining (OLAM).

Efficiency and Scalability in DM
- Efficient algorithms in every DM function
  - Class description: summarization and comparison
  - Classification and prediction
  - Clustering
  - Time-series and trend analysis
- Real-time, fast response in exploratory DM
- Progressive, multiple precision data analysis
- Parallel and distributed DM algorithms
- Incremental DM methods

Why Parallel and Distributed DM?
- Massive data sets:
  - From megabytes to gigabytes and terabytes.
- Costly data mining algorithms:
  - Association, classification, clustering, prediction.
- Data in applications are geographically distributed.
- Parallel and networked computers are widely available.

DM Query Optimization
- Ad-hoc DM query language: DM SQL.
- DM query optimization:
  - How to carve a DM view?
  - How to push user-specified rule constraints?
  - How to integrate interestingness measures in mining?
- Interactivity of DM: a new challenge.

Multiple, Integrated DM Functions and Methods
- Multiple mining functions
  - Concept description: characterization and discrimination
  - Classification and prediction
  - Clustering
  - Association and correlation analysis
- Multiple mining methods
  - Statistical approaches
  - Machine learning approaches
  - Neural network approach
  - Other approaches: mathematical models, etc.

Mining Complex Types of Data
- Text data mining:
  - Library databases, e-mails, book stores, Web pages.
- Spatial data mining:
  - Geographic information systems, engineering databases, medical image databases.
- Multimedia data mining:
  - Image and video/audio databases.
- Web mining:
  - Unstructured and semi-structured data
  - Web access pattern analysis: easily doable.

Visual DM
- The power of visual comprehension:
  - A picture = a thousand words
  - Pattern recognition and exploratory mining
- Visual mining techniques:
  - Data visualization
  - Integration with other mining methods
- Visual representation of knowledge
  - Charts, graphs, trees, curves, cubes
  - Multi-dimensional representation: color, shape, texture, gray-level, etc.

Exploration of DM Applications
- Need more success stories:
  - Insurance and market analysis, NBA strategy analysis.
- Most current DM systems lack a "thick semantic layer"
  - like the early relational database systems without application software.
- Customized data mining systems:
  - Market analysis DM systems
  - Insurance and customer analysis systems

Towards an Integrated, Exploratory DM Environment
- Exploratory DM:
  - Interactive, user-centered, exploratory mining process
  - High performance and fast response
- Integrated multiple DM functions and methods:
  - Try different approaches to see which one is better
  - Try different functions to see which patterns are more interesting
- Automated mining and interactive mining: not too far apart!

Towards VO-based DM
- Success of DM
- Success of VO
- Hope the day comes earlier

Thank you!