Overview of Distributed Data Mining
Xiaoling Wang
March 11, 2003

Data Mining
  “We are drowning in information, but starving for knowledge.” - John Naisbitt
  What is data mining?
    – Closely related to knowledge discovery
    – Discovering useful, usually unknown patterns from data
    – Data: a set of facts F (e.g., cases in a database)
    – Pattern: an expression E describing the facts in a subset F_E of F

Goals of Data Mining
  Goals
    – Prediction
    – Description
  Domains
    – Induction, Compression, Querying, Approximation, Search

Basic Techniques of Data Mining
  Basic techniques
    – Clustering
    – Association rule discovery
    – Classification
    – Sequential pattern discovery
    – Outlier detection

Data Warehouse Architecture
  [Diagram, top to bottom: Data Mining Algorithm; Data Warehouse; Data Transformation & Integration; Extractor, Extractor, Extractor, …; Data source, Data source, Data source]

Distributed Data Mining Framework
  [Diagram, top to bottom: Final Model; Local Model Aggregation; Local Model, Local Model, Local Model; Data Mining Algorithm, Data Mining Algorithm, Data Mining Algorithm, …; Data source, Data source, Data source]

Distributed Data Source Definitions
  Homogeneous
    – Contain the same set of attributes across distributed data sites
  Heterogeneous
    – Define different sets of attributes across distributed data sites

Distributed Data Mining Techniques
  Distributed classifier learning
    – Meta-learning framework
    – Distributed learning with knowledge probing
  Collective data mining
  Distributed clustering
  Distributed association rule mining
  Others

Meta-learning
  Chan, Florida Institute of Technology & Stolfo, Columbia University
  “Base classifiers” and a “meta-classifier”
  Meta-learning rules: voting, arbitrating, and combining
  Scalability, efficiency, portability, compatibility, adaptivity, extensibility, and effectiveness
  For heterogeneous data sites, apply bridging methods

Meta-learning Framework
  [Diagram labels: Training Data, Learning Algorithm, Classifier, Prediction (two parallel branches); Validation Data; Meta-level Training Data; Meta-learning (Arbitration and Combining); Final Classifier System]

Distributed Learning with Knowledge Probing
  Guo & Sutiwaraphun, Imperial College
  Objective: distributed classification
  Meta-learning based technique
  Applied to homogeneous data sites
  Knowledge probing: extract descriptive knowledge from a black-box model by learning from a new data set whose classes are assigned by that model

DLKP (Cont.)
  [Diagram labels: Prediction Scheme; Final Model; Local Model 1, Local Model 2, Local Model 3; Local Model Derivation (one per site), …; Data source 1, Data source 2, Data source k; Probing set; Probing Strategy]

Collective Data Mining (CDM)
  Kargupta, University of Maryland & Park, Washington State University
  Objective: predictive data modeling
  Applied to heterogeneous (vertically partitioned) data sites
  Foundation: any function can be represented in a distributed fashion using an appropriate set of orthonormal basis functions
  Example: Collective Principal Component Analysis (CPCA)

CDM Framework
  Step 1: Generate approximate orthonormal basis coefficients at each local site
  Step 2: Move a chosen sample of the data sets from each site to a single site; generate approximate basis coefficients corresponding to the non-linear cross terms
  Step 3: Combine the local models; transform the result into the user-described representation; output the model

Distributed Clustering
  Derived from parallel center-based clustering algorithms, such as k-means
  Applied to homogeneous scenarios
  Two basic approaches
    – Approximate the underlying distance measure by aggregation
    – Provide the exact measure by data broadcasting

Distributed Association Rule Mining
  Two main approaches
    – Count Distribution (CD)
      • Data is partitioned homogeneously across several data sites
    – Data Distribution (DD)
      • Maximizes parallelism

Applications of Distributed Data Mining
  Credit card fraud detection
  Intrusion detection
  Information retrieval from the Internet
  Ad hoc sensor networks

Challenges of Distributed Data Mining
  Real-time distributed data mining
  Adapting to changing environments, new data, and new patterns
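
Example: meta-learning by voting
  The Meta-learning slide lists voting as one rule for merging base classifiers. The sketch below illustrates that rule only; arbitrating and combining are not shown. The nearest-centroid base learner, the function names, and the toy data are assumptions chosen for brevity, not anything taken from Chan and Stolfo's work.

```python
from collections import Counter


def train_nearest_centroid(samples):
    """Train one base classifier from a site's (features, label) pairs."""
    sums, counts = {}, Counter()
    for features, label in samples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    centroids = {
        label: [total / counts[label] for total in acc]
        for label, acc in sums.items()
    }

    def classify(features):
        # Predict the label whose centroid is closest (squared Euclidean distance).
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(features, centroids[label])
            ),
        )

    return classify


def vote(base_classifiers, features):
    """Meta-level rule: return the label predicted by most base classifiers."""
    predictions = [classify(features) for classify in base_classifiers]
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical data: three homogeneous sites sharing the same two attributes.
sites = [
    [((1.0, 1.0), "legit"), ((5.0, 5.0), "fraud")],
    [((0.0, 2.0), "legit"), ((6.0, 4.0), "fraud")],
    [((2.0, 0.0), "legit"), ((4.0, 6.0), "fraud")],
]
base_classifiers = [train_nearest_centroid(site) for site in sites]
print(vote(base_classifiers, (0.5, 0.5)))
print(vote(base_classifiers, (5.5, 5.0)))
```

  Only the trained classifiers leave each site, never the raw records, which is the property the distributed framework relies on.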
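
Example: knowledge probing
  The knowledge-probing idea (label a fresh data set with the black-box model, then learn a descriptive model from those labels) can be sketched in a few lines. The slides do not say which learners or probing strategy are used, so everything concrete here is an assumption: the one-attribute threshold rule ("stump"), the majority-vote black box, and the toy data.

```python
def fit_stump(labeled):
    """Return the threshold t that minimizes errors of the rule `x >= t`
    on a list of (x, bool_label) pairs."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in labeled}):
        err = sum((x >= t) != label for x, label in labeled)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def black_box(thresholds, x):
    """Opaque ensemble: majority vote over the local threshold rules."""
    votes = sum(x >= t for t in thresholds)
    return 2 * votes > len(thresholds)


# Hypothetical local data at three homogeneous sites: (attribute, class).
local_sites = [
    [(1, False), (2, False), (6, True), (7, True)],
    [(0, False), (3, False), (4, True), (9, True)],
    [(2, False), (4, False), (5, True), (8, True)],
]

# Local model derivation: one stump per site.
local_thresholds = [fit_stump(site) for site in local_sites]

# Probing: label an unlabeled probing set with the black-box ensemble ...
probing_set = list(range(10))
probed = [(x, black_box(local_thresholds, x)) for x in probing_set]

# ... then learn a single descriptive model from the probed labels.
final_threshold = fit_stump(probed)
print(f"final rule: x >= {final_threshold}")
```

  The final model is a single readable rule that mimics the ensemble on the probing set, instead of a collection of classifiers that must be consulted together.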
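
Example: the three CDM steps on a toy problem
  The CDM Framework steps can be traced on a deliberately small case. This is a toy illustration under strong assumptions that are not in the slides: two attributes taking values -1 or +1, class values available at both sites, rows aligned by a shared index, and a transferred sample that covers every attribute combination equally. Under those assumptions the basis {1, x1, x2, x1*x2} is orthonormal and each coefficient is a plain average; real CDM works with approximate coefficients.

```python
from itertools import product


def target(x1, x2):
    # Hypothetical function to be recovered; used only to generate labels.
    return 1.0 + 2.0 * x1 - 3.0 * x2 + 0.5 * x1 * x2


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Eight rows: every (x1, x2) combination appears exactly twice.
rows = list(product([-1, 1], repeat=2)) * 2
labels = [target(x1, x2) for x1, x2 in rows]

# Vertical partition: site A holds x1, site B holds x2 (labels at both).
site_a = [(x1, y) for (x1, _), y in zip(rows, labels)]
site_b = [(x2, y) for (_, x2), y in zip(rows, labels)]

# Step 1: each site computes the coefficients of its own basis functions.
w0 = mean(y for _, y in site_a)          # constant term
w1 = mean(x1 * y for x1, y in site_a)    # coefficient of x1 (site A)
w2 = mean(x2 * y for x2, y in site_b)    # coefficient of x2 (site B)

# Step 2: move a small sample (rows 0..3, one per combination) to one site
# and estimate the coefficient of the cross term x1 * x2 there.
sample_ids = range(4)
w12 = mean(site_a[i][0] * site_b[i][0] * site_a[i][1] for i in sample_ids)


# Step 3: combine the local and cross-term coefficients into one model.
def model(x1, x2):
    return w0 + w1 * x1 + w2 * x2 + w12 * x1 * x2


print(w0, w1, w2, w12)
```

  Only four of the eight rows had to be brought together, and only for the cross term; the single-attribute coefficients never required joining the two sites.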
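
Example: center-based clustering over homogeneous sites
  The Distributed Clustering slide points to parallel k-means as the starting point. One common way to parallelize k-means is for each site to send only per-cluster sums and counts to a coordinator, which then recomputes the centroids. The sketch below shows that scheme; it is an assumption about how such an algorithm might look, not necessarily either of the two approaches named on the slide, and all names and data are hypothetical.

```python
def local_statistics(points, centroids):
    """At one site: assign each point to its nearest centroid and return
    per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
        )
        counts[nearest] += 1
        for i, value in enumerate(point):
            sums[nearest][i] += value
    return sums, counts


def aggregate(stats, centroids):
    """At the coordinator: merge the sites' statistics into new centroids."""
    dim = len(centroids[0])
    updated = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in stats)
        if total == 0:
            updated.append(list(old))  # keep an empty cluster where it was
        else:
            updated.append(
                [sum(sums[k][i] for sums, _ in stats) / total for i in range(dim)]
            )
    return updated


def distributed_kmeans(sites, centroids, rounds=10):
    for _ in range(rounds):
        stats = [local_statistics(points, centroids) for points in sites]
        centroids = aggregate(stats, centroids)
    return centroids


# Hypothetical data: three sites, same two attributes at each.
sites = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
    [(2.0, 0.0), (12.0, 10.0)],
]
centroids = distributed_kmeans(sites, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

  Each round transfers a fixed amount of data per site (k sums and k counts), independent of how many points the site holds.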
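
Example: Count Distribution for frequent itemsets
  In Count Distribution, every site counts the same candidate itemsets over its own partition, and only the counts are exchanged and summed. The sketch below is a simplified level-wise miner under stated assumptions: support is an absolute count, the item universe is gathered up front, candidates are generated by a plain join without the usual Apriori pruning step, and the sites are simulated as lists in one process. Function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """At one site: count how many local transactions contain each candidate."""
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:
                counts[candidate] += 1
    return counts


def count_distribution(sites, min_support):
    """Return {itemset: global count} for itemsets with count >= min_support."""
    universe = sorted({item for site in sites for t in site for item in t})
    candidates = [frozenset([item]) for item in universe]
    frequent = {}
    size = 1
    while candidates:
        # Every site counts the same candidates; only counts are summed.
        total = Counter()
        for site in sites:
            total.update(local_counts(site, candidates))
        level = {c: n for c, n in total.items() if n >= min_support}
        frequent.update(level)
        # Join step: unions of frequent itemsets that grow the size by one.
        size += 1
        candidates = list(
            {a | b for a, b in combinations(list(level), 2) if len(a | b) == size}
        )
    return frequent


# Hypothetical transactions, partitioned across two sites.
sites = [
    [["bread", "milk"], ["bread", "butter"]],
    [["bread", "milk", "butter"], ["milk"]],
]
frequent = count_distribution(sites, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

  Data Distribution differs in that the candidate set itself is split across sites, so each site counts different candidates and must see the other sites' data.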