Overview of Distributed Data Mining
Xiaoling Wang
March 11, 2003

Data Mining
“We are drowning in information, but starving for knowledge.” - John Naisbitt
What is data mining?
– Closely related to knowledge discovery
– Discovering useful, usually unknown patterns from data
– Data: a set of facts F (e.g., cases in a database)
– Pattern: an expression E describing the facts in a subset F_E of F

Goals of Data Mining
Goals
– Prediction
– Description
Domains
– Induction, Compression, Querying, Approximation, Search

Basic Techniques of Data Mining
Basic techniques
– Clustering
– Association rule discovery
– Classification
– Sequential pattern discovery
– Outlier detection

Data Warehouse Architecture
[Figure, top to bottom: Data Mining Algorithm; Data Warehouse; Data Transformation & Integration; Extractors (one per source); Data sources]

Distributed Data Mining Framework
[Figure, top to bottom: Final Model; Local Model Aggregation; Local Models (one per site); Data Mining Algorithms (one per site); Data sources]

Distributed Data Source Definitions
Homogeneous
– The same set of attributes is defined at every distributed data site
Heterogeneous
– Different sets of attributes are defined at different distributed data sites

Distributed Data Mining Techniques
Distributed classifier learning
– Meta-learning framework
– Distributed learning with knowledge probing
Collective data mining
Distributed clustering
Distributed association rule mining
Others

Meta-learning
Chan, Florida Institute of Technology & Stolfo, Columbia University
“Base classifiers” and a “meta-classifier”
Meta-learning rules: voting, arbitrating, and combining
Design goals: scalability, efficiency, portability, compatibility, adaptivity, extensibility, and effectiveness
For heterogeneous data sites, apply bridging methods

Meta-learning Framework
[Figure labels: Training Data, Learning Algorithm, Classifier, Prediction (one chain per data site); Validation Data; Meta-level Training Data; Meta-learning (Arbitration and Combining); Final Classifier System]

Distributed Learning with Knowledge Probing
Guo & Sutiwaraphun, Imperial College
Objective: distributed classification
Meta-learning based technique
Applied to homogeneous data sites
Knowledge probing: extract descriptive knowledge from a black-box model by learning from a new data set whose classes are assigned by that model

DLKP (Cont.)
[Figure labels: Data source 1, Data source 2, …, Data source k; Local Model Derivation (one per source); Local Model 1, Local Model 2, Local Model 3; Probing set; Probing Strategy; Prediction Scheme; Final Model]

Collective Data Mining (CDM)
Kargupta, University of Maryland & Park, Washington State University
Objective: predictive data modeling
Applied to heterogeneous (vertically partitioned) data sites
Foundation: any function can be represented in a distributed fashion using an appropriate set of orthonormal basis functions
Example: Collective Principal Component Analysis (CPCA)

CDM Framework
Step 1: Generate approximate orthonormal basis coefficients at each local site
Step 2: Move a chosen sample of the data from each site to a single site; generate the approximate basis coefficients corresponding to non-linear cross terms
Step 3: Combine the local models; transform the result into the user-described representation; output the model

Distributed Clustering
Derived from parallel center-based clustering algorithms such as k-means
Applied to homogeneous scenarios
Two basic approaches
– Approximate the underlying distance measure by aggregation
– Provide the exact measure by data broadcasting

Distributed Association Rule Mining
Two main approaches
– Count Distribution (CD)
  • data is partitioned homogeneously across several data sites
– Data Distribution (DD)
  • maximizing parallelism

Applications of Distributed Data Mining
Credit card fraud detection
Intrusion detection
Information retrieval from the Internet
Ad hoc sensor networks

Challenges of Distributed Data Mining
Real-time distributed data mining
Adapting to changing environments, new data, and new patterns
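
Illustration: local models combined by voting
The framework slide (local models aggregated into a final model) and the voting rule listed under meta-learning can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Chan and Stolfo system: the nearest-centroid base learner, the function names (`train_nearest_centroid`, `predict`, `vote`), and the sample data are all hypothetical, and the sites are assumed to be homogeneous (same attributes everywhere).

```python
from collections import Counter


def train_nearest_centroid(rows):
    """Train a base classifier on one site's data (hypothetical base learner).

    rows: list of (features, label) pairs. Returns {label: centroid}.
    """
    sums, counts = {}, Counter()
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


def vote(models, features):
    """Aggregate the base classifiers' predictions by simple majority vote."""
    ballots = Counter(predict(m, features) for m in models)
    return ballots.most_common(1)[0][0]


# Three hypothetical homogeneous sites: every site records the same two attributes.
sites = [
    [((1.0, 1.0), "low"), ((9.0, 9.0), "high")],
    [((2.0, 1.5), "low"), ((8.0, 9.5), "high")],
    [((1.5, 2.0), "low"), ((9.5, 8.0), "high")],
]

# Each site trains locally; only the fitted models are brought together.
local_models = [train_nearest_centroid(rows) for rows in sites]
final_prediction = vote(local_models, (2.0, 2.0))
```

In this sketch only the per-class centroids leave each site, so the raw records never have to be pooled. Arbitrating and combining, the other two rules named on the meta-learning slide, would replace `vote` with a learned meta-classifier and are not shown here.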