Distributed Model-Based Learning
PhD student: Zhang, Xiaofeng

I. Model-Based Learning
• Methods used in data clustering (dimension reduction):
– 1. Linear methods: SVD, PCA, kernel PCA, etc.
– 2. Pairwise-distance methods: multidimensional scaling (MDS), etc.
– 3. Topographic maps: elastic net, SOM, generative topographic mapping (GTM), etc.
– 4. Manifold learning: LLE, etc.
• Characteristics:
– Cope with incomplete data
– Better at explaining the data
– Visualization
• GTM as an example — a Gaussian distribution over the dataset:
p(t_i \mid z_k, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left\{-\frac{\beta}{2}\,\lVert y(z_k; W) - t_i \rVert^2\right\}
• Collaborative filtering using GTM:
– Dataset: movie data, romance vs. action; ratings on movies in [0, 1]
– Each color represents a class of movie: blue = action, pink = romance
– Visualized in a 2-D plane
• Centralized GTM in CF — a centralized dataset:
– Large scale: billions of records
– Expensive to maintain
• Distributed requirements:
– Security concerns: banks, government, military
– Privacy-sensitive: banks, commercial sites, personal sites
– Scalability: centralization is expensive
– Real-time, high-volume data streams
• A distributed way of learning statistical models is therefore an important issue.

II. Related Work
• Distributed information retrieval:
– Globally building a P2P network
– Locally routing a query
– Globally matching the query against a distributed dataset
• Distributed data mining — partition of the dataset:
– Horizontal (homogeneous): attributes are the same across partitions
– Vertical (heterogeneous): attributes differ across partitions
– Approaches: distributed KNN, density-based methods, distributed Bayesian networks
• For example, a global virtual table is built for a vertical partition
• Approaches to distributed learning:
– Mediator-based, agent-based, grid-based, middleware-based, density-based, model-based

III. Our Approach
• Problem review:
– Three local models
– Globally merge the local models
– Merge again, or not?
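The GTM noise model from Part I can be sketched directly from the density formula above. This is a minimal illustration, not the proposal's implementation; the function names and the use of a single data point per call are my own choices, and the mapped latent points y(z_k; W) are assumed to be given as an array.

```python
import numpy as np

def gtm_density(t, y_k, beta):
    """Gaussian noise model of GTM:
    p(t | z_k, W, beta) = (beta / 2*pi)^(D/2) * exp(-beta/2 * ||y(z_k; W) - t||^2),
    where y_k is the image of latent point z_k under the mapping y(.; W)."""
    D = t.shape[0]
    sq_dist = np.sum((y_k - t) ** 2)
    return (beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq_dist)

def responsibilities(t, Y, beta):
    """Posterior responsibility of each latent point z_k for data point t
    (the E-step quantity in GTM's EM training). Y: (K, D) array of y(z_k; W)."""
    p = np.array([gtm_density(t, y_k, beta) for y_k in Y])
    return p / p.sum()
```

For example, a data point lying exactly on one mapped latent point receives almost all of the responsibility mass, which is what makes the posterior over the 2-D latent grid usable for the movie-data visualization described above.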
– Sparse local data
– An underlying global model
• A related approach:
– Artificial data: a Gaussian mixture model (GMM) over the global dataset
– MCMC sampling to learn each local model
– From the averaged local models, learn the global model
– Privacy cost distribution: a Gaussian distribution
• Density-based merging approach — the combined global model:
p(x_t) = \sum_{i=1}^{K} \alpha_i \, p_i(x_t)
– K: the number of components
– p_i(x_t): a Gaussian component
– \alpha_i: the component weights, satisfying \sum_{i=1}^{K} \alpha_i = 1
• Merging criterion:
– Q = argmax(L_ij) + argmin(Cos_ij)
– L_ij: likelihood measure
– Cos_ij: privacy cost between two models
– Two considerations: the privacy cost, and the likelihood that data generated by one model is explained by the other
• Steps:
• 1. Locally learn the models.
• 2. Merge according to the likelihood and the privacy control.
• 3. Stop merging when no clusters are density-connected.
• 4. Learn the parameters of the global GMM (K, etc.).
• Hierarchical approach:
– Six local models
– Merge according to the similarity measure
– Each level can be controlled by the privacy cost
– Bottom-up learning of a hierarchical model
– After a global model is learned, changing the privacy control level changes the model
• Model selection:
– Sim_ij = Dist(Cost(D_i), Cost(D_j)) < Const
– Cost(D_i): transform the dataset using the cost function
– Dist(x, y): the distance between the two transformed datasets
– Merge when the distance is smaller than the threshold
• Steps:
• 1. Learn a local model from the local dataset.
• 2. Based on the predefined privacy control function, merge local models to form a hierarchical global model.
• 3. Relabel the local models according to the changed privacy level.
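The merging criterion Q = argmax(L_ij) + argmin(Cos_ij) can be sketched as a pairwise search over local Gaussian models. This is only an illustration under my own assumptions: each local model is a single Gaussian (mean, covariance), L_ij is taken as the mean log-likelihood of site j's data under site i's model, and the two terms are combined as a simple difference, since the proposal does not spell out the exact trade-off.

```python
import numpy as np

def log_likelihood_under(model_i, data_j):
    """Mean log-likelihood of site j's data under site i's Gaussian model
    (the L_ij likelihood measure, under the single-Gaussian assumption)."""
    mean, cov = model_i
    d = mean.shape[0]
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    diff = data_j - mean
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return np.mean(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))

def best_merge_pair(models, datasets, privacy_cost):
    """Pick the pair (i, j) maximizing likelihood L_ij while minimizing
    the privacy cost Cos_ij; combining them as a difference is an
    assumed design choice, not the proposal's exact formula."""
    best, best_score = None, -np.inf
    K = len(models)
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            score = log_likelihood_under(models[i], datasets[j]) - privacy_cost[i][j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

Repeating this selection, and stopping once no pair remains density-connected, gives the bottom-up merge loop described in the steps above.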
• Privacy control by data sampling:
– Previously the privacy function was controlled; here we control the privacy-sensitive dataset directly:
D1' = D1 ∪ O_a21(D2) ∪ O_a31(D3) ∪ O_a41(D4)
D2' = O_a12(D1) ∪ D2 ∪ O_a32(D3) ∪ O_a42(D4)
…
– O_a12: an operator over the dataset
– New local datasets are reconstructed by sampling from the other local datasets at some privacy control level
• P2P approach:
– Local small worlds of the network
– A local global model per small world
– Each node stores its local network information
– Trust propagates to connected nodes
– Knowledge is passed to connected small worlds
• Algorithm:
• 1. Learn a global model for each small world of local nodes.
• 2. Pass global information back to each node in this small world.
• 3. Node_i passes its trust relationship to its connected outer small-world nodes at a certain value.
• 4. The connected nodes merge the local model with the new knowledge from the other model.
• 5. Update the connected global-model knowledge, and propagate it to all the local models in this small world.
• 6. Sum all the collected knowledge L3, update G2, then repeat steps 3–6 until the loop criterion is satisfied: the iteration limit is reached or the global model changes little.

IV. Model Evaluation
• Effectiveness criteria:
– Precision: how accurate a model can be
– Recall: how much of the right data the model covers
• Efficiency criteria:
– Communication cost: with bandwidth held equal, proportional only to partition size; also the maximum data transferred
– Overhead: compare the three approaches with the centralized way
– Complexity: computational complexity

V. Experiment Issues
• Another approach for the dataset: site vectors instead of document vectors
• Pick out meaningful representatives of local models
• LLE vs. GTM, etc.
• Change the privacy distribution to control the shape of the global model

Question & Answer
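The dataset-reconstruction step of the sampling-based privacy control (D1' = D1 ∪ O_a21(D2) ∪ …) can be sketched as follows. The proposal does not define the operator O_a; this sketch assumes it releases a random fraction a (the privacy control level) of another site's records, sampled without replacement, and the function names are mine.

```python
import numpy as np

def O(data, a, rng):
    """Hypothetical privacy operator O_a: release a fraction `a` of
    another site's records, sampled without replacement. a = 0 releases
    nothing (full privacy); a = 1 releases everything."""
    n = max(0, int(round(a * len(data))))
    idx = rng.choice(len(data), size=n, replace=False)
    return data[idx]

def reconstruct_local(datasets, k, levels, rng):
    """Build D_k' = D_k  U  O_{a_jk}(D_j) for all j != k, where
    levels[j][k] is the privacy control level a_jk between sites j and k."""
    parts = [datasets[k]]
    for j, D_j in enumerate(datasets):
        if j != k:
            parts.append(O(D_j, levels[j][k], rng))
    return np.vstack(parts)
```

Each site then learns its local model on the enlarged D_k', so the privacy level directly controls how much foreign data shapes the local, and ultimately the global, model.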