Distributed Model-Based Learning
PhD student: Zhang, Xiaofeng
I. Model-Based Learning
• Methods used in Data Clustering
– dimension reduction
• 1. Linear methods: SVD, PCA, Kernel PCA, etc.
• 2. Pairwise distance methods: Multidimensional scaling (MDS), etc.
• 3. Topographic maps: Elastic net, SOM, generative topographic mapping (GTM), etc.
• 4. Manifold learning: LLE, etc.
• Characteristics:
– Copes with incomplete data
– Better at explaining the data
– Visualization
• GTM as an example
– Gaussian distribution over the dataset
p(t_i \mid z_k, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left\{ -\frac{\beta}{2} \lVert y(z_k; W) - t_i \rVert^2 \right\}
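A minimal NumPy sketch of this component density and the per-point responsibilities it induces; the function names are illustrative, and the mapped centres y(z_k; W) are assumed to be precomputed.

```python
import numpy as np

def gtm_component_density(t, y_k, beta):
    """Isotropic Gaussian p(t | z_k, W, beta) centred on a mapped latent point.

    t    : (D,) observed data point t_i
    y_k  : (D,) image y(z_k; W) of latent point z_k under the mapping
    beta : scalar inverse variance
    """
    D = t.shape[0]
    norm = (beta / (2.0 * np.pi)) ** (D / 2.0)
    return norm * np.exp(-0.5 * beta * np.sum((y_k - t) ** 2))

def responsibilities(t, Y, beta):
    """Posterior responsibility of each of the K latent points for data point t,
    as used in the GTM E-step. Y is a (K, D) array of mapped centres y(z_k; W)."""
    p = np.array([gtm_component_density(t, y_k, beta) for y_k in Y])
    return p / p.sum()
```

With a flat prior over the K latent points, the GTM likelihood of t_i is simply the average of these component densities.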
• Collaborative Filtering using GTM:
– Dataset: movie data (Romance vs. Action), ratings on [0~1]
– Each color represents a class of movie (blue: Action, pink: Romance)
– Visualized in a 2-D latent plane
• Centralized GTM in CF:
– Centralized dataset
• Large scale, billions of records
• Expensive to maintain
• Distributed Requirements
• Security concerns: bank, government, military
• Privacy sensitivity: bank, commercial site, personal site
• Scalability
• Expensive to centralize
• Real-time, high-volume data streams
• Distributed learning of statistical models is therefore an important issue
II. Related Work
• Distributed Information Retrieval
– Globally building a P2P network
– Locally routing a query
– Globally matching the query to a distributed
dataset
• Distributed Data Mining
– Partitioning of the dataset
• Horizontal or homogeneous
– Attributes are the same across partitions
• Vertical or heterogeneous
– Attributes differ across partitions
– Approaches:
• Distributed KNN
• Density-based
• Distributed Bayesian networks
– For example: a global virtual table is built for a vertical partition
• Approaches to distributed learning:
– Mediator based
– Agent based
– Grid based
– Middleware based
– Density-based
– Model-based
III. Our Approach
• Problem review
– Three local models
– Globally merge the local models
– Merge again or not?
– Sparse local data
– An underlying global model
• A related approach (sketched below)
– Artificial data
– A Gaussian Mixture Model over the global dataset
– MCMC sampling to learn the local models
– From the average of the local models, learn the global model
– Privacy cost distribution: a Gaussian distribution
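As a rough illustration of this artificial-data idea, the sketch below assumes each site fits a local Gaussian mixture and releases only synthetic samples drawn from it (a simple stand-in for the MCMC sampling step); a coordinator then fits one global GMM on the pooled artificial data. Function names, component counts, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def local_to_artificial(local_data, n_components=3, n_artificial=500):
    """Fit a local mixture model and release only artificial samples drawn from it."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(local_data)
    samples, _ = gmm.sample(n_artificial)
    return samples  # no raw records leave the site

def learn_global_model(site_datasets, n_global_components=4):
    """Coordinator: pool the artificial samples and learn a single global GMM."""
    artificial = np.vstack([local_to_artificial(d) for d in site_datasets])
    return GaussianMixture(n_components=n_global_components, random_state=0).fit(artificial)

# Toy usage with two "sites" holding 2-D data.
rng = np.random.default_rng(0)
sites = [rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in (0.0, 5.0)]
global_model = learn_global_model(sites)
```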
• Density based merging approach
– The combined global model: p(x_t) = \sum_{i=1}^{K} \alpha_i \, p_i(x_t)
– K: the number of components
• p_i(x_t): a Gaussian component
• \alpha_i: the mixture weight, satisfying \sum_{i=1}^{K} \alpha_i = 1
• Merging criteria
– Q = argmax(Lij) + argmin(Cosij)
– Lij: likelihood measure
– Cosij: privacy cost between the two models
• Two considerations:
– Privacy cost
– Likelihood of data being generated by the other model
• Steps:
• Locally learn the models.
• Merge according to the likelihood and the privacy control.
• Merging stops when no clusters are density-connected.
• Learn the parameters of the global GMM with K components, etc. (see the sketch after this list).
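A minimal sketch of this merging loop, assuming each local model is summarized as weighted Gaussian components (weight, mean, covariance) and that the privacy cost Cosij is supplied by the caller; the cross-likelihood of component means and the fixed score threshold are rough stand-ins for the Lij measure and the density-connectedness test, not the exact criteria.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pair_score(ci, cj, privacy_cost):
    """Score a candidate merge: high cross-likelihood, low privacy cost.
    ci, cj are (weight, mean, cov) tuples describing local Gaussian components."""
    wi, mi, Si = ci
    wj, mj, Sj = cj
    # L_ij proxy: likelihood of each component's mean under the other component.
    lij = multivariate_normal(mj, Sj).logpdf(mi) + multivariate_normal(mi, Si).logpdf(mj)
    return lij - privacy_cost(ci, cj)

def merge_pair(ci, cj):
    """Moment-matched merge of two weighted Gaussian components."""
    wi, mi, Si = ci
    wj, mj, Sj = cj
    w = wi + wj
    m = (wi * mi + wj * mj) / w
    S = (wi * (Si + np.outer(mi - m, mi - m)) + wj * (Sj + np.outer(mj - m, mj - m))) / w
    return (w, m, S)

def merge_components(components, privacy_cost, score_threshold):
    """Greedily merge the best-scoring pair until no pair clears the threshold."""
    comps = list(components)
    while len(comps) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(comps)):
            for b in range(a + 1, len(comps)):
                s = pair_score(comps[a], comps[b], privacy_cost)
                if s > best:
                    best, best_pair = s, (a, b)
        if best < score_threshold:          # stand-in for "no longer density-connected"
            break
        a, b = best_pair
        merged = merge_pair(comps[a], comps[b])
        comps = [c for k, c in enumerate(comps) if k not in (a, b)] + [merged]
    return comps
```

The surviving components, with their weights re-normalized, then serve as the K components of the global GMM.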
• Hierarchical Approach
• Six local models
• Merge according to the similarity measure
• Each level can be controlled by the privacy cost
• Bottom-up learning of a hierarchical model
• After a global model is learned, changing the privacy control level can change the model
• Model selection
– Simij = Dist(Cost(Di), Cost(Dj)) < Const
• Cost(Di): transforms a dataset using the cost function
• Dist(x, y): computes the distance between the two transformed datasets
• If the distance is smaller than the threshold, merge
• Steps:
• 1. Learn a local model from each local dataset.
• 2. Based on the predefined privacy control function, merge the local models to form a hierarchical global model (see the sketch after this list).
• 3. Relabel the local models according to the changed privacy level.
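A compact sketch of the Simij test and the bottom-up merge, assuming purely for illustration that Cost(Di) is a noise-perturbed summary of each local dataset and that Dist is Euclidean distance; only one level of the hierarchy is built here.

```python
import numpy as np

def cost_transform(data, noise_scale, rng):
    """Cost(D_i): a privacy-controlled summary of a local dataset -- here,
    hypothetically, its mean perturbed by Gaussian noise at the given level."""
    return data.mean(axis=0) + rng.normal(scale=noise_scale, size=data.shape[1])

def merge_level(local_datasets, noise_scale, const, seed=0):
    """One bottom-up level: merge i and j whenever
    Sim_ij = Dist(Cost(D_i), Cost(D_j)) < Const."""
    rng = np.random.default_rng(seed)
    groups = [[i] for i in range(len(local_datasets))]
    summaries = [cost_transform(d, noise_scale, rng) for d in local_datasets]
    merged = True
    while merged and len(groups) > 1:
        merged = False
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                if np.linalg.norm(summaries[a] - summaries[b]) < const:
                    groups[a] = groups[a] + groups[b]          # merged node of the hierarchy
                    summaries[a] = (summaries[a] + summaries[b]) / 2.0
                    del groups[b], summaries[b]
                    merged = True
                    break
            if merged:
                break
    return groups
```

Raising noise_scale (a stricter privacy level) changes which local models end up grouped together, which is the relabelling effect described in step 3.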
• Privacy Control by Data Sampling
• Previously, privacy was controlled through the privacy function
• Here, the privacy-sensitive datasets themselves are controlled (a sampling sketch follows this list)
– D1' = D1 ∪ Oa21(D2) ∪ Oa31(D3) ∪ Oa41(D4)
– D2' = Oa12(D1) ∪ D2 ∪ Oa32(D3) ∪ Oa42(D4)
– ...
– Oa12: sampling operator over a dataset
– New local datasets are reconstructed by sampling from the other local datasets at some privacy control level
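A small sketch of the Di' construction, assuming the operator Oa simply releases a random fraction a of another site's records; the rate matrix encodes the per-pair privacy control levels, and all names are illustrative.

```python
import numpy as np

def sample_operator(data, rate, rng):
    """O_a(D): release a random fraction `rate` of dataset D (the privacy control level)."""
    n = int(round(rate * len(data)))
    idx = rng.choice(len(data), size=n, replace=False)
    return data[idx]

def reconstruct_local_datasets(datasets, rates, seed=0):
    """Build D_i' = D_i  U  O_{a_ji}(D_j) for all j != i, as on the slide.
    rates[j][i] is the sampling rate a_ji applied when site j shares with site i."""
    rng = np.random.default_rng(seed)
    new_datasets = []
    for i, d_i in enumerate(datasets):
        parts = [d_i]
        for j, d_j in enumerate(datasets):
            if j != i:
                parts.append(sample_operator(d_j, rates[j][i], rng))
        new_datasets.append(np.vstack(parts))
    return new_datasets
```

Setting rates[j][i] close to 0 keeps site j's data essentially private from site i; each site then learns its local model on the enriched D_i'.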
• P2P Approach
– Local small worlds of the network
– A local global model per small world
– Local network information stored in each node
– Trust propagation to connected nodes
– Knowledge passed to connected small worlds
• Algorithm (a toy sketch follows the steps):
• 1. Learn a global model for each small world of local nodes.
• 2. Pass the global information back to each node in the small world.
• 3. Nodei passes its trust relationship to its connected outer small-world nodes at a certain value.
• 4. The connected nodes merge their local model with the new knowledge into another model.
• 5. Update the connected global model knowledge and propagate it to all the local models in that small world.
• 6. Sum all the knowledge L3 collected and update G2, then repeat steps 3-6 until the stopping criterion is satisfied: the iteration limit is reached or the global model changes little.
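The loop below is a toy rendering of steps 1-6, assuming each local model can be abstracted as a parameter vector and that "merging knowledge" is a trust-weighted average; it only shows the propagate-merge-repeat structure, not the actual model update.

```python
import numpy as np

def p2p_propagate(small_worlds, trust, max_iter=20, tol=1e-3):
    """small_worlds : list of lists of local model parameters (here, plain vectors)
    trust          : trust[a][b] is the weight used when world a passes knowledge to world b"""
    # Step 1: one global model per small world (here, the average of its local models).
    globals_ = [np.mean(world, axis=0) for world in small_worlds]
    for _ in range(max_iter):
        # Steps 3-4: each world receives trust-weighted knowledge from the other
        # worlds and merges it into its own global model.
        new_globals = []
        for b, g_b in enumerate(globals_):
            weights = [trust[a][b] for a in range(len(globals_)) if a != b]
            incoming = [trust[a][b] * globals_[a] for a in range(len(globals_)) if a != b]
            new_globals.append((g_b + sum(incoming)) / (1.0 + sum(weights)))
        # Steps 2 and 5: push the updated global model back to every node in its world.
        small_worlds = [[g.copy() for _ in world] for world, g in zip(small_worlds, new_globals)]
        # Step 6: stop when the global models change little or the iteration limit is hit.
        converged = max(np.linalg.norm(n - o) for n, o in zip(new_globals, globals_)) < tol
        globals_ = new_globals
        if converged:
            break
    return globals_

# Toy usage: two small worlds of three nodes each, with symmetric trust 0.5.
worlds = [[np.array([0.0, 0.0])] * 3, [np.array([4.0, 4.0])] * 3]
trust = [[0.0, 0.5], [0.5, 0.0]]
print(p2p_propagate(worlds, trust))
```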
IV. Model Evaluation
• Effectiveness criteria
– Precision
• How accurate the model is
– Recall
• How much of the relevant data the model covers (a small example follows this list)
• Efficiency criteria
– Communication cost
• Bandwidth is assumed to be the same
• Proportional only to the partition size
• Maximum data transferred
– Overhead
• Compare the three approaches with the centralized approach
– Complexity
• Computational complexity
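For the effectiveness criteria, a tiny sketch of precision and recall over the records a learned cluster claims versus those it should cover; the example numbers are made up.

```python
def precision_recall(predicted, relevant):
    """Precision: fraction of the model's assignments that are correct.
    Recall: fraction of the relevant data that the model covers."""
    predicted, relevant = set(predicted), set(relevant)
    hit = len(predicted & relevant)
    precision = hit / len(predicted) if predicted else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

# A cluster claims records {1, 2, 3, 4}; records {2, 3, 4, 5, 6} truly belong to it.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))   # (0.75, 0.6)
```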
V. Experiment Issues
• Another approach to the dataset
– Site vectors instead of document vectors
• Pick out meaningful representatives of the local models
• LLE vs. GTM, etc.
• Change the privacy distribution to control the shape of the global model
Question & Answer