Sharing Clusters Among Related Groups:
Hierarchical Dirichlet Processes
Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei
NIPS 2004
Presented by Yuting Qi
ECE Dept., Duke Univ.
08/26/05
Overview

Motivation
Dirichlet Processes
Hierarchical Dirichlet Processes
Inference
Experimental results
Conclusions
Motivation

Multi-task learning: clustering.
Goal: share clusters among multiple related clustering problems (model-based).
Approach:
Hierarchical;
Nonparametric Bayesian;
DP mixture model: learn a generative model over the data, treating the classes as hidden variables.
Dirichlet Processes

Let (Ө, B) be a measurable space, G0 be a probability measure on the space, and α be a positive real number.
A Dirichlet process is the distribution of a random probability measure G over (Ө, B) such that, for all finite partitions (A1, …, Ar) of Ө,
(G(A1), …, G(Ar)) ~ Dir(αG0(A1), …, αG0(Ar)).
G ~ DP(α, G0) if G is a random probability measure with distribution given by the Dirichlet process.

Draws G from a DP are discrete with probability one, and can be written in the stick-breaking form
G = Σk βk δӨk,
where the atoms Өk ~ G0 are i.i.d. (not necessarily distinct) and the weights βk are random and depend on α.
Properties:
E[G(A)] = G0(A), Var[G(A)] = G0(A)(1 − G0(A)) / (α + 1).
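The stick-breaking form above makes a truncated draw from a DP easy to simulate. Below is a minimal Python sketch (not from the slides); the base measure G0 = N(0, 1) and the truncation level K are illustrative assumptions.

```python
# Minimal sketch: a truncated stick-breaking draw G ~ DP(alpha, G0).
# Assumptions: truncation at K atoms; G0 supplied as a sampling function.
import numpy as np

def stick_breaking_dp(alpha, sample_g0, K=100, rng=None):
    """Return atoms theta_k ~ G0 and weights beta_k of a truncated draw from DP(alpha, G0)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=K)                               # v_k ~ Beta(1, alpha)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))   # beta_k = v_k * prod_{l<k} (1 - v_l)
    theta = sample_g0(K, rng)                                      # theta_k ~ G0, i.i.d.
    return theta, beta

# Illustrative use with G0 = N(0, 1) (an assumed choice):
theta, beta = stick_breaking_dp(alpha=1.0, sample_g0=lambda K, rng: rng.normal(0.0, 1.0, K))
```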
Chinese Restaurant Processes

CRP (the Pólya urn scheme):
Let Φ1, …, Φi−1 be i.i.d. random variables distributed according to G; let Ө1, …, ӨK be the distinct values taken on by Φ1, …, Φi−1, and let nk be the number of Φi' = Өk, 0 < i' < i. Integrating out G,
Φi | Φ1, …, Φi−1 ~ Σk nk / (i − 1 + α) δӨk + α / (i − 1 + α) G0.
This slide is from "Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process", NLP Group, Stanford, Feb. 2005.
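A minimal simulation of this predictive rule follows (a sketch under the CRP probabilities above; the value of alpha is an illustrative choice):

```python
# Sketch: seat n customers by the CRP / Polya urn predictive rule.
# Customer i joins existing table k w.p. n_k / (i + alpha) or a new table
# w.p. alpha / (i + alpha), where i customers are already seated.
import numpy as np

def crp(n_customers, alpha, rng=None):
    rng = np.random.default_rng(rng)
    assignments, counts = [], []                     # counts[k] = n_k, customers at table k
    for i in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                         # last slot = open a new table
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

print(crp(10, alpha=1.0))                            # e.g. [0, 0, 1, 0, 2, ...]
```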
DP Mixture Model

One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model:
G ~ DP(α0, G0)
Φi | G ~ G
xi | Φi ~ F(Φi)

Why is the DP not applied directly to density estimation? Because draws G are discrete, G itself is not a suitable density; it serves instead as a prior over the mixture's component parameters.
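A hedged generative sketch of this DP mixture, reusing the truncated stick-breaking construction from earlier; G0 = N(0, 25) and a unit-variance Gaussian F are assumed purely for illustration:

```python
# Sketch: generate n points from a (truncated) DP mixture of Gaussians.
import numpy as np

def dp_mixture_sample(n, alpha, K=100, rng=None):
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    beta /= beta.sum()                       # renormalize after truncation
    theta = rng.normal(0.0, 5.0, size=K)     # theta_k ~ G0 = N(0, 25) (assumed)
    z = rng.choice(K, size=n, p=beta)        # which atom each phi_i lands on
    x = rng.normal(theta[z], 1.0)            # x_i ~ F(phi_i) = N(phi_i, 1) (assumed)
    return x, z
```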
HDP – Problem statement

We have J groups of data, {Xj}, j = 1, …, J. For each group, Xj = {xji}, i = 1, …, nj.

In each group, the data Xj = {xji} are modeled with a mixture model. The mixing proportions are specific to the group.

Different groups share the same set of mixture components (the underlying clusters Өk), but each group mixes those components in its own proportions.

Goal:
Discover the distribution of mixture components within each group;
Discover the distribution of mixture components across groups.
HDP - General representation

G0: the global probability measure, G0 ~ DP(r, H); r is the concentration parameter, H is the base measure.
Gj: the probability distribution for group j, Gj ~ DP(α, G0).
Φji: the hidden parameter of the distribution F(Φji) corresponding to xji.

The overall model is:
G0 | r, H ~ DP(r, H)
Gj | α, G0 ~ DP(α, G0), j = 1, …, J
Φji | Gj ~ Gj
xji | Φji ~ F(Φji)

Two-level DPs.
HDP - General representation

G0 places nonzero mass only on the atoms Ө = {Өk}, where the Өk are i.i.d. random variables distributed according to H; thus
G0 = Σk βk δӨk,  Gj = Σk πjk δӨk,
so every group-level measure Gj reuses the same atoms Өk, each with its own group-specific weights πjk.
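A minimal generative sketch of this two-level construction (an assumption of this note, not code from the paper): the global weights β come from a truncated stick-breaking draw with concentration r, and each group's weights πj ~ Dirichlet(αβ) reweight the same K atoms; H = N(0, 25) and F = N(·, 1) are illustrative choices.

```python
# Sketch: two-level HDP draw over K shared atoms (truncated).
import numpy as np

def hdp_sample(group_sizes, r, alpha, K=20, rng=None):
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, r, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    beta /= beta.sum()                         # global weights of G0 (truncated)
    theta = rng.normal(0.0, 5.0, size=K)       # shared atoms theta_k ~ H = N(0, 25) (assumed)
    groups = []
    for nj in group_sizes:
        pi_j = rng.dirichlet(alpha * beta)     # group-specific reweighting of shared atoms
        z = rng.choice(K, size=nj, p=pi_j)
        groups.append(rng.normal(theta[z], 1.0))   # x_ji ~ F = N(theta, 1) (assumed)
    return groups, theta, beta
```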
HDP-CR franchise

First level: within each group, a DP mixture:
Gj ~ DP(α0, G0),  Φji | Gj ~ Gj,  xji | Φji ~ F(Φji).
Let Φj1, …, Φji−1 be i.i.d. random variables distributed according to Gj; let Ѱj1, …, ѰjTj be the values taken on by Φj1, …, Φji−1, and let njt be the number of Φji' = Ѱjt, 0 < i' < i.

Second level: across groups, sharing components:
The base measure of each group is itself a draw from a DP:
Ѱjt | G0 ~ G0,  G0 ~ DP(r, H).
Let Ө1, …, ӨK be the values taken on by the Ѱjt, and let mk be the number of Ѱjt = Өk over all j, t.
HDP-CR franchise

Values of Φji are shared among groups.
Integrating out G0, the table values follow the same Pólya-urn predictive rule one level up:
Ѱjt | Ѱ11, Ѱ12, …, Ѱjt−1 ~ Σk mk / (Σk' mk' + r) δӨk + r / (Σk' mk' + r) H.
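A small sketch of this second-level rule (variable names are illustrative): a new table serves an existing dish k with probability proportional to mk, or a brand-new dish drawn from H with probability proportional to r.

```python
# Sketch: the second-level predictive rule with G0 integrated out.
# m[k] = number of tables (over all groups) serving dish k.
import numpy as np

def sample_dish(m, r, rng=None):
    """Return a dish index; index len(m) means 'new dish drawn from H'."""
    rng = np.random.default_rng(rng)
    probs = np.array(list(m) + [r], dtype=float)
    return rng.choice(len(probs), p=probs / probs.sum())
```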
Inference - MCMC

Gibbs sampling the posterior in the CR franchise:

Instead of dealing directly with Φji and Ѱjt to get p(Φ, Ѱ | X), p(t, k, Ө | X) is obtained by sampling t, k, and Ө, where
t = {tji}: tji is the index of the table that Φji is associated with, so Φji = Ѱjtji;
k = {kjt}: kjt is the index of the value Өk that Ѱjt takes on, so Ѱjt = Өkjt.

Knowing the prior distributions given by the CR franchise, the posterior is sampled iteratively:
Sampling t;
Sampling k;
Sampling Ө.
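As one concrete piece of this sweep, a hedged sketch of the conditional for t (reseating one observation): under the franchise priors, an occupied table t is chosen with probability proportional to njt times the likelihood of xji under that table's dish, and a new table with probability proportional to α0 times the marginal likelihood of xji. All inputs below are assumptions of the sketch, not the paper's API.

```python
# Sketch of the "sampling t" step: unnormalized reseating probabilities for x_ji.
# Assumed inputs: table_counts[t] = n_jt excluding x_ji, table_lik[t] = f(x_ji | dish
# at table t), new_table_lik = marginal likelihood of x_ji at a new table.
import numpy as np

def table_probs(table_counts, table_lik, alpha0, new_table_lik):
    probs = np.append(np.asarray(table_counts, float) * np.asarray(table_lik, float),
                      alpha0 * new_table_lik)        # last entry: open a new table
    return probs / probs.sum()
```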
Experiments on the synthetic data

Data description:
We have three groups of data;
Each group is a Gaussian mixture;
Different groups can share the same clusters;
Each cluster has 50 2-D data points; the features are independent.
[Figure: Original data. Scatter plot of the three groups in the (x(1), x(2)) plane, with the numbered clusters each group draws on: Group 1: [1, 2, 3, 7]; Group 2: [3, 4, 5, 7]; Group 3: [5, 6, 1, 7].]
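A hedged sketch of how synthetic data matching this description could be generated: three groups, each a mixture over a shared pool of seven 2-D Gaussian clusters, 50 points per cluster, independent features. The cluster centers and spread below are illustrative assumptions; only the group-to-cluster lists come from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
means = {k: rng.uniform(0.0, 10.0, size=2) for k in range(1, 8)}   # assumed cluster centers
groups = {1: [1, 2, 3, 7], 2: [3, 4, 5, 7], 3: [5, 6, 1, 7]}       # from the figure
data = {j: np.vstack([rng.normal(means[k], 0.5, size=(50, 2)) for k in ks])
        for j, ks in groups.items()}                               # data[j]: (200, 2) array
```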
Experiments on the synthetic data

HDP definition:
Here, F(xji | Φji) is a Gaussian distribution with Φji = {μji, σji}; each Φji takes values on one of Өk = {μk, σk}, k = 1, ….
μ ~ N(m, σ²/β) and σ⁻² ~ Gamma(a, b), i.e., H is a Normal-Gamma joint distribution; m, β, a, b are given hyperparameters.

Goal:
Model each group as a Gaussian mixture;
Model the cluster distribution over groups.
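A sketch of one draw of (μk, σk²) from this Normal-Gamma base measure H, following the standard parameterization (precision ~ Gamma, mean conditionally normal); the hyperparameter values are placeholders, not those used in the experiments.

```python
import numpy as np

def sample_normal_gamma(m=0.0, beta=1.0, a=1.0, b=1.0, rng=None):
    rng = np.random.default_rng(rng)
    lam = rng.gamma(shape=a, scale=1.0 / b)          # precision sigma^{-2} ~ Gamma(a, b) (rate b)
    mu = rng.normal(m, np.sqrt(1.0 / (beta * lam)))  # mu | sigma^2 ~ N(m, sigma^2 / beta)
    return mu, 1.0 / lam                             # (mean, variance) of one component
```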
Experiments on the synthetic data

Results on the synthetic data

Global distribution:
[Figure: Estimated underlying distribution over all groups in the (x(1), x(2)) plane, and the corresponding global mixing proportions over groups, by component index. The number of components is open-ended; only part of it is shown.]
Experiments on the synthetic data

Mixture within each group:
[Figure: Mixing proportions within each group (1st, 2nd, and 3rd group mixture proportions over the data, by component index). The number of components in each group is also open-ended; only part of it is shown.]
Conclusions & discussions

This hierarchical Bayesian method can automatically determine the appropriate number of mixture components needed.
A set of DPs is coupled via a shared base measure to achieve component sharing among groups.
The priors are nonparametric, but this is not nonparametric density estimation.