Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks
Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I.
In IEEE Trans. on PAMI, 23(6), 2001.
Summarized by Kyu-Baek Hwang
Abstract

- Feature selection for unsupervised learning of conditional Gaussian networks
  - Unsupervised learning for Bayesian networks?
  - Which features are good for the learning task?
- Assessment of the relevance of each feature for the learning process
  - How to determine the threshold for cutting?
- Accelerate learning and still obtain reasonable models
  - Two artificial data sets
  - Two benchmark data sets from the UCI repository
Unsupervised Learning for Conditional Gaussian Networks

- Data clustering → learning the probabilistic graphical model from the unlabeled data
- Cluster membership → a hidden variable
- Conditional Gaussian networks
  - The cluster variable is an ancestor of all the other variables.
  - The joint probability distribution over all the other variables, given the cluster membership, is multivariate Gaussian.
- Feature selection in classification → feature selection in clustering
  - Eventually, all the features are considered, in order to describe the domain.
Conditional Gaussian Distribution

- Data clustering
  - X = (Y, C) = (Y_1, ..., Y_n, C)
- Conditional Gaussian distribution
  - The pdf of Y given C = c is multivariate Gaussian, $f(\boldsymbol{y} \mid c) = \mathcal{N}(\boldsymbol{y}; \mu(c), \Sigma(c))$, whenever p(c) = p(C = c) > 0, with $\Sigma(c)$ positive definite (a minimal sketch of evaluating such a model follows below).
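The definition above says that, for every cluster value c with positive probability, Y follows its own multivariate Gaussian, so the marginal over Y is a finite Gaussian mixture. A minimal sketch of evaluating such a model, assuming NumPy/SciPy; the parameter values (p_c, mu, Sigma) are purely illustrative and not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-cluster, 3-feature model: for each c with p(c) > 0,
# Y | C = c ~ N(mu[c], Sigma[c]) with Sigma[c] positive definite.
p_c = np.array([0.4, 0.6])                      # p(C = c)
mu = [np.zeros(3), np.array([4.0, 4.0, 4.0])]   # per-cluster mean vectors
Sigma = [np.eye(3), 2.0 * np.eye(3)]            # per-cluster covariance matrices

def density_y_given_c(y, c):
    """p(y | C = c): the multivariate Gaussian pdf of cluster c."""
    return multivariate_normal.pdf(y, mean=mu[c], cov=Sigma[c])

def density_y(y):
    """Marginal p(y) = sum_c p(c) p(y | c): a finite mixture of Gaussians."""
    return sum(p_c[c] * density_y_given_c(y, c) for c in range(len(p_c)))

print(density_y(np.array([1.0, 2.0, 3.0])))
```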
Conditional Gaussian Networks

- Factorization of the conditional Gaussian distribution
  - The conditional independencies among all the variables are encoded by the network structure s.
  - Each variable has a local probability distribution (one local density is sketched below).
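The factorization means that p(y | c) is a product of local densities, one per variable given its parents and the cluster value. A minimal sketch of one such local distribution under the common linear-Gaussian parameterization; the names m_ic, b_ic, v_ic are illustrative and not the paper's notation:

```python
import numpy as np

def local_density(y_i, y_parents, m_ic, b_ic, v_ic):
    """Local distribution p(y_i | pa(Y_i), C = c): a univariate Gaussian whose
    mean is linear in the values of the continuous parents pa(Y_i)."""
    mean = m_ic + float(np.dot(b_ic, y_parents))
    return np.exp(-0.5 * (y_i - mean) ** 2 / v_ic) / np.sqrt(2.0 * np.pi * v_ic)

# The joint p(y | c) encoded by a structure s is the product of these local
# densities over all the variables Y_1, ..., Y_n.
```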
An Example of CGNs

[Figure: an example CGN structure in which the cluster variable C is an ancestor of all the predictive variables.]
Learning CGNs from Data

- Incomplete data set d: N cases over the n + 1 variables, with the n predictive variables Y observed (O) and the cluster variable C hidden (H)
- Structural EM algorithm
Structural EM Algorithm

- Relaxed version: the data are completed with the cluster posteriors $p(c \mid \boldsymbol{y}^i, \hat{s}^l, \hat{\boldsymbol{\theta}}^l)$ computed under the model of iteration l.
- Expected score: candidate structures are scored on the expected complete data.
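A minimal sketch of the E-step implied by this slide: compute the cluster posteriors p(c | y^i, current model) that are used to complete the data before the structural search. For simplicity it assumes an unfactorized per-cluster Gaussian rather than an arbitrary structure s; all names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_posteriors(Y, p_c, mu, Sigma):
    """p(c | y^i, current structure and parameters) for every case i (rows of Y).
    These posteriors weight each case's contribution to the expected
    sufficient statistics used when scoring candidate structures."""
    N, K = Y.shape[0], len(p_c)
    post = np.zeros((N, K))
    for c in range(K):
        post[:, c] = p_c[c] * multivariate_normal.pdf(Y, mean=mu[c], cov=Sigma[c])
    post /= post.sum(axis=1, keepdims=True)   # normalize over the cluster values
    return post
```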
Scoring Metrics for the Structural Search

- The log marginal likelihood of the expected complete data
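The slide only names the score (the log marginal likelihood of the expected complete data). As an illustration only, a BIC-style penalized expected log likelihood is a common stand-in for such a score; this is not the paper's exact metric, and it again assumes an unfactorized per-cluster Gaussian:

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_bic_score(Y, post, p_c, mu, Sigma, num_params):
    """Expected complete-data log likelihood (cases weighted by the cluster
    posteriors `post`) minus a BIC complexity penalty."""
    N, K = Y.shape[0], len(p_c)
    ell = 0.0
    for c in range(K):
        log_joint = np.log(p_c[c]) + multivariate_normal.logpdf(Y, mean=mu[c], cov=Sigma[c])
        ell += np.sum(post[:, c] * log_joint)
    return ell - 0.5 * num_params * np.log(N)
```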
Feature Selection

- Large databases
  - Many instances
  - Many attributes
  → Dimensionality reduction is required
- Select features based on some criterion.
  - The criterion depends on the purpose of learning: learning speed, accurate predictions, and the comprehensibility of the learned models.
- Non-exhaustive search (the full space contains 2^n feature subsets)
  - Sequential selection (forward or backward)
  - Evolutionary, population-based, randomized search based on EDAs
Wrapper and Filter

- Wrapper
  - Feature subsets are tailored to the performance function of the learning process.
  - E.g., predictive accuracy on the test data set.
- Filter
  - Based on the intrinsic properties of the data set.
  - E.g., correlation between the class label and each attribute.
  → Both assume supervised learning.
- Two problems in unsupervised learning
  - Absence of the class label → a different criterion is needed for feature selection.
  - No standard accepted performance task → multiple predictive accuracy or class prediction.
Feature Selection in Learning CGNs

- Data analysis (clustering) → description, not prediction
  - All the features are necessary for the description.
- CGN learning with many features is a time-consuming task.
  - Preprocessing: feature selection
  - Learning CGNs
  - Postprocessing: addition of the other features as conditionally independent given the cluster membership
- The goal → how to measure the relevance
  - Fast learning time
  - Accuracy → log likelihood of the test data
Relevance

- Features that exhibit low correlation with the rest of the features can be considered irrelevant for the learning process.
  - Conditionally independent given the cluster membership.
- First trial in the continuous domain
Relevance Measure

- The relevance measure is based on edge exclusion tests.
  - Null hypothesis (edge exclusion test): $w_{ij} = 0$, i.e., Y_i and Y_j are conditionally independent given the rest of the variables.
  - Test statistic: $-N \log(1 - r^2_{ij \mid \mathrm{rest}})$, where $r_{ij \mid \mathrm{rest}}$ is the sample partial correlation of Y_i and Y_j given the rest.
  - Computed from the maximum likelihood estimates (MLEs) of the elements of the inverse variance matrix (see the sketch below).
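A minimal sketch of the measure described above, assuming NumPy: the partial correlations come from the MLE of the inverse variance (precision) matrix, and each pair of features is scored with the edge exclusion statistic -N log(1 - r^2); the function name is illustrative:

```python
import numpy as np

def edge_exclusion_statistics(Y):
    """Sample partial correlations r_ij|rest from the MLE of the inverse
    variance (precision) matrix, and the edge exclusion test statistic
    T_ij = -N * log(1 - r_ij|rest ** 2) for every pair of features."""
    N = Y.shape[0]
    W = np.linalg.inv(np.cov(Y, rowvar=False, bias=True))   # MLE precision matrix
    d = np.sqrt(np.diag(W))
    r = -W / np.outer(d, d)                                  # r_ij|rest = -w_ij / sqrt(w_ii * w_jj)
    np.fill_diagonal(r, 0.0)                                  # ignore the diagonal
    T = -N * np.log(1.0 - r ** 2)
    return r, T
```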
Graphical Gaussian Models (1/2)

Density of the (zero-mean) multivariate Gaussian in terms of the precision matrix $\Omega = \Sigma^{-1}$, $w = \mathrm{vec}(\Omega)$:

$$f(x; \Sigma) = (2\pi)^{-p/2}\,|\Omega|^{1/2} \exp\{-\tfrac{1}{2}\, x^T \Omega x\}$$

Log likelihood:

$$l(x; w) = c + \tfrac{1}{2}\log|\Omega| - \tfrac{1}{2}\,\mathrm{tr}(\Omega x x^T)
          = c + \tfrac{1}{2}\log|\Omega| - \tfrac{1}{2}\, w^T J s$$

where $J$ is the diagonal matrix of coefficients and $s = \mathrm{vec}(x x^T)$.

Score function:

$$U(w) = \frac{\partial l}{\partial w}
       = \tfrac{1}{2}\frac{\partial \log|\Omega|}{\partial w} - \tfrac{1}{2} J s
       = \tfrac{1}{2} J(\sigma - s), \qquad \sigma = \mathrm{vec}(\Sigma)$$
Graphical Gaussian Models (2/2)

Differentiating the score with respect to $w$ involves the mean-value mapping $\eta$ (with $\eta(w) = \mathrm{vec}(\Sigma)$):

$$\frac{\partial U(w)}{\partial w} = \tfrac{1}{2}\, J\, \frac{\partial \eta^T}{\partial w}$$

So the Fisher information $I$ follows from $\partial \eta^T / \partial w$ and $J$, and inverting it gives the asymptotic covariance of the MLEs:

$$\mathrm{cov}(\hat{w}_{ij}, \hat{w}_{rs}) = \frac{1}{N}\,(w_{ir} w_{js} + w_{is} w_{jr})$$

The sample partial correlation is obtained from the estimated precision matrix:

$$r_{12 \mid \mathrm{rest}} = -\hat{w}_{12}\,(\hat{w}_{11}\hat{w}_{22})^{-1/2}$$
Relevance Threshold

- Distribution of the test statistic under the null hypothesis
  - G(x): pdf of a $\chi^2_1$ random variable
  - 5 percent test: the threshold is the value leaving 5 percent of the probability in the tail.
  - Solving this equation for the threshold → an optimization problem (a rough stand-in is sketched below).
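The paper obtains the cut-off by numerically solving an approximation to the null distribution of the statistic. As a rough, hedged stand-in (not the paper's refined approximation), the plain 5 percent critical value of the chi-squared(1) reference distribution can be read off with SciPy:

```python
from scipy.stats import chi2

# Rough stand-in for the relevance threshold: the 5 percent critical value of a
# chi-squared(1) variable. The paper instead solves a refined approximation to
# the null distribution of -N log(1 - r^2) numerically (an optimization step).
threshold = chi2.ppf(0.95, df=1)
print(threshold)   # about 3.84
```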
Learning Scheme
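The figure for this slide is not reproduced in the transcript. Putting the previous slides together, a hypothetical outline of the scheme is sketched below; `learn_cgn_with_sem` and `attach_as_child_of_c` are placeholders for the structural EM learner and the postprocessing step, and `edge_exclusion_statistics` is the sketch from the Relevance Measure slide:

```python
import numpy as np

def learning_scheme(Y, threshold, learn_cgn_with_sem, attach_as_child_of_c):
    """Preprocessing (filter) -> CGN learning -> postprocessing, as described
    on the 'Feature Selection in Learning CGNs' slide."""
    # 1. Preprocessing: relevance of each feature from the edge exclusion statistics
    #    (edge_exclusion_statistics is defined in the earlier sketch).
    _, T = edge_exclusion_statistics(Y)
    relevance = T.max(axis=1)                       # one score per feature (illustrative choice)
    selected = np.flatnonzero(relevance > threshold)
    discarded = np.flatnonzero(relevance <= threshold)
    # 2. Learning: CGN over the selected features and the hidden cluster variable C.
    model = learn_cgn_with_sem(Y[:, selected])
    # 3. Postprocessing: add each discarded feature back, conditionally
    #    independent of the rest given the cluster membership C.
    for j in discarded:
        model = attach_as_child_of_c(model, Y[:, j])
    return model
```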
Experimental Settings

- Model specifications
  - Tree augmented Naïve Bayes (TANB) models
  - Predictive attributes may have, at most, one other predictive attribute as a parent.
- An example
  [Figure: an example TANB structure with the cluster variable C as a parent of the predictive attributes.]
Data Sets

- Synthetic data sets (4000 training : 1000 test cases)
  - TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1; C: uniform, (0, 1)
  - TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2; C: uniform, (0, 5)
- Waveform (artificial data) (4000 training : 1000 test cases)
  - 3 clusters, 40 attributes; the last 19 are noise attributes
- Pima
  - 768 cases (700 training : 68 test)
  - 8 attributes
Performance Criteria

- The log marginal likelihood of the training data
- The multiple predictive accuracy
  - A probabilistic approach to the standard multiple predictive accuracy
- Runtime
  - 10 independent runs for the synthetic data sets and the waveform data
  - 50 independent runs for the Pima data
  - On a Pentium 366 MHz machine
Relevance Ranking
Likelihood Plots for Synthetic Data
Likelihood Plots for Real Data
Runtime
Automatic Dimensionality Reduction
Conclusions and Future Work

- Relevance assessment for feature selection in unsupervised learning in the continuous domain
- Reasonable learning performance
- Extension to the categorical domain
- The redundant feature problem
- Relaxation of the model structure
- More realistic data sets