Local one class optimization
Gal Chechik, Stanford
joint work with Koby Crammer,
Hebrew University of Jerusalem
The one-class problem:
Find a subset of similar/typical samples.
Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem).
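As a concrete reading of this objective, a minimal sketch (names are hypothetical; squared Euclidean distance is assumed as the metric):

    import numpy as np

    def ball_coverage(X, w, radius):
        """Count how many samples (rows of X) fall inside the ball of the
        given radius centred at w, using Euclidean distance."""
        dist2 = np.sum((X - w) ** 2, axis=1)      # squared distance of each sample to w
        return int(np.sum(dist2 <= radius ** 2))  # number of covered points

    # The one-class problem then asks for the centre w maximizing ball_coverage(X, w, radius).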
Motivation I
Unsupervised setting: Sometimes we wish to
model small parts of the data and ignore the
rest. This happens when many data points are
irrelevant.
Examples:
– Finding sets of co-expressed genes in genome-wide experiments: identify the relevant genes among thousands of irrelevant ones.
– Finding a set of documents on the same topic in a heterogeneous corpus.
Motivation II
Supervised setting:
Learning from positive samples only.
Examples:
– Protein interactions
– Intrusion detection applications
We care about a low false-positive rate.
Current approaches
Often treat the problem as outlier and novelty detection: most samples are relevant.
Current approaches use
– A convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al. 2001).
– A parameter that affects the size or weight of the ball.
• Bias towards the center of mass:
When searching for a small ball, the center of the optimal ball lies at the global center of mass,
w* = argmin_w Σ_x (x − w)²,
missing the interesting structures.
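For the L2 case this is immediate: setting the gradient of the quadratic cost to zero,

    d/dw Σ_x (x − w)² = −2 Σ_x (x − w) = 0  ⟹  w* = (1/m) Σ_x x,

i.e. the sample mean, independent of any local structure in the data.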
Current approaches
Example with synthetic data: 2 Gaussians + a uniform background.
[Figure: two panels comparing the convex one-class solution (OSU-SVM) with the local one-class solution]
How do we do it:
1. A cost function designed for small sets
2. A probabilistic approach: allow soft
assignment to the set
3. Regularized optimization
1. A cost function for small sets
• The case where only a few samples are relevant.
• Use a cost function that is flat for samples not in the set.
– Two parameters:
• Divergence measure B_F
• Flat cost K
– Indifferent to the position of “irrelevant” samples.
– Solutions converge to the center of mass when the ball is large.
2. A probabilistic formulation
• We are given m samples in a d-dimensional space or simplex, indexed by x: {v_x}, x = 1, …, m.
• p(x) is the prior distribution over samples.
• c ∈ {TRUE, FALSE} is a random variable that characterizes assignment to the interesting set (the “ball”).
• p(c|x) reflects our belief that sample x is “interesting”.
• The cost function will be
D = p(c|x) B_F(v_x || w) + (1 − p(c|x)) K
where B_F is a divergence measure, to be discussed later.
3. Regularized optimization
The goal: minimize the mean cost plus a regularization term

min_{p(c|x), w}  β <D_{B_F, K}(p(c|x), w; v_x)>_{p(c,x)} + I(C;X)

• The first term measures the mean distortion:
<D_{B_F, K}(p(c|x), w; v_x)> = Σ_x p(x) [ p(c|x) B_F(v_x || w) + (1 − p(c|x)) K ]
• The second term regularizes the compression of the data (removes information about X):
I(C;X) = H(X) − H(X|C)
It pushes towards putting many points in the set.
• This target function is not convex.
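A minimal numerical sketch of this target function; the function names, the uniform prior p(x), and the particular choice of B_F are illustrative assumptions only:

    import numpy as np

    def one_class_objective(p_c_given_x, w, V, K, beta, bregman):
        """beta * <mean distortion> + I(C;X) for soft assignments p(c|x).
        V: (m, d) array of samples v_x; bregman(v, w) -> scalar divergence."""
        m = len(V)
        px = np.full(m, 1.0 / m)                      # uniform prior p(x)
        d = np.array([bregman(v, w) for v in V])      # B_F(v_x || w) per sample
        distortion = np.sum(px * (p_c_given_x * d + (1.0 - p_c_given_x) * K))

        pc = np.sum(px * p_c_given_x)                 # marginal p(c = TRUE)
        eps = 1e-12                                   # avoid log(0)
        mi = np.sum(px * (p_c_given_x * np.log((p_c_given_x + eps) / (pc + eps))
                          + (1.0 - p_c_given_x) * np.log((1.0 - p_c_given_x + eps) / (1.0 - pc + eps))))
        return beta * distortion + mi                 # the regularized cost

    # e.g. with the squared-L2 divergence: bregman = lambda v, w: 0.5 * np.sum((v - w) ** 2)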
To solve the problem
• It turns out that for a family of divergence
functions, called Bregman divergences, we can
analytically describe properties of the optimal
solution.
• The proof follows the analysis of the
Information Bottleneck method
(Tishby, Pereira & Bialek, 1999)
Bregman divergences
• A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)):
B_F(v || w) = F(v) − [ F(w) + ∇F(w) · (v − w) ]
• Common examples:
– L2 norm: f(x) = ½ x²
– Itakura-Saito: f(x) = −log(x)
– D_KL: f(x) = x log(x)
– Unnormalized relative entropy: f(x) = x log x − x
• Lemma: Convexity of the Bregman Ball
The set of points {v : B_F(v || w) < R} is convex.
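A small sketch of these divergences, built directly from the definition above with an elementwise f; the helper names are illustrative:

    import numpy as np

    def bregman(f, f_prime, v, w):
        """B_F(v || w) = F(v) - F(w) - <grad F(w), v - w>, with F(v) = sum_i f(v_i)."""
        return float(np.sum(f(v) - f(w) - f_prime(w) * (v - w)))

    sq_l2   = lambda v, w: bregman(lambda x: 0.5 * x ** 2,  lambda x: x,             v, w)  # L2 norm
    itakura = lambda v, w: bregman(lambda x: -np.log(x),    lambda x: -1.0 / x,      v, w)  # Itakura-Saito
    kl      = lambda v, w: bregman(lambda x: x * np.log(x), lambda x: np.log(x) + 1, v, w)  # D_KL on the simplex

    v = np.array([0.2, 0.3, 0.5]); w = np.array([0.25, 0.25, 0.5])
    print(sq_l2(v, w), itakura(v, w), kl(v, w))   # all non-negative, zero iff v == w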
Properties of the solution
One-class solutions obey three fixed-point equations:

p(c) = Σ_x p(c|x) p(x)

w = (1 / p(c)) Σ_x p(x) p(c|x) v_x

p(c|x) = 1 / ( 1 + ((1 − p(c)) / p(c)) exp[ β (B_F(v_x || w) − K) ] )

When β → ∞,

lim p(c|x) = 1     if B_F(v_x || w) < K
           = p(c)  if B_F(v_x || w) = K
           = 0     if B_F(v_x || w) > K

so the best assignment for x achieves the per-sample loss
Loss = min( B_F(v_x || w), K )
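A minimal sketch of iterating these three equations to a fixed point; the uniform p(x), the squared-L2 choice of B_F, and all names are assumptions:

    import numpy as np

    def one_class_fixed_point(V, K, beta, n_iter=200):
        """Alternate the three fixed-point equations; returns (w, p_c_given_x)."""
        m, _ = V.shape
        px = np.full(m, 1.0 / m)                     # prior p(x), taken uniform
        p = np.full(m, 0.5)                          # initial soft assignment p(c|x)
        for _ in range(n_iter):
            pc = np.sum(px * p)                      # p(c) = sum_x p(c|x) p(x)
            w = (px * p) @ V / pc                    # centroid: weighted mean of the v_x
            d = 0.5 * np.sum((V - w) ** 2, axis=1)   # B_F(v_x || w) for the squared-L2 case
            p = 1.0 / (1.0 + (1.0 - pc) / pc * np.exp(beta * (d - K)))   # p(c|x) update
        return w, p

As β grows, the assignment hardens towards p(c|x) = 1 exactly when B_F(v_x || w) < K.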
The effect of K
• K controls the nature of the solution.
– It is the cost of leaving a point out of the ball.
– Large K => a large radius and many points in the set.
– For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset.
• A full description of the data may require solving for the complete spectrum of K values.
Algorithm: One-Class IB
Adapting the sequential-IB algorithm:
One-Class IB:
Input: a set of m points v_x, divergence B_F, cost K
Output: centroid w, assignment p(c|x)
Optimization method:
– Iterate sample-by-sample, trying to modify the status of a single sample.
– One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample.
– This uses a simple formula thanks to the nice properties of Bregman divergences.
– Search in the dual space of samples, rather than in the space of parameters w.
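A rough sketch of such a sequential, sample-by-sample search in the hard (β → ∞) regime, where each tentative flip is scored by the look-ahead loss Σ_x min(B_F(v_x || w), K). For simplicity it re-fits the centroid naively rather than using the efficient Bregman update; the squared-L2 divergence and all names are assumptions:

    import numpy as np

    def total_loss(V, in_set, K):
        """Sum over samples of min(B_F(v_x || w), K) for the centroid of the current set."""
        if not in_set.any():
            return K * len(V)                        # empty set: every point pays the flat cost
        w = V[in_set].mean(axis=0)
        d = 0.5 * np.sum((V - w) ** 2, axis=1)
        return float(np.sum(np.minimum(d, K)))

    def sequential_one_class_ib(V, K, n_sweeps=20, seed=0):
        rng = np.random.default_rng(seed)
        in_set = rng.random(len(V)) < 0.5            # random initial hard assignment
        best = total_loss(V, in_set, K)
        for _ in range(n_sweeps):
            changed = False
            for x in rng.permutation(len(V)):        # visit samples one by one
                in_set[x] = not in_set[x]            # tentatively flip the status of sample x
                loss = total_loss(V, in_set, K)      # one-step look-ahead: re-fit and score
                if loss < best:
                    best, changed = loss, True       # keep the flip
                else:
                    in_set[x] = not in_set[x]        # revert
            if not changed:
                break
        return in_set, best

Running this for a spectrum of K values traces out solutions from small, tight subsets up to the whole data set, as the previous slide suggests.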
Experiments 1: information retrieval
The five most frequent categories of Reuters-21578.
Each document is represented as a multinomial distribution over 2000 terms.
The experimental setup, for each category:
– train with half of the positive documents,
– test with all remaining documents.
We compared one-class IB with One-Class Convex, which uses a convex loss function (Crammer & Singer, 2003) and is controlled by a single parameter η that determines the weight of the class.
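A small sketch of this document representation (term counts normalized into a multinomial over the vocabulary), together with the KL divergence that matches it; the smoothing constant and names are assumptions:

    import numpy as np

    def to_multinomial(term_counts, alpha=0.01):
        """Turn a vector of term counts into a smoothed multinomial distribution."""
        counts = np.asarray(term_counts, dtype=float) + alpha   # avoid zero probabilities
        return counts / counts.sum()

    def kl_divergence(v, w):
        """D_KL(v || w) for two distributions over the same vocabulary."""
        return float(np.sum(v * np.log(v / w)))

    # e.g. docs = np.stack([to_multinomial(c) for c in count_vectors])
    # and then run one-class IB with kl_divergence as B_F and a chosen K.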
Experiments 1: information retrieval
Comparison of precision-recall performance for a range of K/μ values.
[Figure: precision vs. recall curves]
Experiments 1: information retrieval
Centroids of clusters, and their distances from
the center of mass
Experiments 2: gene expression
A typical application: searching for small but interesting sets of genes.
Genes are represented by their expression profiles across tissues from different patients.
The Alizadeh et al. (2000) dataset (B-cell lymphoma tissues) includes mortality data, which can be used as an objective way to validate the quality of the selected genes.
Experiments 2: gene expression
One-class IB compared with one-class SVM (L2).
For a series of K values, the gene set with the lowest loss was found (10 restarts).
The set of genes was then used for regression against the mortality data.
[Figure: significance of the regression prediction (p-value), with the axis annotated from good to bad]
Future work: finding ALL relevant subsets
• Complete characterization of all interesting subsets in the data.
• Assume we have a function that assigns an interest value to each subset. We then search the space of subsets for all local maxima.
• This requires defining locality. A natural measure of locality in the subset space is the Hamming distance (see the sketch below).
• A complete characterization of the data requires descriptions over a range of local neighborhoods.
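A toy sketch of the search described above: subsets as boolean membership vectors, Hamming-1 neighbors obtained by flipping one bit, and a local-maximum test for a user-supplied interest function (all names are hypothetical):

    import numpy as np

    def hamming_neighbors(subset):
        """All subsets at Hamming distance 1 from the given boolean membership vector."""
        for i in range(len(subset)):
            neighbor = subset.copy()
            neighbor[i] = not neighbor[i]            # flip membership of element i
            yield neighbor

    def is_local_maximum(subset, interest):
        """True if no single-bit flip increases the interest value of the subset."""
        value = interest(subset)
        return all(interest(n) <= value for n in hamming_neighbors(subset))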
Future work: multiple one-class
• Synthetic example: two overlapping Gaussians and uniform background noise.
Conclusions
• We focus on one-class learning for cases where a small ball is sought.
• We formalize the problem using the Information Bottleneck and derive its formal solutions.
• One-class IB performs well in the regime of
small subsets.