Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila
Carnegie Mellon University
Multidimensional data
· Multidimensional (noisy) data
Learning
· Learning tasks - intelligent data analysis
· categorization (clustering)
· classification
· novelty detection
· probabilistic reasoning
· Data is changing, growing
· Tasks change
need to make learning automatic, efficient
Combining probability and algorithms
· Automatic: probability and statistics
· Efficient: algorithms
· This talk: the tree statistical model
Talk overview
· Introduction: statistical models
· Perspective: generative models and decision tasks
· The tree model
· Mixtures of trees
· Learning
· Experiments
· Accelerated learning
· Bayesian learning
A multivariate domain
· Data: patient records (Patient1, Patient2, ...), each described by the variables Smoker, Bronchitis, Lung cancer, Cough, X ray
· Statistical model
· Queries
· Diagnose a new patient: Lung cancer?
· Is smoking related to lung cancer?
· Understand the “laws” of the domain
Probabilistic approach
· Smoker, Bronchitis, ... are (discrete) random variables
· Statistical model (joint distribution)
P( Smoker, Bronchitis, Lung cancer, Cough, X ray )
summarizes knowledge about domain
· Queries:
· inference
e.g. P( Lung cancer = true | Smoker = true, Cough = false )
· structure of the model
• discovering relationships
• categorization
Probability table representation
v1v2:     00    01    11    10
v3 = 0   .01   .14   .22   .01
v3 = 1   .23   .03   .33   .03

· Query (see the sketch below):
P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) = .17 / .72 ≈ .24
· Curse of dimensionality: if v1, v2, ..., vn are binary variables, the table for P(v1, v2, ..., vn) has 2^n entries!
· How to represent?
· How to query?
· How to learn from data?
· Structure?
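The conditional query above can be reproduced in a few lines of NumPy. This is only an illustrative sketch of the table representation, not code from the talk; the array layout P[v1, v2, v3] is my own choice.

```python
import numpy as np

# Joint probability table P(v1, v2, v3); entries copied from the slide.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = .01; P[0, 1, 0] = .14; P[1, 1, 0] = .22; P[1, 0, 0] = .01
P[0, 0, 1] = .23; P[0, 1, 1] = .03; P[1, 1, 1] = .33; P[1, 0, 1] = .03

joint = P[0, 1, :].sum()           # P(v1=0, v2=1) = .17
marginal = P[:, 1, :].sum()        # P(v2=1)       = .72
print(round(joint / marginal, 2))  # 0.24
```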
Graphical models
· Structure
· vertices = variables
· edges = “direct dependencies”
· Parametrization
· by local probability tables
(Figure: example graphical model with vertices Galaxy type, spectrum, dust, distance, size, Z (redshift), observed spectrum, photometric measurement, observed size)
· compact parametric representation
· efficient computation
· learning parameters by a simple formula
· learning structure is NP-hard
The tree statistical model
· Structure: a tree (graph with no cycles)
· Parameters: probability tables associated to the edges
(Figure: example tree over vertices 1..5, with edge tables such as T_34 and conditionals such as T_4|3)

T(x) = ∏_{uv ∈ E} T_uv(x_u x_v) / ∏_{v ∈ V} T_v(x_v)^(deg v - 1)

or, equivalently, directing the edges away from a chosen root,

T(x) = T_root(x_root) ∏_{uv ∈ E} T_{v|u}(x_v | x_u)

• T(x) factors over the tree edges
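As a concrete illustration of the directed factorization, here is a minimal sketch (not from the talk) that evaluates T(x) on a toy tree over three binary variables; the parent map and the table values below are hypothetical.

```python
import numpy as np

# Toy tree: vertex 0 is the root, vertices 1 and 2 hang off it.
parent = {1: 0, 2: 0}
T_root = np.array([0.6, 0.4])                  # T_0(x_0)
T_cond = {                                     # T_{v|u}(x_v | x_u): rows x_u, columns x_v
    1: np.array([[0.9, 0.1], [0.2, 0.8]]),
    2: np.array([[0.7, 0.3], [0.5, 0.5]]),
}

def tree_prob(x):
    """T(x) = T_root(x_root) * product over edges of T_{v|u}(x_v | x_u)."""
    p = T_root[x[0]]
    for v, u in parent.items():
        p *= T_cond[v][x[u], x[v]]
    return p

print(tree_prob((0, 1, 0)))   # 0.6 * 0.1 * 0.7 = 0.042
```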
Examples
· Splice junction domain: junction type and sequence positions -7 ... -1, +1 ... +8
· Premature babies’ Broncho-Pulmonary Disease (BPD): PulmHemorrh, Coag, HyperNa, Acidosis, Gestation, Thrombocyt, Weight, Hypertension, Temperature, BPD, Neutropenia, Suspect, Lipid
Trees - basic operations
T(x) = ∏_{uv ∈ E} T_uv(x_u x_v) / ∏_{v ∈ V} T_v(x_v)^(deg v - 1),     |V| = n

Querying the model
· computing the likelihood T(x) ~ n
· conditioning T_{V-A|A} (junction tree algorithm) ~ n
· marginalization T_uv for arbitrary u, v ~ n
· sampling ~ n
Estimating the model
· fitting to a given distribution ~ n^2
· learning from data ~ n^2 N_data
· the tree is a simple model
The mixture of trees
Q(x) = Σ_{k=1..m} λ_k T_k(x)

h = “hidden” variable,  P(h = k) = λ_k,  k = 1, 2, ..., m
· NOT a graphical model
· computational efficiency preserved
(Meila 97)
Learning - problem formulation
· Maximum Likelihood learning
· given a data set D = { x1, . . . xN }
· find the model that best predicts the data
Topt = argmax T(D)
· Fitting a tree to a distribution
· given a data set D = { x1, . . . xN }
and distribution P that weights each data point,
· find
Topt = argmin KL( P || T )
· KL is the Kullback-Leibler divergence
· includes Maximum likelihood learning as a special case
Fitting a tree to a distribution
Topt = argmin KL( P || T )
· optimization over structure + parameters
· sufficient statistics
· probability tables  P_uv = N_uv / N,  for u, v ∈ V
· mutual informations
I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v)) ]
(Chow & Liu '68)
Fitting a tree to a distribution - solution
· Structure
E_opt = argmax_E Σ_{uv ∈ E} I_uv
· found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv (see the sketch below)
· Parameters
· copy the marginals of P:  T_uv = P_uv for uv ∈ E
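A compact sketch of the Chow & Liu fit for binary data: estimate the pairwise mutual informations from counts, then run a maximum weight spanning tree (Kruskal's algorithm) with those weights. The function names and the toy data set are my own, not from the talk.

```python
import numpy as np
from itertools import combinations

def mutual_info(data, u, v):
    """Empirical I(u;v) for binary columns u, v of data."""
    I = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((data[:, u] == a) & (data[:, v] == b))
            p_a = np.mean(data[:, u] == a)
            p_b = np.mean(data[:, v] == b)
            if p_ab > 0:
                I += p_ab * np.log(p_ab / (p_a * p_b))
    return I

def chow_liu_edges(data):
    """Maximum weight spanning tree (Kruskal) with edge weights I_uv."""
    n = data.shape[1]
    weights = sorted(((mutual_info(data, u, v), u, v)
                      for u, v in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    edges = []
    for w, u, v in weights:
        ru, rv = find(u), find(v)
        if ru != rv:              # adding uv does not create a cycle
            parent[ru] = rv
            edges.append((u, v))
    return edges

data = np.random.randint(0, 2, size=(500, 5))   # toy binary data set
print(chow_liu_edges(data))                     # n - 1 = 4 edges
```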
Learning mixtures by the EM algorithm
Meila & Jordan ‘97
E step:  which x_i come from T_k?  -->  distribution P_k(x)   (sketched below)
M step:  fit T_k to the set of points:  min KL( P_k || T_k )
· Initialize randomly
· converges to local maximum of the likelihood
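For concreteness, a minimal sketch of the E step alone: the responsibilities gamma[k, i] = lambda_k T_k(x_i) / Q(x_i) define the weighted distributions P_k that the M step fits the trees to. The "trees" below are stand-in product-of-Bernoulli evaluators, purely hypothetical.

```python
import numpy as np

def e_step(data, lambdas, tree_probs):
    """Posterior responsibilities gamma[k, i] = P(h = k | x_i)."""
    m, N = len(lambdas), len(data)
    gamma = np.zeros((m, N))
    for k in range(m):
        gamma[k] = [lambdas[k] * tree_probs[k](x) for x in data]
    gamma /= gamma.sum(axis=0, keepdims=True)   # normalize over the m components
    return gamma

# Toy usage with two stand-in component models (not real fitted trees).
data = np.random.randint(0, 2, size=(10, 4))
stand_ins = [lambda x: 0.5 ** len(x),
             lambda x: float(np.prod(np.where(x == 1, 0.7, 0.3)))]
resp = e_step(data, [0.5, 0.5], stand_ins)
print(resp.shape, resp.sum(axis=0))   # (2, 10), each column sums to 1
```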
Remarks
· Learning a tree
· solution is globally optimal over structures and parameters
· tractable: running time ~ n^2 N
· Learning a mixture by the EM algorithm
· both E and M steps are exact, tractable
· running time
• E step ~ mnN
• M step ~ m n^2 N
· assumes m known
· converges to local optimum
Finding structure - the bars problem
(Figure: the bars data, n = 25, and the learned structure)
· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 ± 0.08 (ambiguous), 0.95 ± 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 ± 0.95
Experiments - density estimation
· Digits and digit pairs
· N_train = 6000, N_valid = 2000, N_test = 5000
· single digits: n = 64 variables (m = 16 trees)
· digit pairs: n = 128 variables (m = 32 trees)
DNA splice junction classification
· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither
(Figure: classification accuracy of the Tree classifier compared with TANB, NB, and the supervised methods from DELVE)
Discovering structure
(Figure: adjacency matrix of the learned tree, with the class variable attached near the junction)

IE junction (Intron | Exon)
position: 15  16  ...  25  26  27  28  29  30  31
Tree:      -  CT       CT  CT   -  CT   A   G   G
True:     CT  CT       CT  CT   -  CT   A   G   G

EI junction (Exon | Intron)
position: 28  29  30  31  32 | 33  34  35  36
Tree:     CA   A   G   G   T | AG   A   G   -
True:     CA   A   G   G   T | AG   A   G   T

(true consensus from Watson, “The molecular biology of the gene”, 1987)
Irrelevant variables
61 original variables + 60 “noise” variables
(Figure: classification accuracy on the original data vs. the data augmented with irrelevant variables)
Accelerated tree learning
· Running time for the tree learning algorithm ~ n^2 N
· Quadratic running time may be too slow.
Example: document classification
· document = data point  -->  N = 10^3 to 10^4
· word = variable  -->  n = 10^3 to 10^4
· sparse data  -->  # words in a document ≤ s, with s << n, N
· Can sparsity be exploited to create faster algorithms?
Meila ‘99
Sparsity
· assume special value “0” that occurs frequently
· sparsity = s:  the number of non-zero variables in each data point is ≤ s,  with s << n, N
· Idea: “do not represent / count zeros” (sketched below)
(Figure: sparse binary rows such as 010000100001000, 000100000100000, 010000000000001, stored as linked lists of their non-zero positions, each of length ≤ s)
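The "do not represent zeros" idea in a few lines, using the bit strings from the slide: each data point is stored as the list of its non-zero variable indices, whose length is at most s.

```python
rows = [
    "010000100001000",
    "000100000100000",
    "010000000000001",
]
# Linked-list / index-list representation: only the non-zero positions are kept.
sparse = [[i for i, c in enumerate(r) if c != "0"] for r in rows]
print(sparse)   # [[1, 6, 11], [3, 9], [1, 14]]
```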
Presort mutual informations
Theorem (Meila '99). If v, v' are variables that do not co-occur with u (i.e. N_uv = N_uv' = 0), then
N_v > N_v'  ==>  I_uv > I_uv'
· Consequences
· sort N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv (see the numerical check below)
· these edges need not be represented explicitly
· construct black box that outputs next “largest” edge
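A small numerical check of the theorem (my own illustration, with hypothetical counts): when u and v never co-occur, I_uv depends on v only through N_v, and it increases with N_v.

```python
import numpy as np

def mi_no_cooccurrence(N, Nu, Nv):
    """I(u;v) for binary u, v that never co-occur (N_uv = 0), from counts."""
    # Joint counts: (u=1,v=1)=0, (u=1,v=0)=Nu, (u=0,v=1)=Nv, (u=0,v=0)=N-Nu-Nv
    table = np.array([[N - Nu - Nv, Nv], [Nu, 0.0]]) / N
    pu, pv = table.sum(axis=1), table.sum(axis=0)
    I = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if table[a, b] > 0:
                I += table[a, b] * np.log(table[a, b] / (pu[a] * pv[b]))
    return I

# Hypothetical counts: N = 1000 data points, N_u = 50; I_uv grows with N_v.
print([round(mi_no_cooccurrence(1000, 50, Nv), 5) for Nv in (10, 50, 200)])
```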
The black box data structure
(Figure: for each variable v, a list of the u with N_uv > 0 sorted by I_uv, plus a virtual list of the u with N_uv = 0 sorted by N_v; an F-heap of size ~ n over these lists outputs the next edge uv)
· Total running time:  n log n + s^2 N + nK log n
(standard algorithm running time: n^2 N)
Experiments - sparse binary data
(Figure: running times of the standard vs. the accelerated algorithm)
· N = 10,000
· s = 5, 10, 15, 100
Remarks
· Realistic assumption
· Exact algorithm, provably efficient time bounds
· Degrades slowly to the standard algorithm if the data are not sparse
· General
· non-integer counts
· multi-valued discrete variables
Bayesian learning of trees
Meila & Jaakkola ‘00
· Problem
· given prior distribution over trees P0(T)
data D = { x1, . . . xN }
· find posterior distribution P(T|D)
· Advantages
· incorporates prior knowledge
· regularization
· Solution
· Bayes’ formula
P(T|D) = (1/Z) P_0(T) ∏_{i=1..N} T(x_i)
· practically hard
• distribution over structure E and parameters θ_E is hard to represent
• computing Z is intractable in general
• exception: conjugate priors
Decomposable priors
P_0(T) = ∏_{uv ∈ E} f(u, v, θ_{u|v})
· want priors that factor over tree edges
· prior for structure E:
P_0(E) ∝ ∏_{uv ∈ E} β_uv
· prior for tree parameters:
P_0(θ_E) = ∏_{uv ∈ E} D(θ_{u|v}; N'_uv)
· (hyper) Dirichlet with hyper-parameters N'_uv(x_u x_v), u, v ∈ V
· the posterior is also Dirichlet, with hyper-parameters N_uv(x_u x_v) + N'_uv(x_u x_v), u, v ∈ V
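The conjugate update on a single edge is just an addition of count tables. The sketch below uses hypothetical 2x2 counts to make this concrete.

```python
import numpy as np

N_prime_uv = np.full((2, 2), 0.5)       # prior hyper-parameters N'_uv (hypothetical)
N_uv = np.array([[30.0, 5.0],           # observed pairwise counts N_uv (hypothetical)
                 [10.0, 55.0]])
posterior_uv = N_prime_uv + N_uv        # hyper-parameters of the posterior Dirichlet
print(posterior_uv)
```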
Decomposable posterior
· Posterior distribution
P(T|D) ∝ ∏_{uv ∈ E} W_uv,     W_uv = β_uv D(θ_{u|v}; N'_uv + N_uv)
· factored over edges
· same form as the prior
· Remains to compute the normalization constant
• discrete part (structures): graph theory
• continuous part (parameters): Meila & Jaakkola '99
The Matrix tree theorem
· Matrix tree theorem
If  P_0(E) = (1/Z) ∏_{uv ∈ E} β_uv,  with β_uv ≥ 0,
and M(β) is the matrix with entries
M_vv(β) = Σ_{v'} β_vv'  (diagonal),   M_uv(β) = -β_uv  (off-diagonal),
restricted to the rows and columns of all vertices but one,
then  Z = det M(β)
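A sketch that checks the theorem numerically on a small random symmetric weight matrix (my own illustration, not code from the talk): the determinant of the weight "Laplacian" with one row and column removed equals the brute-force sum, over all spanning trees, of the product of edge weights.

```python
import numpy as np
from itertools import combinations

def z_by_determinant(beta):
    L = np.diag(beta.sum(axis=1)) - beta   # M(beta): row sums on the diagonal, -beta off it
    return np.linalg.det(L[1:, 1:])        # remove the row and column of vertex 0

def z_by_enumeration(beta):
    """Brute force over all spanning trees of the complete graph (tiny n only)."""
    n = beta.shape[0]
    total = 0.0
    for edges in combinations(list(combinations(range(n), 2)), n - 1):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        acyclic = True
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:                        # n-1 acyclic edges = a spanning tree
            total += np.prod([beta[u, v] for u, v in edges])
    return total

rng = np.random.default_rng(0)
beta = rng.uniform(0.1, 1.0, size=(4, 4))
beta = (beta + beta.T) / 2
np.fill_diagonal(beta, 0.0)                # symmetric weights, no self-edges
print(z_by_determinant(beta), z_by_enumeration(beta))   # the two values agree
```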
Remarks on the decomposable prior
· Is a conjugate prior for the tree distribution
· Is tractable
· defined by ~ n^2 parameters
· computed exactly in ~ n^3 operations
· posterior obtained in ~ n^2 N + n^3 operations
· derivatives w.r.t. parameters, averaging, ... ~ n^3
· Mixtures of trees with decomposable priors
· MAP estimation with EM algorithm tractable
· Other applications
· ensembles of trees
· maximum entropy distributions on trees
So far . .
· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
• mixture of trees
• accelerated algorithm
• matrix tree theorem & Bayesian learning
· Examples of usage
· Structure learning
· Compression
· Classification
Generative models and discrimination
· Trees are generative models
· descriptive
· can perform many tasks suboptimally
· Maximum Entropy discrimination (Jaakkola,Meila,Jebara,’99)
· optimize for specific tasks
· use generative models
· combine simple models into ensembles
· complexity control by an information theoretic principle
· Discrimination tasks
· detecting novelty
· diagnosis
· classification
Bridging the gap
(Figure: bridging the gap between descriptive learning and discriminative learning of tasks)
Future . . .
· Tasks have structure
· multi-way classification
· multiple indexing of documents
· gene expression data
· hierarchical, sequential decisions
Learn structured decision tasks
· sharing information between tasks (transfer)
· modeling dependencies between decisions