Painless embeddings of distributions:
the function space view
Part 3 - Conditional independence with kernels
Arthur Gretton (MPI), Alex Smola (NICTA), Kenji Fukumizu (ISM)
[email protected]
ICML 2008 Tutorial
July 5, Helsinki, Finland
Outline of Part 3
I. Introduction
II. Conditional independence with kernels
III. Application to causal inference
IV. Summary
I. Introduction
Functional Space View
■ Embedding into RKHS: $\Phi(X) = k(\cdot, X)$
[Diagram: the feature map $\Phi$ sends a point X in Ω (the original space) to $\Phi(X) = k(\cdot, X)$ in H (the RKHS).]
■ Basic statistics on RKHS
– Mean element → characterizes a probability (a numpy sketch of the empirical mean element follows below)
– Covariance operator → independence/dependence
– Conditional covariance operator → conditional independence/dependence
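Below is a minimal numpy sketch of the empirical mean element, assuming a Gaussian RBF kernel; the helpers `rbf_kernel` and `mean_embedding` and the bandwidth choice are illustrative, not from the tutorial.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mean_embedding(X, sigma=1.0):
    # Empirical mean element mu_hat = (1/N) sum_i k(., X_i),
    # returned as a function that can be evaluated at any point.
    return lambda x: np.mean([rbf_kernel(x, Xi, sigma) for Xi in X])

X = np.random.randn(200, 2)        # sample from P
mu_hat = mean_embedding(X)
print(mu_hat(np.zeros(2)))         # evaluate the embedding of P at the origin
```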
Conditional Independence
■ Definition
X, Y, Z: random variables with joint probability density $p_{XYZ}(x, y, z)$.
X and Y are conditionally independent given Z if
$p_{Y|ZX}(y \mid z, x) = p_{Y|Z}(y \mid z)$
or, equivalently,
$p_{XY|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)$.
If Z is known, the information of X is not necessary to predict Y.
[Diagrams: a chain X → Z → Y and a fork X ← Z → Y, illustrating the two equivalent conditions.]
Example
■ Applications in statistical inference
– Graphical modeling: separability in a graph implies conditional independence.
– Causal inference: a formulation of causality is given by conditional independence.
■ Example: time series
– Does X cause Y?
Non-causality: $p(Y_t \mid Y_{t-1}, X_{t-1}) = p(Y_t \mid Y_{t-1})$?
i.e. $Y_t \perp X_{t-1} \mid Y_{t-1}$?
[Diagram: the two series X and Y, with an arrow $Y_{t-1} \to Y_t$ and a questioned arrow $X_{t-1} \to Y_t$.]
II. Conditional independence with kernels
Review: Conditional Independence for Gaussian Variables
■ Conditional covariance of Gaussian variables
– $(X, Y, Z)$: multidimensional jointly Gaussian variable.
– Conditional covariance matrix (a numpy sketch follows this slide):
$V_{YX|Z} \equiv \mathrm{Cov}[Y, X \mid Z = z] = V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX}$
($V_{XY}$ etc.: covariance matrices)
Note: $V_{YX|Z}$ does not depend on the value of z.
■ Conditional independence for Gaussian variables
$X \perp Y \mid Z \iff V_{XY|Z} = O$, i.e. $V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX} = O$.
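A small numpy sketch of this Gaussian formula; the helper `cond_cov` and the simulated data are illustrative. In the example, X and Y depend on each other only through Z, so the conditional covariance is close to zero.

```python
import numpy as np

def cond_cov(X, Y, Z):
    # Conditional covariance V_YX|Z = V_YX - V_YZ V_ZZ^{-1} V_ZX,
    # estimated from samples (rows = observations).
    V = np.cov(np.hstack([Y, X, Z]).T)
    dy, dx = Y.shape[1], X.shape[1]
    V_YX = V[:dy, dy:dy+dx]
    V_YZ = V[:dy, dy+dx:]
    V_ZX = V[dy+dx:, dy:dy+dx]
    V_ZZ = V[dy+dx:, dy+dx:]
    return V_YX - V_YZ @ np.linalg.solve(V_ZZ, V_ZX)

N = 5000
Z = np.random.randn(N, 1)
X = Z + 0.5 * np.random.randn(N, 1)
Y = -Z + 0.5 * np.random.randn(N, 1)
print(cond_cov(X, Y, Z))   # close to the zero matrix
```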
Conditional Covariance on RKHS
■ Conditional cross-covariance operator
X, Y, Z: random variables taking values in X, Y, Z (resp.).
$(H_X, k_X)$, $(H_Y, k_Y)$, $(H_Z, k_Z)$: RKHSs defined on X, Y, Z (resp.).
– Conditional cross-covariance operator $H_X \to H_Y$:
$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}$
($\Sigma_{YX}$ etc.: covariance operators)
c.f. $V_{YX|Z} = V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX}$
Note: $\Sigma_{ZZ}^{-1}$ may not exist, but we have the decomposition $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$ with $\|W_{YX}\| \le 1$. Rigorously, define
$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YY}^{1/2} W_{YZ} W_{ZX} \Sigma_{XX}^{1/2}$.
(* $A^{1/2} = U \Lambda^{1/2} U^T$ if $A = U \Lambda U^T$.)
Conditional Covariance and Conditional Covariance Operator
■ The conditional covariance operator expresses the conditional covariance.
Theorem (FBJ '06, Sun et al. '07)
X, Y, Z: random variables taking values in X, Y, Z (resp.).
$(H_X, k_X)$, $(H_Y, k_Y)$, $(H_Z, k_Z)$: RKHSs defined on X, Y, Z (resp.).
Assume $k_Z$ is a characteristic kernel. Then, for all $f \in H_X$ and $g \in H_Y$,
$\langle g, \Sigma_{YX|Z} f \rangle = E\big[\mathrm{Cov}[g(Y), f(X) \mid Z]\big]$
or
$\Sigma_{YX|Z} = E_Z\Big[\int \Phi_Y(Y) \otimes \Phi_X(X)\, dP(X, Y \mid Z)\Big] - E_Z\Big[\int \Phi_Y(Y) \otimes \Phi_X(X)\, dP(X \mid Z)\, dP(Y \mid Z)\Big]$
$= \mu_{XY} - E_Z\big[\mu_{Y|Z} \otimes \mu_{X|Z}\big]$
– c.f. for Gaussian variables:
$a^T V_{XY|Z} b = \mathrm{Cov}[a^T X, b^T Y \mid Z]$ (not dependent on the value of z)
Conditional Independence with Kernels
(FBJ 2004, FBJ 2006, Sun et al. 2007)
Extended variables are used (see the Gram-matrix sketch after this slide):
$\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$, with kernels $k_{\ddot X} = k_X k_Z$, $k_{\ddot Y} = k_Y k_Z$.
Theorem (FBJ '06, Sun et al. '07)
Assume the kernels $k_{\ddot X}$, $k_Y$, and $k_Z$ are characteristic. Then
$X \perp Y \mid Z \iff \Sigma_{Y \ddot X|Z} = O$
$(\iff \Sigma_{\ddot Y X|Z} = O \iff \Sigma_{\ddot Y \ddot X|Z} = O)$
– c.f. for Gaussian variables, $X \perp Y \mid Z \iff V_{XY|Z} = O$.
– With characteristic kernels, comparison between the (conditional) mean elements on the RKHS characterizes conditional independence.
– Why is the "extended variable" needed?
$\Sigma_{YX|Z} = O \;\Rightarrow\; p(x, y) = \int p(x|z)\, p(y|z)\, p(z)\, dz$
$\Sigma_{Y[X,Z]|Z} = O \;\Rightarrow\; p(x, y, z') = \int p(x, z'|z)\, p(y|z)\, p(z)\, dz$,
where $p(x, z'|z) = p(x|z)\,\delta(z' - z)$.
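For Gram matrices, the product kernel $k_{\ddot X} = k_X k_Z$ is simply an elementwise product. A minimal sketch; the helper `rbf_gram` and the bandwidth are illustrative assumptions, not from the tutorial:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

N = 100
X, Z = np.random.randn(N, 1), np.random.randn(N, 1)
# Gram matrix of the extended variable Xdd = (X, Z):
# k_Xdd((x, z), (x', z')) = k_X(x, x') * k_Z(z, z')
K_Xdd = rbf_gram(X) * rbf_gram(Z)   # elementwise (Hadamard) product
```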
Measure of Conditional Independence
■ Hilbert-Schmidt norm of the conditional covariance operator:
$\mathrm{HSCIC} = \|\Sigma_{\ddot X \ddot Y|Z}\|_{HS}^2$, with $\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$.
With characteristic kernels, $\mathrm{HSCIC} = 0 \iff X \perp Y \mid Z$.
■ Empirical estimation is painless! (A numpy sketch follows this slide.)
$(X_1, Y_1, Z_1), \ldots, (X_N, Y_N, Z_N)$: data.
$\Sigma_{\ddot X Z} \to \hat\Sigma_{\ddot X Z}^{(N)} = \frac{1}{N}\sum_{i=1}^N \big(k_{\ddot X}(\cdot, \ddot X_i) - \hat m_{\ddot X}\big) \otimes \big(k_Z(\cdot, Z_i) - \hat m_Z\big), \qquad \Sigma_{ZZ}^{-1} \to \big(\hat\Sigma_{ZZ}^{(N)} + \varepsilon_N I\big)^{-1}$
$\mathrm{HSCIC} = \|\Sigma_{\ddot X \ddot Y|Z}\|_{HS}^2 \;\to\; \mathrm{HSCIC}_{emp} = \big\|\hat\Sigma_{\ddot Y \ddot X} - \hat\Sigma_{\ddot Y Z}\big(\hat\Sigma_{ZZ} + \varepsilon_N I\big)^{-1}\hat\Sigma_{Z \ddot X}\big\|_{HS}^2$
$\mathrm{HSCIC}_{emp} = \mathrm{Tr}\big[\tilde K_{\ddot X}\tilde K_{\ddot Y} - 2\,\tilde K_{\ddot X}(\tilde K_Z + N\varepsilon_N I_N)^{-1}\tilde K_Z \tilde K_{\ddot Y} + \tilde K_Z(\tilde K_Z + N\varepsilon_N I_N)^{-1}\tilde K_{\ddot X}(\tilde K_Z + N\varepsilon_N I_N)^{-1}\tilde K_Z \tilde K_{\ddot Y}\big]$
($\tilde K$: centered Gram matrix)
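A numpy sketch of the trace formula above; `center` and `hscic_emp` are illustrative names, and the regularizer eps (ε_N) must be chosen by the user:

```python
import numpy as np

def center(K):
    # Centered Gram matrix K_tilde = H K H with H = I - (1/N) 11^T
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def hscic_emp(K_Xdd, K_Ydd, K_Z, eps):
    # HSCIC_emp = Tr[Kx Ky - 2 Kx G Kz Ky + Kz G Kx G Kz Ky],
    # where G = (Kz + N*eps*I)^{-1} and all Gram matrices are centered.
    N = K_Z.shape[0]
    Kx, Ky, Kz = center(K_Xdd), center(K_Ydd), center(K_Z)
    G = np.linalg.inv(Kz + N * eps * np.eye(N))
    return np.trace(Kx @ Ky - 2 * Kx @ G @ Kz @ Ky
                    + Kz @ G @ Kx @ G @ Kz @ Ky)
```

The inputs would be Gram matrices of the extended variables, e.g. `K_Xdd = rbf_gram(X) * rbf_gram(Z)` as in the previous sketch.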
Normalized Cond. Covariance
■ Normalized conditional cross-covariance operator:
$W_{YX|Z} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX|Z}\, \Sigma_{XX}^{-1/2} = \Sigma_{YY}^{-1/2}\big(\Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}\big)\Sigma_{XX}^{-1/2}$
Recall: $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$.
$\mathrm{HSNCIC} = \|W_{\ddot X \ddot Y|Z}\|_{HS}^2$
$\mathrm{HSNCIC}_{emp} = \mathrm{Tr}\big[R_{\ddot X} R_{\ddot Y} - 2\, R_{\ddot X} R_{\ddot Y} R_Z + R_{\ddot X} R_Z R_{\ddot Y} R_Z\big]$,
where $R_{\ddot X} \equiv \tilde K_{\ddot X}\big(\tilde K_{\ddot X} + N\varepsilon_N I_N\big)^{-1}$ etc. (A numpy sketch follows this slide.)
■ Kernel-free expression. With characteristic kernels,
$\|W_{\ddot Y \ddot X|Z}\|_{HS}^2 = \iiint \left(\frac{p_{XYZ}(x, y, z) - p_{X|Z}(x|z)\, p_{Y|Z}(y|z)\, p_Z(z)}{p_{XZ}(x, z)\, p_{YZ}(y, z)}\right)^2 p_{XZ}(x, z)\, p_{YZ}(y, z)\, dx\, dy\, dz$
("conditional" mean square contingency)
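The normalized statistic in numpy, under the same assumptions (illustrative helper name, user-chosen ε_N):

```python
import numpy as np

def hsncic_emp(K_Xdd, K_Ydd, K_Z, eps):
    # HSNCIC_emp = Tr[Rx Ry - 2 Rx Ry Rz + Rx Rz Ry Rz],
    # with R = K_tilde (K_tilde + N*eps*I)^{-1} per centered Gram matrix.
    N = K_Z.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N     # centering matrix
    def R(K):
        Kt = H @ K @ H
        return Kt @ np.linalg.inv(Kt + N * eps * np.eye(N))
    Rx, Ry, Rz = R(K_Xdd), R(K_Ydd), R(K_Z)
    return np.trace(Rx @ Ry - 2 * Rx @ Ry @ Rz + Rx @ Rz @ Ry @ Rz)
```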
Conditional Independence Test
■ Background
– There are no good methods for conditional independence tests on non-Gaussian continuous variables (existing approaches, e.g., discretize all variables).
■ Permutation test with the kernel measure (a sketch follows this slide)
$T_N = \mathrm{HSNCIC}_{emp}$ or $T_N = \mathrm{HSCIC}_{emp}$
– Partition the values of Z into $C_1, \ldots, C_L$, and define $A_l = \{i \mid Z_i \in C_l\}$ $(l = 1, \ldots, L)$.
– Resampling (for b = 1, 2, ...):
1. Generate a pseudo conditionally independent sample $D^{(b)}$ by permuting X within each $A_l$.
2. Compute $T_N^{(b)}$ for the sample $D^{(b)}$.
Approximate the null distribution by the resampled statistics.
– Set the threshold for the significance level (e.g. 5%).
[Diagram: the pairs $(X_i, Y_i)$ are grouped into blocks $C_1, \ldots, C_L$ by the value of Z, and the X values are permuted within each block.]
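A sketch of the permutation test for scalar Z, reusing the illustrative `rbf_gram` and `hsncic_emp` helpers from the earlier sketches; the quantile binning and the choices of L, B, and eps are illustrative:

```python
import numpy as np

def perm_test(X, Y, Z, eps=1e-3, L=5, B=200, seed=0):
    # Bin Z into L quantile bins; permuting X within a bin approximately
    # preserves p(X|Z) while breaking any X-Y dependence given Z.
    rng = np.random.default_rng(seed)
    edges = np.quantile(Z[:, 0], np.linspace(0, 1, L + 1))
    bins = np.clip(np.digitize(Z[:, 0], edges[1:-1]), 0, L - 1)
    Kz = rbf_gram(Z)
    Ky = rbf_gram(Y) * Kz                  # Gram matrix of Ydd = (Y, Z)
    def stat(Xs):
        return hsncic_emp(rbf_gram(Xs) * Kz, Ky, Kz, eps)
    T = stat(X)
    null = []
    for _ in range(B):
        Xp = X.copy()
        for l in range(L):
            idx = np.where(bins == l)[0]
            Xp[idx] = X[rng.permutation(idx)]
        null.append(stat(Xp))
    pval = np.mean(np.array(null) >= T)    # reject independence if pval < 0.05
    return T, pval
```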
Application to Graphical Modeling
– Three continuous variables of medical measurements, N = 35 (Edwards 2000, Sec. 3.1.4):
Creatinine clearance (C), Digoxin clearance (D), Urine flow (U).

             Kernel method (permut. test)   Linear method
             HSN(C)IC   P-val.              (partial) cor.            P-val.
D ⊥ U | C    1.458      0.924               Parcor(D,U|C) = 0.4847    0.0037
C-D          0.776      <0.001              Cor(C,D) = 0.7754         0.0000
C-U          0.194      0.117               Cor(C,U) = 0.3092         0.0707
D-U          0.343      0.023               Cor(D,U) = 0.5309         0.0010

– Undirected graphical model suggested by the kernel method:
[Diagram: edges C-D and C-U; no edge between D and U.]
The conditional independence D ⊥ U | C coincides with the medical knowledge.
III. Application to causal inference
Causal Inference
■ With manipulation (intervention)
Is X a cause of Y? Manipulate X and observe Y.
This setting is easier (do-calculus, Pearl 1995).
■ No manipulation / with temporal information
$X(t)$, $Y(t)$: observed time series. Are $X(1), \ldots, X(t)$ a cause of $Y(t+1)$?
■ No manipulation / no temporal information
Only joint observations of X and Y: causal inference is harder.
Causality of Time Series
■ Causality by conditional independence
– Extended notion of Granger causality (linear AR):
X is NOT a cause of Y if
$p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p}, X_{t-1}, \ldots, X_{t-p}) = p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})$,
i.e. $Y_t \perp (X_{t-1}, \ldots, X_{t-p}) \mid (Y_{t-1}, \ldots, Y_{t-p})$.
[Diagram: the two series X and Y, with an arrow $Y_{t-1} \to Y_t$ and a questioned arrow $X_{t-1} \to Y_t$.]
– Kernel measures for causality (see the lag-embedding sketch below):
$\mathrm{HSCIC} = \|\hat\Sigma^{(N-p+1)}_{Y \ddot X_p|Y_p}\|_{HS}^2, \qquad \mathrm{HSNCIC} = \|\hat W^{(N-p+1)}_{Y \ddot X_p|Y_p}\|_{HS}^2$
where
$X_p = \{(X_{t-1}, X_{t-2}, \ldots, X_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N\}$
$Y_p = \{(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N\}$
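Constructing the lagged variables in numpy; `lag_matrix` is an illustrative helper, not from the tutorial:

```python
import numpy as np

def lag_matrix(x, p):
    # Rows (x_{t-1}, ..., x_{t-p}) for t = p+1, ..., N  ->  shape (N-p, p)
    N = len(x)
    return np.column_stack([x[p - k - 1 : N - k - 1] for k in range(p)])

x = np.arange(1, 11)        # x_1, ..., x_10
Xp = lag_matrix(x, p=2)     # first row (t = 3) is (x_2, x_1) = (2, 1)
# To test "X does not cause Y", feed Y_t, the extended variable (X_p, Y_p),
# and the conditioning variable Y_p into the kernel measures defined earlier.
```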
Example
■ Coupled Hénon map (a simulation sketch follows below)
– X: $x_1(t+1) = 1.4 - x_1(t)^2 + 0.3\, x_2(t)$, $\quad x_2(t+1) = x_1(t)$
– Y: $y_1(t+1) = 1.4 - \big\{\gamma\, x_1(t)\, y_1(t) + (1 - \gamma)\, y_1(t)^2\big\} + 0.1\, y_2(t)$, $\quad y_2(t+1) = y_1(t)$
X drives Y with coupling strength γ; there is no feedback from Y to X.
[Scatter plots: $x_1$ vs. $x_2$, and $x_1$ vs. $y_1$ for γ = 0, γ = 0.25, and γ = 0.8; the dependence between $x_1$ and $y_1$ strengthens as γ grows.]
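A short generator for the coupled map, following the equations above; the initial conditions and the burn-in length are illustrative choices:

```python
import numpy as np

def coupled_henon(N, gamma, burn=500, seed=0):
    # Simulate the coupled Henon map; X drives Y with strength gamma.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.1, 0.1, 2)   # (x1, x2)
    y = rng.uniform(-0.1, 0.1, 2)   # (y1, y2)
    X, Y = [], []
    for t in range(N + burn):
        x1, x2 = x
        y1, y2 = y
        x = np.array([1.4 - x1**2 + 0.3 * x2, x1])
        y = np.array([1.4 - (gamma * x1 * y1 + (1 - gamma) * y1**2)
                      + 0.1 * y2, y1])
        if t >= burn:
            X.append(x[0]); Y.append(y[0])
    return np.array(X), np.array(Y)

X, Y = coupled_henon(100, gamma=0.25)
# Feed lag_matrix(X, p) and lag_matrix(Y, p) into the causality measures above.
```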
■ Causality of the coupled Hénon map
– X is a cause of Y if γ > 0; Y is not a cause of X for any γ.
The two non-causality hypotheses tested:
$Y_t \perp (X_{t-1}, \ldots, X_{t-p}) \mid (Y_{t-1}, \ldots, Y_{t-p})$
$X_t \perp (Y_{t-1}, \ldots, Y_{t-p}) \mid (X_{t-1}, \ldots, X_{t-p})$
– Permutation tests for non-causality with $\mathrm{HSNCIC} = \|\hat W^{(N-p+1)}_{Y \ddot X_p|Y_p}\|_{HS}^2$.
– N = 100, significance level α = 5%.
[Plots: ratio of accepting non-causality (out of 100 experiments) vs. γ, for HSNCIC and Granger tests. Left: X → Y, which is causal for γ > 0 (the causal area); right: Y → X, non-causal for all γ.]
Causal Inference from Non-experimental Data
■ Why is it possible?
– V-structure: the orientation X → Z ← Y is distinguishable from the others.
– The remaining directions are not distinguishable: the fork and chains X ← Z → Y, X ← Z ← Y, and X → Z → Y all factorize as
$p(x|z)\, p(y|z)\, p(z) = p(x|z)\, p(z|y)\, p(y) = p(y|z)\, p(z|x)\, p(x) = p(x, y, z)$,
and each of them implies $X \perp Y \mid Z$.
– Constraint-based causal learning
• Determine the conditional independences of the underlying probability.
• Markov assumption: the data is generated by a DAG.
Causal Learning
■ Inductive causation (IC, Verma & Pearl 1990)
– Basic idea (see the skeleton-search sketch after this slide):
• List all conditional independence/dependence relations among the variables.
• Build an undirected graph under the Markov assumption.
• Orient the edges by finding V-structures.
– PC algorithm (Peter Spirtes & Clark Glymour 1991)
• Efficient implementation of IC.
• Gaussian or discrete assumptions for the conditional independence tests.
■ Kernel Causal Learning (KCL, Sun et al. ICML 2007)
– Kernel test of conditional independence, for both continuous and discrete variables.
– Edge directions are determined by voting.
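A toy sketch of the skeleton-search step shared by IC and PC, assuming a user-supplied conditional independence oracle (for instance the kernel permutation test above); `ci_test` and its interface are illustrative:

```python
from itertools import combinations

def skeleton(variables, ci_test):
    # Start from the complete undirected graph; remove the edge (a, b)
    # if a and b are conditionally independent given some subset S
    # of the remaining variables (the IC / PC skeleton step).
    edges = set(combinations(variables, 2))
    sepset = {}
    for (a, b) in list(edges):
        rest = [v for v in variables if v not in (a, b)]
        removed = False
        for size in range(len(rest) + 1):
            for S in combinations(rest, size):
                if ci_test(a, b, S):          # True iff a ⊥ b | S
                    edges.discard((a, b))
                    sepset[(a, b)] = S        # used to find V-structures
                    removed = True
                    break
            if removed:
                break
    return edges, sepset
```

V-structures are then oriented as a → c ← b whenever a and b are non-adjacent, share the neighbor c, and c does not appear in sepset[(a, b)]; KCL additionally decides the remaining directions by voting.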
■ Experiment: Montana Economic Outlook Poll
– Data: 7 discrete variables, N = 209.
AGE (3), SEX (2), INCOME (3), POLITICAL (3), AREA (3), FINANCIAL status (3, better/same/worse than a year ago), OUTLOOK (2)
[Diagrams: the causal graphs over the seven variables estimated by KCL, FCI, and BN-PC.]
BN-PC is a constraint-based method using mutual information (Cheng et al. 2002).
FCI is the fast IC algorithm, which allows hidden variables (Spirtes et al. 1993).
Summary of Part 3
■ Conditional independence with kernels
– The conditional covariance operator on an RKHS characterizes conditional independence.
– Its Hilbert-Schmidt norm, estimated from a finite sample, gives a kernel measure of conditional independence.
– The kernel method gives a unified conditional independence test for continuous and discrete variables.
■ Causal inference with kernels
– Kernel conditional independence tests are applied to causal inference, such as:
• causality of time series (extension of Granger causality)
• causal inference from non-experimental data (constraint-based causal learning)
References
Berlinet, A. and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers (2004).
Cheng, J., R. Greiner, J. Kelly, D. A. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence Journal 137:43-90 (2002).
Edwards, D. Introduction to Graphical Modelling. Springer-Verlag, New York (2000).
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in NIPS 20:489-496 (2008).
Fukumizu, K., F. Bach, and M. Jordan. Kernel dimension reduction in regression. Tech. Report 715, Dept. of Statistics, University of California, Berkeley (2006).
Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37:424-438 (1969).
Pearl, J. Causality. Cambridge University Press (2000).
Spirtes, P. and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review 9:62-72 (1991).
Spirtes, P., C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York (1993).
Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Intern. Conf. Machine Learning (ICML 2007), pp. 855-862 (2007).
Verma, T. and J. Pearl. Equivalence and synthesis of causal models. Proc. 6th Conf. Uncertainty in Artificial Intelligence (UAI 1990), pp. 220-227 (1990).