Painless Embeddings of Distributions: The Function Space View
Part 3: Conditional Independence with Kernels
Arthur Gretton (MPI), Alex Smola (NICTA), Kenji Fukumizu (ISM)
[email protected]
ICML 2008 Tutorial, July 5, Helsinki, Finland (3-1)

Outline of Part 3 (3-2)
I.   Introduction
II.  Conditional independence with kernels
III. Application to causal inference
IV.  Summary

I. Introduction (3-3)

Functional Space View (3-4)
– Embedding into an RKHS: the feature map Φ sends a point X in the original space Ω to Φ(X) = k(·, X) in the RKHS H.
– Basic statistics on the RKHS:
  • mean element → characterizes a probability
  • covariance operator → independence / dependence
  • conditional covariance operator → conditional independence / dependence

Conditional Independence: Definition (3-5)
– X, Y, Z: random variables with joint probability density p_XYZ(x, y, z).
– X and Y are conditionally independent given Z, written X ⫫ Y | Z, if
    p_{Y|ZX}(y | z, x) = p_{Y|Z}(y | z),
  or equivalently
    p_{XY|Z}(x, y | z) = p_{X|Z}(x | z) p_{Y|Z}(y | z).
– Interpretation: if Z is known, the information in X is not needed to predict Y.

Example Applications in Statistical Inference (3-6)
– Graphical modeling: separation in a graph implies conditional independence.
– Causal inference: one formulation of causality is stated in terms of conditional independence.
  Example (time series): does X cause Y? Non-causality means
    p(Y_t | Y_{t-1}, X_{t-1}) = p(Y_t | Y_{t-1}),   i.e.   Y_t ⫫ X_{t-1} | Y_{t-1}.

II. Conditional Independence with Kernels (3-7)

Review: Conditional Independence for Gaussian Variables (3-8)
– (X, Y, Z): a multidimensional, jointly Gaussian variable.
– Conditional covariance matrix:
    V_{YX|Z} ≡ Cov[Y, X | Z = z] = V_{YX} − V_{YZ} V_{ZZ}^{-1} V_{ZX}
  (V_{XY} etc. denote covariance matrices). Note that V_{YX|Z} does not depend on the value of z.
– Conditional independence for Gaussian variables:
    X ⫫ Y | Z  ⇔  V_{XY|Z} = O,   i.e.   V_{YX} − V_{YZ} V_{ZZ}^{-1} V_{ZX} = O.

Conditional Covariance on RKHS (3-9)
– X, Y, Z: random variables taking values in X, Y, Z (resp.); (H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHSs defined on X, Y, Z (resp.).
– Conditional cross-covariance operator H_X → H_Y:
    Σ_{YX|Z} ≡ Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}
  (Σ_{YX} etc. denote covariance operators); cf. V_{YX|Z} = V_{YX} − V_{YZ} V_{ZZ}^{-1} V_{ZX}.
– Note: Σ_{ZZ}^{-1} may not exist. However, we have the decomposition Σ_{YX} = Σ_{YY}^{1/2} W_{YX} Σ_{XX}^{1/2} with ‖W_{YX}‖ ≤ 1, so the operator is rigorously defined as
    Σ_{YX|Z} ≡ Σ_{YX} − Σ_{YY}^{1/2} W_{YZ} W_{ZX} Σ_{XX}^{1/2}.
  (Here A^{1/2} = U Λ^{1/2} U^T when A = U Λ U^T.)

Conditional Covariance and the Conditional Covariance Operator (3-10)
– The conditional covariance operator expresses the conditional covariance.
  Theorem (FBJ '06, Sun et al. '07). Let X, Y, Z take values in X, Y, Z with RKHSs (H_X, k_X), (H_Y, k_Y), (H_Z, k_Z), and assume k_Z is a characteristic kernel. Then for all f ∈ H_X and g ∈ H_Y,
    ⟨g, Σ_{YX|Z} f⟩ = E_Z[ Cov[g(Y), f(X) | Z] ],
  or, in operator form,
    Σ_{YX|Z} = E_Z[ ∫ Φ_Y(Y) ⊗ Φ_X(X) dP(X, Y | Z) ] − E_Z[ ∫ Φ_Y(Y) ⊗ Φ_X(X) dP(X | Z) dP(Y | Z) ].
– cf. for Gaussian variables, a^T V_{XY|Z} b = Cov[a^T X, b^T Y | Z] (again not dependent on the value of z).

Conditional Independence with Kernels (FBJ 2004, FBJ 2006, Sun et al. 2007) (3-11)
– Extended variables are used (a numerical sketch of this construction follows below):
    Ẍ = (X, Z),  Ÿ = (Y, Z),   with product kernels   k_Ẍ = k_X · k_Z,   k_Ÿ = k_Y · k_Z.
– Theorem (FBJ '06, Sun et al. '07). Assume the kernels k_Ẍ, k_Ÿ, and k_Z are characteristic. Then
    X ⫫ Y | Z  ⇔  Σ_{YẌ|Z} = O   ( ⇔  Σ_{ŸX|Z} = O  ⇔  Σ_{ŸẌ|Z} = O ).
  cf. for Gaussian variables, X ⫫ Y | Z ⇔ V_{XY|Z} = O.
– With characteristic kernels, comparison of the (conditional) mean elements on the RKHS characterizes conditional independence.
– Why is the "extended variable" needed? Without it we only obtain
    Σ_{YX|Z} = O  ⇒  p(x, y) = ∫ p(x | z) p(y | z) p(z) dz,
  which is weaker than conditional independence, whereas with the extended variable
    Σ_{Y[X,Z]|Z} = O  ⇒  p(x, y, z′) = ∫ p(x, z′ | z) p(y | z) p(z) dz,   where p(x, z′ | z) = p(x | z) δ(z′ − z),
  which is exactly p_{XY|Z} = p_{X|Z} p_{Y|Z}.
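As a concrete illustration of the extended-variable construction above (an added sketch, not from the tutorial slides): under the product kernels, the Gram matrices of Ẍ = (X, Z) and Ÿ = (Y, Z) are simply elementwise products of the individual Gram matrices. The Gaussian RBF kernel, the toy data, and all parameter values below are illustrative assumptions.

```python
import numpy as np

def rbf_gram(A, sigma=1.0):
    """Gram matrix of a Gaussian RBF kernel (a characteristic kernel on R^d)."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Toy data: N joint samples of (X, Y, Z) in which X and Y are independent given Z.
rng = np.random.default_rng(0)
N = 200
Z = rng.normal(size=(N, 1))
X = Z + 0.1 * rng.normal(size=(N, 1))
Y = Z + 0.1 * rng.normal(size=(N, 1))

K_X, K_Y, K_Z = rbf_gram(X), rbf_gram(Y), rbf_gram(Z)

# Extended variables: k_Xext((x, z), (x', z')) = k_X(x, x') * k_Z(z, z'),
# so the Gram matrix is an elementwise (Hadamard) product.
K_Xext = K_X * K_Z
K_Yext = K_Y * K_Z
```

These are the Gram matrices that the empirical HSCIC / HSNCIC formulas on the following slides operate on.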
Measure of Conditional Independence (3-12)
– Hilbert–Schmidt norm of the conditional cross-covariance operator, with the extended variables Ẍ = (X, Z), Ÿ = (Y, Z):
    HSCIC = ‖Σ_{ẌŸ|Z}‖²_HS.
  With characteristic kernels, HSCIC = 0 ⇔ X ⫫ Y | Z.
– Empirical estimation is painless. Given data (X_1, Y_1, Z_1), …, (X_N, Y_N, Z_N), plug in the empirical operators, e.g.
    Σ_{ẌZ} → Σ̂_{ẌZ}^{(N)} = (1/N) Σ_{i=1}^{N} (k_Ẍ(·, Ẍ_i) − m̂_Ẍ) ⊗ (k_Z(·, Z_i) − m̂_Z),
    Σ_{ZZ}^{-1} → (Σ̂_{ZZ}^{(N)} + ε_N I)^{-1},
  giving
    HSCIC_emp = ‖Σ̂_{ŸẌ}^{(N)} − Σ̂_{ŸZ}^{(N)} (Σ̂_{ZZ}^{(N)} + ε_N I)^{-1} Σ̂_{ZẌ}^{(N)}‖²_HS.
– In terms of centered Gram matrices K̃,
    HSCIC_emp = Tr[ K̃_Ẍ K̃_Ÿ − 2 K̃_Ẍ (K̃_Z + Nε_N I_N)^{-1} K̃_Z K̃_Ÿ
                    + K̃_Z (K̃_Z + Nε_N I_N)^{-1} K̃_Ẍ (K̃_Z + Nε_N I_N)^{-1} K̃_Z K̃_Ÿ ].

Normalized Conditional Covariance (3-13)
– Normalized conditional cross-covariance operator:
    W_{YX|Z} = Σ_{YY}^{-1/2} Σ_{YX|Z} Σ_{XX}^{-1/2} = Σ_{YY}^{-1/2} (Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}) Σ_{XX}^{-1/2}
  (recall Σ_{YX} = Σ_{YY}^{1/2} W_{YX} Σ_{XX}^{1/2}).
– HSNCIC = ‖W_{ẌŸ|Z}‖²_HS, with empirical estimate
    HSNCIC_emp = Tr[ R_Ẍ R_Ÿ − 2 R_Ẍ R_Ÿ R_Z + R_Ẍ R_Z R_Ÿ R_Z ],   where R_Ẍ ≡ K̃_Ẍ (K̃_Ẍ + Nε_N I_N)^{-1}, etc.
– Kernel-free expression: with characteristic kernels,
    ‖W_{ŸẌ|Z}‖²_HS = ∫∫∫ [ (p_XYZ(x, y, z) − p_{X|Z}(x | z) p_{Y|Z}(y | z) p_Z(z)) / (p_XZ(x, z) p_YZ(y, z)) ]² p_XZ(x, z) p_YZ(y, z) dx dy dz,
  a "conditional" mean square contingency.

Conditional Independence Test (3-14)
– Background: there are few good conditional independence tests for non-Gaussian continuous variables (a common workaround is to discretize all variables).
– Permutation test with the kernel measure, T_N = HSNCIC_emp (or HSCIC_emp); a sketch of the statistic and the permutation scheme follows below:
  • Partition the values of Z into C_1, …, C_L and define A_l = {i | Z_i ∈ C_l} (l = 1, …, L).
  • Resampling: for b = 1, 2, …
    1. Generate a pseudo conditionally independent sample D^(b) by permuting X within each A_l.
    2. Compute T_N^(b) for the sample D^(b).
  • Approximate the null distribution by these samples, and set the threshold for the chosen significance level (e.g. 5%).
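The empirical HSNCIC and the Z-stratified permutation test described above can be written directly in terms of Gram matrices. Below is a minimal numpy sketch, assuming centered Gram matrices K̃ and the regularization ε_N as in the slide's formula; the function names, the binning of Z into blocks A_l, and the parameter values (ε, number of permutations) are illustrative choices rather than anything prescribed by the tutorial.

```python
import numpy as np

def center(K):
    """Centered Gram matrix K~ = H K H with H = I - (1/N) 11^T."""
    N = K.shape[0]
    H = np.eye(N) - np.full((N, N), 1.0 / N)
    return H @ K @ H

def R_mat(K, eps):
    """R = K~ (K~ + N*eps*I)^(-1), as in the empirical HSNCIC formula."""
    N = K.shape[0]
    Kc = center(K)
    return Kc @ np.linalg.inv(Kc + N * eps * np.eye(N))

def hsncic_emp(K_Xext, K_Yext, K_Z, eps=1e-3):
    """HSNCIC_emp = Tr[Rx Ry - 2 Rx Ry Rz + Rx Rz Ry Rz]."""
    Rx, Ry, Rz = (R_mat(K, eps) for K in (K_Xext, K_Yext, K_Z))
    return float(np.trace(Rx @ Ry - 2.0 * Rx @ Ry @ Rz + Rx @ Rz @ Ry @ Rz))

def conditional_permutation_test(K_X, K_Y, K_Z, z_bins, n_perm=200, eps=1e-3, seed=0):
    """Permute X within each block of Z values (pseudo conditionally independent
    samples) and compare the observed statistic with the permutation null."""
    rng = np.random.default_rng(seed)
    K_Yext = K_Y * K_Z                           # extended variable (Y, Z)
    stat = lambda KX: hsncic_emp(KX * K_Z, K_Yext, K_Z, eps)
    t_obs = stat(K_X)
    idx = np.arange(K_X.shape[0])
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = idx.copy()
        for lbl in np.unique(z_bins):            # permute indices within each block A_l
            members = np.where(z_bins == lbl)[0]
            perm[members] = rng.permutation(members)
        null[b] = stat(K_X[np.ix_(perm, perm)])  # Gram matrix of the permuted X
    p_value = float(np.mean(null >= t_obs))
    return t_obs, p_value
```

With K_X, K_Y, K_Z built as in the previous sketch and z_bins obtained, for example, by quantile-binning a scalar Z, a p-value below the chosen level (say 5%) rejects X ⫫ Y | Z.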
Application to Graphical Modeling (3-15)
– Three continuous variables of medical measurements, N = 35 (Edwards 2000, Sec. 3.1.4): Creatinine clearance (C), Digoxin clearance (D), Urine flow (U).

    Relation     Kernel method (permutation test)    Linear method
                 HSN(C)IC    P-value                 (partial) correlation       P-value
    D ⫫ U | C    1.458       0.924                   Parcor(D, U | C) = 0.4847   0.0037
    C, D         0.776       < 0.001                 Cor(C, D) = 0.7754          0.0000
    C, U         0.194       0.117                   Cor(C, U) = 0.3092          0.0707
    D, U         0.343       0.023                   Cor(D, U) = 0.5309          0.0010

– Undirected graphical model suggested by the kernel method: D – C – U (no direct edge between D and U). The conditional independence D ⫫ U | C coincides with the medical knowledge.

III. Application to Causal Inference (3-16)

Causal Inference (3-17)
– With manipulation (intervention): is X a cause of Y? Manipulate X and observe Y. This setting is easier (do-calculus, Pearl 1995).
– No manipulation, but temporal information: for observed time series X(t), Y(t), are X(1), …, X(t) a cause of Y(t+1)?
– No manipulation and no temporal information: X → Y? Here causal inference is harder.

Causality of Time Series (3-18)
– Causality by conditional independence, an extended notion of Granger causality (which uses linear AR models): X is NOT a cause of Y if
    p(Y_t | Y_{t-1}, …, Y_{t-p}, X_{t-1}, …, X_{t-p}) = p(Y_t | Y_{t-1}, …, Y_{t-p}),
  i.e.   Y_t ⫫ X_{t-1}, …, X_{t-p} | Y_{t-1}, …, Y_{t-p}.
– Kernel measures for causality: with the lagged blocks
    X_p = {(X_{t-1}, X_{t-2}, …, X_{t-p}) ∈ R^p | t = p+1, …, N},   Y_p = {(Y_{t-1}, Y_{t-2}, …, Y_{t-p}) ∈ R^p | t = p+1, …, N},
  compute
    HSCIC = ‖Σ̂^{(N−p+1)}_{Ÿ Ẍ_p | Y_p}‖²_HS   and   HSNCIC = ‖Ŵ^{(N−p+1)}_{Ÿ Ẍ_p | Y_p}‖²_HS.

Example: Coupled Hénon Map (3-19)
– X = (x_1, x_2) and Y = (y_1, y_2) with coupling strength γ:
    x_1(t+1) = 1.4 − x_1(t)² + 0.3 x_2(t)
    x_2(t+1) = x_1(t)
    y_1(t+1) = 1.4 − {γ x_1(t) y_1(t) + (1 − γ) y_1(t)²} + 0.1 y_2(t)
    y_2(t+1) = y_1(t)
– (Figure: scatter plots of x_1 vs x_2 and of x_1 vs y_1 for γ = 0, 0.25, 0.8.)

Causality of the Coupled Hénon Map (3-20)
– X is a cause of Y if γ > 0; Y is not a cause of X for all γ. The corresponding non-causality hypotheses are
    Y_t ⫫ X_{t-1}, …, X_{t-p} | Y_{t-1}, …, Y_{t-p}   and   X_t ⫫ Y_{t-1}, …, Y_{t-p} | X_{t-1}, …, X_{t-p}.
– Permutation tests for non-causality with HSNCIC = ‖Ŵ^{(N−p+1)}_{Ÿ Ẍ_p | Y_p}‖²_HS; N = 100, significance level α = 5%.
– (Figure: ratio of accepting non-causality, out of 100 experiments, as a function of γ ∈ [0, 0.6] for the directions X → Y and Y → X (non-causal), comparing the HSNCIC test with Granger's test; the X → Y panel marks the causal region γ > 0.)

Causal Inference from Non-experimental Data (3-21)
– Why is it possible? Not all edge directions are distinguishable: the structures X ← Z → Y, X → Z → Y, and X ← Z ← Y all encode X ⫫ Y | Z and give the same factorization,
    p(x | z) p(y | z) p(z) = p(x | z) p(z | y) p(y) = p(y | z) p(z | x) p(x) = p(x, y, z).
  The V-structure X → Z ← Y, in contrast, encodes a different independence pattern (X ⫫ Y but not X ⫫ Y | Z) and can therefore be identified.
– Constraint-based causal learning:
  • Determine the conditional independences of the underlying probability.
  • Markov assumption: the data are generated by a DAG.

Causal Learning (3-22)
– Inductive Causation (IC, Verma & Pearl 1990). Basic idea:
  • Make a list of all conditional independence / dependence relations among the variables.
  • Make an undirected graph under the Markov assumption.
  • Orient the edges by finding V-structures.
– PC algorithm (Peter Spirtes & Clark Glymour 1991):
  • An efficient implementation of IC.
  • Gaussian or discrete assumptions for the conditional independence tests.
– Kernel Causal Learning (KCL, Sun et al. ICML 2007):
  • Kernel test for conditional independence, for both continuous and discrete variables.
  • Edge directions are determined by voting.
  (A sketch of the shared skeleton step is given below.)
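To make the constraint-based idea concrete, here is a minimal sketch of the skeleton-discovery step shared by IC/PC-style learners. It is an illustration only, closer to plain IC than to PC (it searches over all small conditioning sets rather than restricting to current neighbours), and it is not the KCL algorithm of Sun et al.; the ci_test callback is an assumed interface that could, for instance, wrap the kernel permutation test sketched earlier.

```python
from itertools import combinations

def learn_skeleton(variables, ci_test, max_cond_size=2):
    """Skeleton phase of an IC/PC-style constraint-based causal learner.

    variables        : list of variable names
    ci_test(x, y, S) : returns True if x and y are judged conditionally
                       independent given the set of variables S (e.g. a kernel
                       permutation test at a fixed significance level)
    Returns the undirected skeleton (a set of frozenset edges) and the
    separating sets, which the orientation step uses to find V-structures.
    """
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    sepsets = {}
    for size in range(max_cond_size + 1):
        for x, y in combinations(variables, 2):
            if frozenset((x, y)) not in edges:
                continue                          # edge already removed
            others = [v for v in variables if v not in (x, y)]
            for S in combinations(others, size):
                if ci_test(x, y, set(S)):         # x independent of y given S: drop the edge
                    edges.discard(frozenset((x, y)))
                    sepsets[frozenset((x, y))] = set(S)
                    break
    return edges, sepsets
```

A V-structure x → z ← y would then be oriented for every non-adjacent pair (x, y) with a common neighbour z not contained in sepsets[{x, y}]; KCL instead determines edge directions by voting, as noted above.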
Experiment: Montana Economic Outlook Poll (3-23)
– Data: 7 discrete variables, N = 209: AGE (3 levels), SEX (2), INCOME (3), POLITICAL (3), AREA (3), FINANCIAL status (3: better/same/worse than a year ago), OUTLOOK (2).
– (Figure: graphs over the seven variables estimated by KCL, FCI, and BN-PC.)
– BN-PC is a constraint-based method using mutual information (Cheng et al. 2002). FCI is the Fast Causal Inference algorithm, which allows hidden variables (Spirtes et al. 1993).

Summary of Part 3 (3-24)
– Conditional independence with kernels:
  • The conditional covariance operator on an RKHS characterizes conditional independence.
  • Its HS norm, estimated from a finite sample, gives a kernel measure of conditional independence.
  • The kernel method gives a unified conditional independence test for continuous and discrete variables.
– Causal inference with kernels:
  • Kernel conditional independence tests are applied to causal inference, such as
    – causality of time series (an extension of Granger causality), and
    – causal inference from non-experimental data (constraint-based causal learning).

References (3-25)
Berlinet, A. and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
Cheng, J., R. Greiner, J. Kelly, D. A. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 137:43–90, 2002.
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20:489–496, 2008.
Fukumizu, K., F. Bach, and M. Jordan. Kernel dimension reduction in regression. Technical Report 715, Department of Statistics, University of California, Berkeley, 2006.
Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37:424–438, 1969.
Spirtes, P. and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9:62–72, 1991.
Spirtes, P., C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York, 1993.
Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 855–862, 2007.
Verma, T. and J. Pearl. Equivalence and synthesis of causal models. Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI 1990), pp. 220–227, 1990.
Pearl, J. Causality. Cambridge University Press, 2000.
Edwards, D. Introduction to Graphical Modelling. Springer-Verlag, New York, 2000.