Tutorial on Gaussian Processes
DAGS '07, 9/10/07
Jonathan Laserson and Ben Packer

Outline
Linear regression; Bayesian inference and its solution; Gaussian processes; the Gaussian process solution; kernels; implications.

Linear Regression
Task: predict y given x. We observe pairs (xᵢ, yᵢ), i = 1..M, with xᵢ ∈ Rᵈ, stacked as the design matrix X = [x₁ᵀ; …; x_Mᵀ] and target vector y = [y₁, …, y_M]ᵀ.
Model: y = wᵀx + ε with ε ~ N(0, ν²), i.e. y | x, w ~ N(wᵀx, ν²).
The maximum-likelihood estimate coincides with least squares:
  w_ML = argmax_w P(y | X, w) = w_LS = argmin_w Σᵢ (wᵀxᵢ − yᵢ)²
Prediction: y* = w_MLᵀ x*.

L2-Regularized Linear Regression
Same model, with a prior on the weights: w ~ N(0, σ²I). The MAP estimate coincides with regularized least squares:
  w_MAP = argmax_w P(y, w | X) = w_RLS = argmin_w Σᵢ (wᵀxᵢ − yᵢ)² + (ν²/σ²)‖w‖²
Prediction: y* = w_MAPᵀ x*.

Bayesian Instead of MAP
Instead of using w_MAP = argmax_w P(y, w | X) to predict y*, why not use the entire distribution P(y, w | X) to estimate P(y* | X, y, x*)?
We have P(y | w, X) and P(w).
Combine these to get P(y, w | X).
Marginalize to get P(y | X); the same computation gives P(y, y* | X, x*).
Condition the joint Gaussian to get P(y* | y, X, x*).

Bayesian Inference
We have P(y | w, X) and P(w):
  y | X, w ~ N(Xw, ν²I),   w ~ N(0, σ²I)
Combining these gives the joint P(w, y | X):
  (w, y) | X ~ N( [0; 0], [ σ²I, σ²Xᵀ; σ²X, ν²I + σ²XXᵀ ] )
Marginalizing out w gives P(y | X); applied to the training and test points together, the same form gives P(y, y* | X, x*).

Joint Gaussian → Conditional Gaussian
  y | X ~ N(0, ν²I + σ²XXᵀ)
Conditioning this joint Gaussian on the observed y yields
  y* | y, X, x* ~ N(μ_y*, Σ_y*)
Error bars!

Gaussian Process
We just obtained a distribution over y directly: y | X ~ N(0, ν²I + σ²XXᵀ). Why not start from here?
Instead of choosing a prior over w and defining f_w(x), put the prior over f directly.
Since y = f(x) + noise, this induces a prior over y.
Next: how to put a prior on f(x).

What Is a Gaussian Process?
A stochastic process is a collection of random variables f(x), indexed by x; it acts as a prior over functions.
It is specified by giving the joint probability of every finite subset of variables f(x₁), f(x₂), …, f(x_k), in a consistent way.
A Gaussian process is a stochastic process in which the joint distribution of f(x₁), …, f(x_k) is a multivariate Gaussian.
It is therefore enough to specify a mean function and a covariance function:
  μ(x) = E[f(x)]
  C(x, x') = E[(f(x) − μ(x))(f(x') − μ(x'))]
  (f(x₁), …, f(x_k)) ~ N([μ(x₁), …, μ(x_k)], K),   K_ij = C(xᵢ, xⱼ)
For simplicity, we assume μ(x) = 0.

Back to Linear Regression
Recall: we want to put a prior directly on f, and a Gaussian process lets us do this. How do we choose μ and C? Use our knowledge of the prior over w, w ~ N(0, σ²I), with f(x) = wᵀx:
  μ(x) = E[f(x)] = E[wᵀ]x = 0
  C(x, x') = E[(f(x) − μ(x))(f(x') − μ(x'))] = E[f(x)f(x')] = xᵀE[wwᵀ]x' = xᵀ(σ²I)x' = σ²xᵀx'
(We can also use basis functions, f(x) = wᵀΦ(x), which gives C(x, x') = σ²Φ(x)ᵀΦ(x').)

With μ(x) = 0 and C(x, x') = σ²xᵀx', we write f ~ GP(μ, C). It follows that
  f(x₁), …, f(x_k) ~ N(0, K),   y₁, …, y_k ~ N(0, ν²I + K),   K = σ²XXᵀ
This is the same solution we derived above for (regularized) least-squares linear regression. If we use a different C, we get a different K; the kernel examples after the sketch below show what the alternatives look like.
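To make the joint-to-conditional prediction step concrete, here is a minimal runnable sketch (not from the slides; the function names gp_predict and linear_kernel and the toy data are illustrative). It builds K from a kernel function, conditions the joint Gaussian on the training targets, and returns the predictive mean and error bars.

```python
import numpy as np

# Minimal GP-regression sketch (illustrative, assuming the linear kernel
# C(x, x') = sigma^2 x^T x' derived above and a zero-mean prior on f).

def linear_kernel(X1, X2, sigma2=1.0):
    """C(x, x') = sigma^2 * x^T x' evaluated for all pairs of rows."""
    return sigma2 * X1 @ X2.T

def gp_predict(X, y, Xstar, kernel, noise_var):
    """Posterior mean and standard deviation of y* | y, X, x*."""
    K = kernel(X, X)                              # covariance of training f
    Ks = kernel(Xstar, X)                         # test/train cross-covariance
    Kss = kernel(Xstar, Xstar)                    # covariance of test f
    A = K + noise_var * np.eye(len(X))            # nu^2 I + K
    alpha = np.linalg.solve(A, y)                 # (nu^2 I + K)^{-1} y
    mean = Ks @ alpha                             # predictive mean
    cov = Kss + noise_var * np.eye(len(Xstar)) - Ks @ np.linalg.solve(A, Ks.T)
    return mean, np.sqrt(np.diag(cov))            # mean and error bars

# Toy usage: 1-D inputs with noisy linear targets.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(20)
Xstar = np.linspace(-1, 1, 5).reshape(-1, 1)
mean, err = gp_predict(X, y, Xstar, linear_kernel, noise_var=0.01)
```

Swapping in one of the other covariance functions from the next section only changes the kernel argument; the conditioning step is unchanged.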
Kernels
If we use a different C, we will have a different K. What do functions drawn from these kernels look like? (The slides showed sample functions for each.)
  Linear:     C(x, x') = σ²xᵀx'
  Polynomial: C(x, x') = (1 + xᵀx')²
  Gaussian:   C(x, x') = exp{−½(x − x')²}

The predictive mean can be written as a weighted sum of kernel evaluations against the training inputs,
  ȳ* = Σᵢ αᵢ C(x*, xᵢ),   α = (K + ν²I)⁻¹ y,
so the data enter prediction only through the kernel.

Learning a Kernel
Parameterize a family of kernel functions by θ and learn K using the gradient of the likelihood:
  y | X, θ ~ N(0, ν²I + K_θ)
  log p(y | X, θ) = −½ log det(K_θ + ν²I) − ½ yᵀ(K_θ + ν²I)⁻¹ y − ½ n log(2π)
The gradient with respect to θ is obtained through ∂K_θ/∂θ.

GP Graphical Model
(The slide showed the graphical-model view of a GP as a starting point for extensions; figure not reproduced here.)

For details, see:
  Rasmussen's NIPS 2006 tutorial: http://www.kyb.mpg.de/bs/people/carl/gpnt06.pdf
  Williams's Gaussian Processes paper: http://www.dai.ed.ac.uk/homes/ckiw/postscript/hbtnn.ps.gz

Further topics: GPs for classification (approximation), sparse methods, connection to SVMs. Your thoughts…
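As a concrete illustration of the kernel-learning step, here is a hedged sketch (not the tutorial's code): it evaluates the log marginal likelihood above for a Gaussian kernel with a length-scale parameter and its gradient through ∂K/∂θ, then fits the length-scale by plain gradient ascent. The kernel parameterization, function names, and toy data are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: gradient ascent on
#   log p(y | X, theta) = -1/2 log det(K + nu^2 I)
#                         -1/2 y^T (K + nu^2 I)^{-1} y - n/2 log(2 pi)
# where theta is the length-scale of a Gaussian kernel.

def log_marginal_and_grad(X, y, lengthscale, noise_var):
    """Log marginal likelihood and its derivative w.r.t. the length-scale."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distances
    K = np.exp(-0.5 * d2 / lengthscale**2)                  # Gaussian kernel
    A = K + noise_var * np.eye(n)                           # K + nu^2 I
    A_inv = np.linalg.inv(A)
    alpha = A_inv @ y
    logml = (-0.5 * np.linalg.slogdet(A)[1]
             - 0.5 * y @ alpha
             - 0.5 * n * np.log(2 * np.pi))
    dK = K * d2 / lengthscale**3                            # dK / d(length-scale)
    # d log p / d theta = 1/2 tr((alpha alpha^T - A^{-1}) dK/dtheta)
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - A_inv) @ dK)
    return logml, grad

# Toy usage: fit the length-scale on noisy sinusoidal data.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(30)
lengthscale = 1.0
for _ in range(200):
    _, g = log_marginal_and_grad(X, y, lengthscale, noise_var=0.01)
    lengthscale = max(lengthscale + 0.01 * g, 1e-2)         # keep it positive
```

In practice one would optimize log θ with a quasi-Newton method and learn the noise variance as well, but the quantity driving the update is exactly the gradient the slide refers to.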