Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Lecture Slides for ETHEM ALPAYDIN © The MIT Press, 2010 [email protected] http://www.cmpe.boun.edu.tr/~ethem/i2ml2e Parametric Estimation X = { xt }t=1N where xt ~ p(x) Here x is one dimensional and the densities are univariate. Parametric estimation: Assume a form for p (x | θ) and estimate θ, its sufficient statistics, using X e.g., N ( μ, σ2) where θ = { μ, σ2} Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 3 Maximum Likelihood Estimation Likelihood of θ given the sample X l (θ|X ) p(X |θ) = ∏ t=1N p (xt|θ) Log likelihood L(θ|X) log l (θ|X) = ∑ t=1N log p (xt|θ) Maximum likelihood estimator (MLE) θ* = arg maxθ L(θ|X) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 4 Examples: Bernoulli Density Two states, failure/success, x in {0,1} P (x) = px (1 – p ) (1 – x) l (θ|X )= ∏ t=1N p (xt|θ) L(p|X) = log l (θ|X) = log ∏t pxt (1 – p ) (1 – xt) = (∑t xt ) log p+ (N – ∑t xt ) log (1 – p ) MLE: p = (∑t xt ) / N MLE: Maximum Likelihood Estimation Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 5 Examples: Multinomial Density K > 2 states, xi in {0,1} P (x1,x2,...,xK) = ∏i pixi L(p1,p2,...,pK|X) = log ∏t ∏i pixit where xit = 1 if experiment t chooses state i xit = 0 otherwise MLE: pi = (∑t xit ) / N PS. K: mutually exclusive Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 6 Gaussian (Normal) Distribution p(x) = N ( μ, σ2) x 2 1 p x exp 2 2 2 x 2 1 p x exp 2 2 2 Given a sample X = { xt }t=1N with xt ~ μ σ N ( μ, σ2) , the log likelihood of a Gaussian sample is N , | x log( 2 ) N log L x t 2 t 2 2 2 (Why? N samples?) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 7 Gaussian (Normal) Distribution MLE for μ and σ2: x 2 1 p x exp 2 2 2 m s2 μ x t t N x t m 2 t N σ ( Exercise!) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 8 Bias and Variance Unknown parameter θ Estimator di = d (Xi) on sample Xi Bias: bθ(d) = E [d] – θ Variance: E [(d–E [d])2] θ If bθ(d) = 0 d is an unbiased estimator of θ If E [(d–E [d])2] = 0 d is a consistent estimator of θ Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 9 10 Expected value • If the probability distribution of X admits a probability density function f (x), then the expected value can be computed as • It follows directly from the discrete case definition that if X is a constant random variable, i.e. X = b for some fixed real number b, then the expected value of X is also b. • The expected value of an arbitrary function of X, g(X), with respect to the probability density function f(x) is given by the inner product of f and g: http://en.wikipedia.org/wiki/Expected_value Bias and Variance For example: t xt 1 E[m] E N N E[ xt ] t t xt 1 Var[m] Var 2 N N N N N 2 2 t Var[ x ] N 2 N t Var [m] 0 as N∞ m is also a consistent estimator Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 11 Bias and Variance For example: (see P. 65-66) N 1 2 2 E (s ) N s2 is a biased estimator of σ2 (N/(N-1))s2 is a unbiased estimator of σ2 2 Mean square error: r (d,θ) = E [(d–θ)2] (see P. 66, next slide) = (E [d] – θ)2 + E [(d–E [d])2] = Bias2 + Variance Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) θ 12 Bias and Variance Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 13 14 Standard Deviation • In statistics, the standard deviation is often estimated from a random sample drawn from the population. • The most common measure used is the sample standard deviation, which is defined by where variable X) and is the sample (formally, realizations from a random is the sample mean. http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation Bayes’ Estimator Treat θ as a random variable with prior p(θ) Bayes’ rule: p( X | ) p( ) p( | X ) p( X ) Maximum a Posteriori (MAP): θMAP = arg maxθ p(θ|X) Maximum Likelihood (ML): θML = arg maxθ p(X|θ) Bayes’ estimator: θBayes = E[θ|X] = ∫ θ p(θ|X) dθ Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 15 MAP VS ML If p(θ) is an uniform distribution then θMAP = arg maxθ p(θ|X) = arg maxθ p(X|θ) p(θ) / p(X) = arg maxθ p(X|θ) = θML θMAP = θML where p(θ) / p(X) is a constant Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 16 Bayes’ Estimator: Example If p (θ|X) is normal, then θML = m and θBayes =θMAP Example: Suppose xt ~ N (θ, σ2) and θ ~ N ( μ0, σ02) N p X | 1 (2 ) N /2 N exp[ t 2 ( x ) t 1 2 2 ] θML = m 0 2 1 p exp 2 2 0 2 0 θBayes = 1/ 02 N / 2 E |X m 0 2 2 2 2 N / 1/ 0 N / 1/ 0 The Bayes’ estimator is a weighted average of the prior mean μ0 and the sample mean m. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 17 Parametric Classification The discriminant function g i x px | Ci PCi or equivalent ly g i x log px | Ci log PCi Assume that px | Ci are Gaussian x i 2 1 p x | Ci exp 2 2i 2i x i log P C 1 gi x log 2 log i i 2 2 2i 2 log likelihood of a Gaussian sample Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 18 t t N Given the sample X {x ,r }t 1 1 t ri 0 x if x t Ci if x t C j , j i Maximum Likelihood (ML) estimates are Pˆ Ci ri t N t , mi x r t t i t ri t , si2 x t t 2 mi rit t t r i t Discriminant becomes x mi log P̂ C 1 gi x log 2 log si i 2 2 2si 2 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 19 The first term is a constant and if the priors are equal, those terms can be dropped. Assume that variances are equal, then x mi 1 ˆ C gi x log 2 log si log P i 2 2si 2 2 becomes gi x x mi 2 | x mk | Choose Ci if | x mi | min k Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 20 Equal variances Single boundary at halfway between means Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 21 Variances are different Two boundaries Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 22 Exercise For a two-class problem, generate normal samples for two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the theoretical values. PS. You can use normal sample generation tools. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 23 Regression r f x estimator : g x| ~N 0, 2 p r|x ~N g x| , 2 p r|x : the probability of the output given the input Given an sample X = the log likelihood is N t t { xt , rt}t=1N, L |X log p x , r t 1 N t Regression assumes 0 mean Gaussian noise added to the model; Here, the model is linear. N log p r |x log p x t t 1 t t 1 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 24 Regression: From LogL to Error Ignore the second term: (because it does not depend on our estimator) N L |X log t 1 r t g x t | 2 1 exp 2 2 2 1 N t N log 2 2 r g x t | 2 t 1 2 Maximizing this is equivalent to minimizing 1 N t E |X r g x t | 2 t 1 2 Least squares estimate Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 25 Example: Linear Regression Let g x t |w1 , w0 w1 x t w0 1 N t Minimize E |X r g x t | 2 t 1 r We can obtain t t r x t and t N A t x t x t Nw0 w1 x t t t (Exercise!!) w0 x w1 x t t t t rt w0 t , w , y 2 rt xt w1 xt t 2 t t 2 Aw = y w A 1 y Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 26 Example: Polynomial Regression g x |wk ,..., w2 , w1 , w0 wk x t k t ... w2 x t 2 w1 x t w0 Aw = y 1 x 1 2 1 x Let D : N 1 x x 2 x1 2 2 x 2 N ... x ... r1 k 2 2 , r r : N 2 r N x x1 ... k We can obtain A D D , y D r , and w D D T T T 1 DT r (see page 75) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 27 Other Error Measures Square Error: 1 N t E |X r g x t | 2 t 1 2 N Relative Square Error: E |X r t t 1 N r t 1 Absolute Error: g x | t 2 2 t r E (θ|X) = ∑t |rt – g(xt|θ)| ε-sensitive Error: E (θ|X) = ∑ t 1(|rt – g(xt|θ)|>ε) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 28 Bias and Variance The expected square error at a particular point x wrt to a fixed g(x) and variations in r based on p(r|x): E[(r - g(x)) |x]= E[(r -E[r|x]) |x]+(E[r | x]- g(x)) 2 Estimate for the error at point x 2 2 noise (See Eq. 4.17) squared error Now let’s note that g(.) is a random variable (function) of samples S: ES [E[(r - g(x)) | x]]= E[(r -E[r | x]) | x]+ES [(E[r | x]- g(x)) ] 2 Expectation of our estimate for the error at point x (wrt sample variation) 2 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 2 29 Bias and Variance Now let’s note that g(.) is a random variable (function) of samples S: ES [E[(r - g(x))2 | x]]= E[(r -E[r | x])2 | x]+ES [(E[r | x]- g(x))2 ] The expected value (average over samples X, all of size N and drawn from the same joint density p(x, r)) : (See Eq. 4.11) squared error ES [(E[r | x]- g(x))2 | x]= (See page 66 and 76) (E[r | x]- ES [g(x)])2 + ES [( g(x)- ES [g(x)])2 ] bias2 variance squared error = bais2+variance Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 31 Estimating Bias and Variance Samples Xi={xti , rti}, i =1,...,M, t = 1,2,…,N are used to fit gi (x), i =1,...,M 1 g x gi x M i 2 1 2 t t Bias g g x f x N t 2 1 t t Variance g gi x g x NM t i θ Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 32 Bias/Variance Dilemma Examples: gi (x) = 2 has no variance and high bias gi (x) = ∑t rti/N has lower bias with variance As we increase model complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data) Bias/Variance dilemma (Geman et al., 1992) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 33 f f ( x) 2 sin( 1.5 x) f bias gi g variance Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 34 Polynomial Regression Best fit “min error” underfitting overfitting Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 35 Best fit, Fig. 4.7 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 36 Model Selection (1) Cross-validation: Measure generalization accuracy by testing on data unused during training To find the optimal complexity Regularization (調整): Penalize complex models E’ = error on data + λ model complexity Structural risk minimization (SRM): To find the model simplest in terms of order and best in terms of empirical error on the data Model complexity measure: polynomials of increasing order, VC dimension, ... Minimum description length (MDL): Kolmogorov complexity of a data set is defined as the shortest description of data Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 37 Model Selection (2) Bayesian Model Selection: Prior on models, p(model) p data | model p model p model | data p data Discussions: When the prior is chosen such that we give higher probabilities to simpler models, the Bayesian approach, regularization, SRM, and MDL are equivalent. Cross-validation is the best approach if there is a large enough validation dataset. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 38 Regression example Coefficients increase in magnitude as order increases: 1: [-0.0769, 0.0016] 2: [0.1682, -0.6657, 0.0080] 3: [0.4238, -2.5778, 3.4675, -0.0002 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019] Please compare with Fig. 4.5, p. 78. regularization : E w|X 2 1 t r g x t |w i wi2 2 t 1 N Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 39