CHAPTER 4: Parametric Methods

Parametric Methods
A statistic is any value that is calculated from a given sample. In statistical inference, we make a decision using the information provided by a sample. Our first approach is parametric, where we assume that the sample is drawn from some distribution that obeys a known model, for example, Gaussian. The advantage of the parametric approach is that the model is defined with a small number of parameters (for example, mean and variance), the sufficient statistics of the distribution. Once those parameters are estimated from the sample, the whole distribution is known.

We estimate the parameters of the distribution from the given sample, plug these estimates into the assumed model, and get an estimated distribution, which we then use to make a decision. The method we use to estimate the parameters of a distribution is maximum likelihood estimation. We also introduce Bayesian estimation, which we will continue discussing in chapter 14.

Parametric Estimation
We have an independent and identically distributed (iid) sample X = {x^t}, t = 1, ..., N. We assume that the x^t are instances drawn from some known probability density family p(x|θ), defined with parameters θ, such that x^t ~ p(x|θ). We want to find the θ that makes sampling x^t from p(x|θ) as likely as possible. For example, we find θ = {μ, σ²} for the normal (Gaussian) distribution N(μ, σ²).

Normal Distribution
X is normal or Gaussian distributed with mean μ and variance σ², denoted as N(μ, σ²), if its density function is

p(x) = 1 / (√(2π) σ) · exp[ −(x − μ)² / (2σ²) ]

The parameter μ in this definition is the mean or expectation of the distribution (and also its median and mode). The parameter σ is its standard deviation; its variance is therefore σ². A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. If μ = 0 and σ = 1, the distribution is called the standard normal (or unit normal) distribution, and a random variable with that distribution is a standard normal deviate.

In probability theory, the normal (or Gaussian) distribution is a very commonly occurring continuous probability distribution: a function that tells the probability that an observation in some context will fall between any two real numbers. For example, the distribution of grades on a test administered to many people is approximately normal. Normal distributions are extremely important in statistics and are often used in the natural and social sciences for real-valued random variables whose distributions are not known. The normal distribution is immensely useful because of the central limit theorem, which states that, under mild conditions, the mean of many random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution: physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have a distribution very close to the normal. http://www.mathsisfun.com/data/standard-normal-distribution.html
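To make the density formula and the central limit theorem concrete, here is a minimal NumPy sketch; the function name normal_pdf, the uniform source distribution, and all numeric values are illustrative assumptions, not from the text. It evaluates N(μ, σ²) directly from the formula above and checks that means of uniform samples are approximately normally distributed.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density N(mu, sigma^2), written directly from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)

# Central limit theorem, informally: means of N uniform draws (a clearly
# non-Gaussian source) are themselves approximately normally distributed.
N = 50                                   # sample size per mean (arbitrary choice)
means = rng.uniform(0, 1, size=(10000, N)).mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sample mean should be
# roughly N(1/2, 1/(12*N)).
print("empirical mean/std of the means:", means.mean(), means.std())
print("CLT prediction:                 ", 0.5, np.sqrt(1 / (12 * N)))
print("density at the center:          ", normal_pdf(0.5, 0.5, np.sqrt(1 / (12 * N))))
```

Histogramming the computed means would show the familiar bell shape even though each individual draw is uniform.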
Maximum Likelihood Estimation
Given an independent and identically distributed (iid) sample X, the likelihood of the parameter θ is the product of the likelihoods of the individual points. In maximum likelihood estimation, we are interested in finding the θ that makes X the most likely to be drawn. We thus search for the θ that maximizes the likelihood, which we denote by l(θ|X):

l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

We can maximize the log of the likelihood without changing the value where it takes its maximum. log(·) converts the product into a sum and leads to further computational simplification when certain densities are assumed, for example, those containing exponents. The log likelihood is defined as

L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

The maximum likelihood estimator (MLE) is

θ* = argmax_θ L(θ|X)

Maximum Likelihood Estimators
Let us consider some distributions that arise in the applications we are interested in. If we have a two-class problem, the distribution we use is Bernoulli. When there are K > 2 classes, its generalization is the multinomial. The Gaussian (normal) density is the one most frequently used for modeling class-conditional input densities with numeric inputs. For these three distributions, we discuss the maximum likelihood estimators (MLE) of their parameters.

Bernoulli Density
In a Bernoulli distribution there are two outcomes: an event occurs or it does not. The Bernoulli random variable X takes the value 1 with probability p₀, and the nonoccurrence of the event has probability 1 − p₀, which is denoted by X taking the value 0. This is written as

P(x) = p₀^x (1 − p₀)^(1 − x), x ∈ {0, 1}

The expected value and variance are E[X] = p₀ and Var(X) = p₀(1 − p₀). Given the sample, the log likelihood is

L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1 − x^t)

and the MLE that maximizes it is the sample average

p₀ = ∑_t x^t / N

Multinomial Density
The multinomial is the generalization of the Bernoulli to K > 2 states. The outcome is one of K mutually exclusive states, each occurring with probability p_i, and we use indicator variables x_i ∈ {0, 1}, where x_i = 1 if the outcome is state i and the remaining indicators are 0:

P(x₁, x₂, ..., x_K) = ∏_i p_i^(x_i)

L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)

MLE: p_i = ∑_t x_i^t / N

Gaussian (Normal) Density
p(x) = 1 / (√(2π) σ) · exp[ −(x − μ)² / (2σ²) ], that is, p(x) = N(μ, σ²)

The MLEs for μ and σ² are the sample mean and the sample variance:

m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N

Evaluating an Estimator: Bias and Variance
Let X be a sample from a population specified up to a parameter θ, and let d = d(X) be an estimator of θ. To evaluate the quality of this estimator, we can measure how much it differs from θ, that is, (d(X) − θ)². But since this is a random variable (it depends on the sample), we need to average it over possible X and consider r(d, θ), the mean square error of the estimator d, defined as

r(d, θ) = E[(d(X) − θ)²]

The bias of an estimator is given as

b_θ(d) = E[d(X)] − θ

If b_θ(d) = 0 for all θ values, then we say that d is an unbiased estimator of θ. Writing the unknown parameter as θ and the estimator on sample X_i as d_i = d(X_i), we have

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance

The first term, the squared bias, measures how much the expected value of the estimator differs from the correct value θ, and the second term, the variance, measures how much, on average, the estimates d_i vary around that expected value (going from one dataset to another).
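The MLE formulas above are one-liners in code. The following sketch uses synthetic data with arbitrary parameter values (all names and numbers are illustrative assumptions): it computes the Bernoulli and Gaussian MLEs and then checks empirically that the variance MLE s² is biased for small N, connecting the estimators to the bias/variance discussion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli MLE: p0_hat = sum_t x^t / N
x_bern = rng.binomial(1, p=0.7, size=1000)
p0_hat = x_bern.mean()

# Gaussian MLE: m = sum_t x^t / N, s^2 = sum_t (x^t - m)^2 / N
x_gauss = rng.normal(loc=2.0, scale=3.0, size=1000)
m = x_gauss.mean()
s2 = ((x_gauss - m) ** 2).mean()        # note: divides by N, not N - 1

print("Bernoulli p0 MLE:", p0_hat)
print("Gaussian m, s^2 MLE:", m, s2)

# Bias of the variance estimator s^2: average it over many small samples from
# N(0, 1); its expected value is (N-1)/N * sigma^2 < sigma^2, so the MLE of the
# variance is biased, although the bias vanishes as N grows.
N = 5
samples = rng.normal(0.0, 1.0, size=(100000, N))
s2_many = samples.var(axis=1)           # np.var divides by N by default, i.e. the MLE
print("E[s^2] over many samples:", s2_many.mean(),
      " (true sigma^2 = 1, (N-1)/N =", (N - 1) / N, ")")
```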
Bayes' Estimator
Sometimes, before looking at a sample, we (or experts of the application) may have some prior information on the possible value range that a parameter θ may take. This information is quite useful and should be used, especially when the sample is small. The prior information does not tell us exactly what the parameter value is (otherwise we would not need the sample), so we model this uncertainty by viewing θ as a random variable and defining a prior density for it, p(θ).

The prior density p(θ) tells us the likely values that θ may take before looking at the sample. We combine this with what the sample data tells us, namely the likelihood density p(X|θ), using Bayes' rule, and get the posterior density of θ, which tells us the likely θ values after looking at the sample:

p(θ|X) = p(X|θ) p(θ) / p(X)

When estimating the density at a new point x, note that p(x|θ, X) = p(x|θ), because once we know θ, the sufficient statistics, we know everything about the distribution. Thus we take an average over predictions using all values of θ, weighted by their probabilities:

p(x|X) = ∫ p(x|θ) p(θ|X) dθ

If we are doing a prediction in the form y = g(x|θ), as in regression, then we have

y = ∫ g(x|θ) p(θ|X) dθ

Maximum A Posteriori (MAP) Estimate
Evaluating these integrals may be quite difficult, except in cases where the posterior has a nice form. When the full integration is not feasible, we reduce it to a single point. If we can assume that p(θ|X) has a narrow peak around its mode, then using the maximum a posteriori (MAP) estimate makes the calculation easier:

θ_MAP = argmax_θ p(θ|X)

By replacing the whole density with a single point and getting rid of the integral, we can estimate the density at x with p(x|X) ≈ p(x|θ_MAP).

Maximum Likelihood Estimate
If we have no prior reason (information) to favor some values of θ, then the prior density is flat, the posterior has the same form as the likelihood p(X|θ), and the MAP estimate is equivalent to the maximum likelihood estimate (section 4.2), where we have

θ_ML = argmax_θ p(X|θ)

Bayes' Estimator
Another possibility is the Bayes' estimator, which is defined as the expected value of the posterior density. The reason for taking the expected value is that the best estimate of a random variable is its mean. In the case of a normal density, the mode is the expected value, so if p(θ|X) is normal, then θ_Bayes = θ_MAP.

To summarize, we treat θ as a random variable with prior p(θ):

Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes': θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ

Bayes' Estimator: Example
Suppose x^t ~ N(θ, σ²) with σ² known, and the prior is θ ~ N(μ₀, σ₀²). Then

θ_ML = m
θ_MAP = θ_Bayes = E[θ|X] = [ (N/σ²) / (N/σ² + 1/σ₀²) ] m + [ (1/σ₀²) / (N/σ² + 1/σ₀²) ] μ₀

Thus the Bayes' estimator is a weighted average of the prior mean μ₀ and the sample mean m, with weights inversely proportional to their variances. As the sample size N increases, the Bayes' estimator gets closer to the sample average, making more use of the information provided by the sample. When σ₀ is small, that is, when we have little prior uncertainty regarding the correct value of θ, or when N is small, our prior guess μ₀ has a higher effect.
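As a check on the weighted-average formula, here is a short sketch in which the prior, noise level, and true parameter are made-up illustrative values: it computes θ_ML and θ_Bayes for samples of increasing size and shows the estimate moving from the prior mean toward the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# Setup from the example: x^t ~ N(theta, sigma^2) with sigma known,
# prior theta ~ N(mu0, sigma0^2). All values below are illustrative.
sigma, mu0, sigma0 = 2.0, 0.0, 1.0
theta_true = 3.0

for N in (1, 10, 1000):
    x = rng.normal(theta_true, sigma, size=N)
    m = x.mean()                                    # theta_ML

    # Posterior mean = weighted average of m and mu0, with weights inversely
    # proportional to the variances; since the posterior is normal, this is
    # also the MAP estimate.
    w_data = N / sigma ** 2
    w_prior = 1 / sigma0 ** 2
    theta_bayes = (w_data * m + w_prior * mu0) / (w_data + w_prior)

    print(f"N={N:5d}  theta_ML={m: .3f}  theta_Bayes={theta_bayes: .3f}")
```

With N = 1 the estimate is pulled strongly toward μ₀; with N = 1000 it essentially equals the sample mean, as the text describes.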
Parametric Classification
We saw in chapter 3 that, using Bayes' rule, we can write the posterior probability of class C_i as

P(C_i|x) = p(x|C_i) P(C_i) / p(x)

We can use the numerator as the discriminant function, or take its log:

g_i(x) = p(x|C_i) P(C_i)   or   g_i(x) = log p(x|C_i) + log P(C_i)

If we can assume that the p(x|C_i) are Gaussian,

p(x|C_i) = 1 / (√(2π) σ_i) · exp[ −(x − μ_i)² / (2σ_i²) ]

then the log discriminant becomes

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

Example
Assume we are a car company selling K different cars, and for simplicity, let us say that the sole factor that affects a customer's choice is his or her yearly income, which we denote by x. Then P(C_i) is the proportion of customers who buy car type i. If the yearly income distributions of such customers can be approximated with a Gaussian, then p(x|C_i), the probability that a customer who bought car type i has income x, can be taken as N(μ_i, σ_i²), where μ_i is the mean income of such customers and σ_i² is their income variance.

Estimates for the Means and Variances
When we do not know P(C_i) and p(x|C_i), we estimate them from a sample and plug in their estimates to get the estimate for the discriminant function. Given a sample X = {x^t, r^t}, t = 1, ..., N, with labels encoded as

r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i

the ML estimates for the means and variances, computed for each class separately, are

m_i = ∑_t r_i^t x^t / ∑_t r_i^t
s_i² = ∑_t r_i^t (x^t − m_i)² / ∑_t r_i^t

Priors and Discriminant Function
The estimates for the priors are (relying on equation 4.6)

P̂(C_i) = ∑_t r_i^t / N

Plugging these estimates into equation 4.22, we get the discriminant function:

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)

When the priors and the variances are equal, this rule amounts to assigning x to the class with the nearest mean.

Figure 4.2: (a) Likelihood functions and (b) posteriors with equal priors for two classes when the input is one-dimensional. The variances are equal, so the posteriors intersect at one point, which is the threshold of decision: a single boundary halfway between the means.

Figure 4.3: (a) Likelihood functions and (b) posteriors with equal priors for two classes when the input is one-dimensional. The variances are unequal, so the posteriors intersect at two points, giving two decision boundaries. In (c), the expected risks are shown for the two classes and for reject with λ = 0.2 (section 3.3).
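Putting the whole recipe together, the sketch below estimates m_i, s_i², and P̂(C_i) from a labeled one-dimensional sample and evaluates the plug-in discriminant g_i(x) to pick a class. The function names, the two-class income data, and all numeric values are invented for illustration; they are not from the text.

```python
import numpy as np

def fit_gaussian_classes(x, r):
    """ML estimates m_i, s_i^2 and priors P_hat(C_i) from a labeled 1-D sample.
    x: shape (N,) inputs; r: shape (N, K) one-hot labels r_i^t."""
    n_i = r.sum(axis=0)                             # sum_t r_i^t
    m = (r * x[:, None]).sum(axis=0) / n_i          # per-class means
    s2 = (r * (x[:, None] - m) ** 2).sum(axis=0) / n_i
    priors = n_i / len(x)
    return m, s2, priors

def discriminant(x, m, s2, priors):
    """g_i(x) = -0.5*log(2*pi) - log(s_i) - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)."""
    x = np.atleast_1d(x)[:, None]
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
            - (x - m) ** 2 / (2 * s2) + np.log(priors))

# Toy version of the car example: yearly incomes of customers of K = 2 car types.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(30, 5, 200), rng.normal(60, 10, 100)])
r = np.zeros((300, 2)); r[:200, 0] = 1; r[200:, 1] = 1

m, s2, priors = fit_gaussian_classes(x, r)
g = discriminant([35.0, 50.0, 70.0], m, s2, priors)
print("means:", m, "variances:", s2, "priors:", priors)
print("chosen classes:", g.argmax(axis=1))          # pick the class with the largest g_i(x)
```

Because the two classes here have unequal variances, the implied decision regions correspond to the two-boundary situation sketched in Figure 4.3.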