Review of Bayesian and Frequentist Statistics
Bertrand Clarke
Department of Medicine, University of Miami
NDU 2011

Outline
1. Basic Definitions
2. Models of Variability: Survey Sampling; Frequentist; Bayesian; Other Schools
3. The Normal Example
4. Sufficiency and Exponential Families
5. Main Frequentist Estimators: Method of Moments; MLE's; UMVUE's; Testing
6. Bayesian Estimation: Conjugate Priors; Decision Theory; Testing

Basic Definitions

There are several schools of thought in Statistics. They differ primarily in how they treat variability. To explain them we need some definitions.

Population: the collection of all outcomes, real or imagined, to which one wants conclusions to apply. May be natural or artificial; must be precise.

Examples: all people born on a Tuesday since 1950; all left-handed vegetarians employed at NDU (provided "vegetarian" is precisely defined); all motorized vehicles registered in Lebanon; all Lebanese residents over age 65 as of 1 July 2011; all runs of a specific experiment that might be performed; all strings of zeros and ones.

Population vs sample

A lot of work goes into defining a population accurately. A sample is a subset of size n from the population. Samples let us make inferences about the population they represent. We want the population we sampled to be the same as the population we want to study. A frame (if it exists) is the set from which we draw our samples. If we take a sample of businesses that have webpages, this will not be the same as the population of all businesses. Even if a sample is drawn from the correct population, it need not be representative. Ideal: a representative sample of the population of interest.

Representative samples

Not always possible. Ideal case: the selection from the population is made by simple random sampling. A sample of size n is taken and (i) each element of the population has the same chance of inclusion in the sample, and (ii) the selection of one element for the sample is unaffected by the selection of any other element. This means all individuals and all samples are probabilistically equivalent, in the sense that they are the output of the same sampling process. Abbreviated: IID. Sometimes samples are dependent or not identically distributed, and we must model this.
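To see what simple random sampling does in practice, here is a minimal sketch (an invented, labelled population of N = 20 units): it draws many SRS samples of size n and checks that every unit is included about equally often, i.e., with probability n/N.

import random

# A small, made-up finite population (purely illustrative).
population = list(range(20))   # label the N = 20 members 0, ..., 19
n = 5                          # sample size
reps = 20000

inclusion_counts = {unit: 0 for unit in population}
for _ in range(reps):
    sample = random.sample(population, n)   # SRS without replacement
    for unit in sample:
        inclusion_counts[unit] += 1

# Under SRS each unit should be included with probability n/N = 0.25.
for unit in population[:5]:
    print(unit, round(inclusion_counts[unit] / reps, 3))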
Random Variables

Random variable: the process by which a measurement is generated, X. Outcome: the measurement generated, X = x. The process of obtaining a measurement of, say, the size of a tumor is the random variable X; the measurement is X = .5 cm. A random variable has an associated probability, P. Thus P(X ∈ A) is the probability that the random variable gives us a value in the set A. We might consider the probability that a tumor of diameter at least .5 cm grows; this is P(X ≥ .5). A specific outcome does not have a probability.

Technical Point

The way I defined a RV is informal. Here's the real definition: X : (𝒳, F) → (ℝ, B(ℝ)), where X is measurable, i.e., ∀B ∈ B(ℝ), X⁻¹(B) ∈ F. We assign a probability PR to the observation space (the range) and pull it back to give a probability on the underlying measure space. Thus PD(X⁻¹(A)) = PR(A) for A ∈ B(ℝ), and

PR(A) = PD(X ∈ A) = PD(X⁻¹(A)) = PD({ω ∈ 𝒳 : X(ω) ∈ A}),

and we usually drop the subscripts on P for convenience. We never see (𝒳, F) and we don't know much about it; we more or less assume it works out OK, and there are theorems guaranteeing it has the properties we want.

Parameters

We often get n outcomes x1, ..., xn of a random variable X. We may denote the n draws of X by X1, ..., Xn. The collection of outcomes/measurements is the sample. The population is fixed, but we consider different descriptions of it. A description of a population is given by a probability on it. We don't know the correct P, but we may have a collection of probabilities 𝒫 that we are sure contains the true one. Often 𝒫 = {Pθ : θ ∈ Θ}, Θ ⊂ ℝ^d; θ is a parameter. We use a function of x1, ..., xn to estimate θ. Write θ̂ = θ̂(x^n), where x^n = (x1, ..., xn). We write θ̂(X^n), where X^n = (X1, ..., Xn), when we want to emphasize that θ̂ can be regarded as a RV in its own right.

Inference

The basic problem of inference is to use the data, i.e., the sample, to get an estimate θ̂ of θ_true, i.e., to identify the correct description of the population. Sometimes the parameter means something, e.g., height of people; sometimes a parameter is just an index for a collection of probabilities. Here we won't usually make a difference between a probability, a density, and a distribution function, since the parameter would identify any of them. It is not enough to announce θ̂; we want a description of how θ̂ varies. So we must understand models of variability. There are 3 major ones and several minor ones.

Models of Variability

Survey Sampling

The population is finite, of size N. The only randomness is in which sample is chosen. An individual, once chosen, generates a measurement with no ambiguity. The X is the selection of an individual from the population. There are (N choose n) possible samples of size n, and when we get one we use it to get a point estimate, i.e., a θ̂. If we take, say, a mean, then X̄ has a distribution generated by considering the possible samples of size n. So E(X̄) is the sum of the possible values of X̄, weighted by the probability of choosing a sample that gives that value. (A small enumeration of this is sketched below.)
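Here is the enumeration promised above, a minimal sketch with an invented population of N = 6 values: list all (N choose n) equally likely samples, compute x̄ for each, and check that the average of the x̄'s equals the population mean.

from itertools import combinations
from statistics import mean

# Toy finite population (values invented for illustration).
population = [3, 7, 8, 12, 15, 21]
N, n = len(population), 3

# All (N choose n) equally likely samples under simple random sampling.
sample_means = [mean(s) for s in combinations(population, n)]

print("number of samples:", len(sample_means))          # C(6, 3) = 20
print("E(xbar) over samples:", round(mean(sample_means), 4))
print("population mean:     ", round(mean(population), 4))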
Variability

We can also find Var(X̄) = E(X̄ − E(X̄))². Usually we must have n << N for decent inference. In this case we might get a confidence interval of the form x̄ ± z_{1−α/2} · FPC · s/√n. This means that 100(1 − α)% of the samples of size n that we might get will give an interval of the form x̄ ± z_{1−α/2} · FPC · s/√n that contains θ = E(X). The FPC is called the finite population correction. Note that the variability is in the sample chosen, and we imagine the result of choosing all samples of size n. This is Frequentist in that we consider the effect of repeated samples of size n and invoke the Frequency interpretation of probability.

Frequentist

Rests on the Frequentist interpretation of probability. That is, the probability of an event A (such as tossing a coin and getting tails) is the limit

P(A) = lim_{n→∞} (# times we observed A) / (# times we looked).

Not a formal limit (given an ϵ you can't find an n). Given Pθ and n copies of X we form confidence regions. A 1 − α confidence region is a random set R(X^n) with the property that Pθ(θ ∈ R(X^n)) = 1 − α. Note that we have one Pθ for each Xi and another P_θ^n for X^n formed from n copies of Pθ, but we don't bother to distinguish between them.

Confidence Intervals

The question is how to find CRs. The usual approach is to form an interval. Suppose θ is a mean, θ = E(X). Then, if σ is known, as n → ∞ we can show

Pθ( z_{α/2} σ ≤ √n(X̄ − θ) ≤ z_{1−α/2} σ ) → 1 − α.

That is, R(X^n) = {θ : z_{α/2} σ ≤ √n(X̄ − θ) ≤ z_{1−α/2} σ} is an asymptotic 1 − α CI, and we have one outcome of it (from the n outcomes of X). Frequentist prediction comes from P_{θ̂}(·), i.e., A is a 1 − α prediction region if P_{θ̂}(X_{n+1} ∈ A) = 1 − α. (Not exact, because it neglects variation in θ̂.)

Confidence

Interpretation: x̄ ± z_{1−α/2} σ/√n is an interval produced by a technique that ensures 100(1 − α)% of intervals so formed will contain θ. Confidence is a property of the process of producing the interval, not of the numerical interval itself. (The simulation sketched below makes this concrete.) The distribution of θ̂ = X̄ is called the sampling distribution. It is the central object of Frequentist inference. Statements about where a parameter lies retain the randomness of the data generating mechanism: it's as if we never forget that the sample we got came from a RV. The outcome x is what we see; the X is like the process by which we got the outcome.
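The coverage interpretation can be checked directly: the sketch below (population parameters invented) repeatedly draws samples of size n from a normal population, forms x̄ ± z_{1−α/2} σ/√n each time, and reports the fraction of intervals that contain the true mean.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, alpha, reps = 10.0, 2.0, 25, 0.05, 10000
z = 1.959964                       # z_{1 - alpha/2} for alpha = 0.05

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    xbar = x.mean()
    half = z * sigma / np.sqrt(n)  # known-sigma interval half-width
    covered += (xbar - half <= mu <= xbar + half)

print("empirical coverage:", covered / reps)   # should be close to 0.95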
Standard Error

A Frequentist distinguishes between the SD and the SE. An SD is the σ for a single outcome of a RV X; this is a property of the population distribution. An SE is the σ for a function of n outcomes of a RV X; this is a property of the sampling distribution. For one X, Var(X) = σ²: the X has a distribution with a density curve and we find the SD. For IID X1, ..., Xn, Var(X̄) = σ²/n, and this is taken in the sampling distribution of X̄, which is derived from the distribution of X but will be much more peaked around µ. For independent, non-identically distributed X1, ..., Xn, Var(X̄) = (1/n²) Σ_{i=1}^n σi², and for dependent variables all bets are off.

In Practice

A Frequentist chooses a class of densities f(x|θ) integrating to 1 for each θ; f(x^n|θ) = f(x1|θ) · · · f(xn|θ). The MLE is a standard estimator: θ̂ = arg max_θ f(x^n|θ). Many choices of f(·|θ) have a sufficient statistic: Poisson, Binomial, normal, exponential, ... Definition of sufficiency: T(X) is sufficient for θ in f(x|θ) ⇐⇒ inference on θ only depends on T. Sufficient statistics contain all the information about θ in the data, so functions of them are good estimators.

Other Frequentist Techniques

A statistic T is unbiased for θ if and only if Eθ T(X^n) = θ. CRLB for unbiased statistics: Varθ(T(X^n)) ≥ 1/I_n(θ). UMVUE: any statistic that achieves the CRLB for all θ in an interval. It turns out that UMVUE's are unique and can be given as functions of sufficient statistics. Given X1, ..., Xn, put them in order from smallest to largest: X_(1), ..., X_(n). L-statistics: linear combinations of order statistics. Decision-theoretic statistics... covariates... These are all ways to find statistics that generate a point estimate and a sampling distribution, and hence a CI.

Bayesian

The Frequentist assumes θ is fixed and the data retain their stochastic character (via the Frequency interpretation). Bayesians reverse this: θ is a random variable Θ and the data, once obtained, are treated as fixed. So you condition on them. Where does the distribution on Θ come from? We make it up. Call it the density w(θ). We still have the conditional density for X given Θ = θ, which we write as f(·|θ). The joint density for (Θ, X^n) is

w(θ)f(x^n|θ) = w(θ|x^n) m(x^n),

where m(x^n) = ∫ w(θ)f(x^n|θ) dθ is called the mixture distribution, or the marginal for the data. (A grid sketch of these objects follows.)
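A minimal sketch of the objects just defined, on a grid: w(θ) is an invented prior on a Bernoulli success probability, f(x^n|θ) the likelihood of some invented 0/1 data, m(x^n) the marginal obtained by integrating their product, and w(θ|x^n) the posterior.

import numpy as np

# Grid approximation of w(theta | x^n) for coin-flip data with a made-up prior.
theta = np.linspace(0.0005, 0.9995, 1000)
d = theta[1] - theta[0]

prior = 2 * theta                                # an invented prior density w(theta) on (0, 1)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])           # invented 0/1 data
lik = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

joint = prior * lik                              # w(theta) f(x^n | theta)
m_x = joint.sum() * d                            # mixture / marginal m(x^n)
posterior = joint / m_x                          # w(theta | x^n)

print("m(x^n) =", m_x)
print("posterior mean =", (theta * posterior).sum() * d)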
Bayesian Inference

Bayesians make inferences from the posterior w(θ|x^n). A 1 − α credible set R = Rα(x^n) is any set of parameter values satisfying W(R|x^n) = 1 − α. This does not require the Frequency interpretation. The interpretation of a credibility region R is that, conditional on the data, we have a set that contains 1 − α posterior probability. The posterior density is the Bayesian's analog of the sampling distribution. There is no repeated-sampling assumption (which might not be satisfied), just a direct statement about where θ is, conditional on the data.

Choosing the prior

There are two approaches to prior selection.
Subjective (snide): the investigator consults his/her feelings and impressions about where θ might be and draws curves to represent this, trying to find a mathematical form that fits them.
Subjective (fair): the investigator reflects carefully on the relevant information about θ that might be available and tries to formulate a prior density that summarises it.
Objective: the prior is chosen by some kind of auxiliary principle, e.g., noninformativity, usually an optimization or invariance criterion.
Choose a class of priors and evaluate the stability of inferences over the class.

Types of Bayes Estimator

Decision theoretic: choose a loss function L(·, ·) and find

δB(x^n) = arg min_δ ∫ L(θ, δ(x^n)) w(θ|x^n) dθ.

The integral is called the posterior risk of δ.
Posterior mode: analogous to the MLE, choose θ̂_PM = arg max_θ w(θ|x^n). Actually, |θ̂_MLE − θ̂_PM| = O_P(1/n).
Conventionally, the Bayesian wants to see the whole posterior, because the shape of the curve explains the variability better than Var(Θ|X = x) can.

Variability

As noted, given w(θ|x^n), the Bayesian might use the posterior variance ∫ (θ − E(Θ|X^n = x^n))² w(θ|x^n) dθ, where E(Θ|x^n) = ∫ θ w(θ|x^n) dθ is the posterior mean. Just as X̄ → µ and (X̄ − µ)/(σ/√n) → N(0, 1), posterior quantities have analogous properties.
Bayesian LLN: E(Θ|X^n) → µ.
Bayesian CLT: the posterior distribution of √(n I(θ̂)) (θ − θ̂) tends to N(0, 1). (Note θ̂ = E(Θ|X^n) is one choice.)
Where is the variability?

Importance of variability: 1 m ± 1 cm vs. 1 m ± 1 km. The Bayesian says the data, once obtained, are no longer stochastic; they are the fixed outcomes of a RV and so you condition on them. The Bayesian says the variability is transmuted from the data to the parameter by way of the posterior distribution for the parameter, which is conditional on the data. Thus, the Survey Sampler thinks in terms of subsets of a specific population; a Frequentist thinks in terms of repeated sampling; a Bayesian thinks of what the resulting posterior says about the parameter given the data. Outcomes remain stochastic for the Frequentist, not for the Bayesian.

Special Cases

Conjugate priors: choose a prior from a class so that the posterior is in the same class. Depends on the likelihood. Non-parametric Bayes has exactly the same structure as parametric Bayes. Bayesian prediction comes from the predictive distribution:

m(x_{n+1}|x^n) = ∫ f(x_{n+1}|θ) w(θ|x^n) dθ.

That is, A is a 1 − α prediction region if M_{n+1}(X_{n+1} ∈ A|x^n) = 1 − α, still conditional on x^n, much as in the Frequentist case. In IID cases, Bayes and Frequentist methods for estimation are often asymptotically equivalent.

Testing

Bigger differences arise in hypothesis testing: the p-value is obtained from the sampling distribution and has a very different meaning from a Bayes factor. Bayes testing is decision-theoretic, using 0-1 loss. Bayes testing of H0: θ ∈ S vs H1: θ ∈ S^c is based on W(S|x^n) or, equivalently, on the Bayes factor

BF = [ W(S|x^n)/W(S) ] / [ W(S^c|x^n)/W(S^c) ].

This is the ratio of the posterior odds to the prior odds. (A numerical sketch is given below.) Contrast: Frequentist testing uses the Neyman-Pearson Lemma, which is an optimization of P(reject H0 | H1 true) subject to P(reject H0 | H0 true) ≤ α; both probabilities are computed in the sampling distribution of the test statistic.

Likelihood; Information-Theoretic

Likelihood = the conditional density for X given θ, but regarded as a function of θ for fixed X = x: L(θ|x) = f(x|θ). LP (the Likelihood Principle): all inferences should come only from the likelihood. One gets intervals like {θ : L(θ̂|x^n) − L(θ|x^n) ≤ t} for thresholds t. No notion of confidence or credibility.
Information-theoretic: the idea is that the central features of models and data are expressible in terms of measures of complexity (Kolmogorov complexity, VC-dimension, codelength). E.g., choose a model p̂ by

p̂ = arg min_{p ∈ P} [ L(p) + L(x^n|p) ],

where L(p) is the codelength for p and L(x^n|p) is the Shannon codelength for x^n given p. Includes maxent, relative-entropy criteria, MML, MDL, etc.
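Returning to the Bayes factor defined in the Testing slide above, here is a sketch for a Bernoulli success probability with a conjugate Beta prior; the prior, the split S = {θ ≤ 0.5}, and the counts are all invented for illustration.

from scipy import stats

# Bernoulli likelihood with a Beta(a, b) prior (conjugate); all numbers invented.
a, b = 2.0, 2.0
successes, failures = 13, 7

prior = stats.beta(a, b)
post = stats.beta(a + successes, b + failures)   # posterior is Beta(a+s, b+f)

S = 0.5                                 # test H0: theta <= 0.5 vs H1: theta > 0.5
post_odds = post.cdf(S) / (1 - post.cdf(S))
prior_odds = prior.cdf(S) / (1 - prior.cdf(S))
bayes_factor = post_odds / prior_odds   # posterior odds / prior odds, in favour of H0

print("W(S | x^n) =", round(post.cdf(S), 4))
print("Bayes factor for H0 =", round(bayes_factor, 4))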
Predictive

Prequential Principle, Dawid (1984): the method of evaluation of a predictor should be disjoint from its method of construction, e.g., depend only on the predictions it makes and the future data. Typically one looks at something like

CPE = Σ_{i=n1}^{n2} ( Ŷ_{i+1}(x_{i+1}; x^i) − Y_{i+1}(x_{i+1}) )²

as a way to evaluate a predictor Ŷ_{i+1}. Inference comes from prediction errors. Other principles: variance/bias, robustness, complexity, etc. Predictive criteria are most important with complex and high-dimensional data, where modeling is impossible.

Another perspective...

Fiducialist: Wang, Hannig, Iyer (2011). This was a weird idea due to Fisher that never worked, but from time to time people try to make it work. Regard X as X = G(θ, U), where U has the error distribution. Define Q(x, u) = {θ : G(θ, u) = x}; Q is like an inverse of G for fixed u. The fiducial distribution for θ is that of Q(x, U*) given Q(x, U*) ≠ ∅, where U* is an independent copy of U. Still fairly complicated and under development, but an interesting alternative.

The Normal Example

Frequentist

X ∼ N(µ, σ²) has density

f(x|µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.

n IID outcomes X^n = x^n satisfy: X̄ ∼ N(µ, σ²/n); (n−1)S²/σ² = (1/σ²) Σ_{i=1}^n (xi − x̄)² ∼ χ²_{n−1}; and X̄ and S² are independent. So √n(X̄ − µ)/σ ∼ N(0, 1) and √n(X̄ − µ)/S ∼ t_{n−1}, which give CIs for µ and σ. t_{n−1} = N(0, 1)/√(χ²_{n−1}/(n−1)); it tends to N(0, 1) as n → ∞ but otherwise has heavier tails; t_{k+1} has moments of order at most k.

Bayes version

X IID N(µ, σ²); assume σ fixed, estimate µ. (If we use X̄, set σ² = σ′²/n.) Use a prior w(µ) on µ: µ ∼ N(θ, τ²), with θ, τ known. Joint density:

h(µ, x) = w(µ)f(x|µ) = (1/(2πστ)) e^{−(1/2)[(µ−θ)²/τ² + (x−µ)²/σ²]}.

Bayes rule: w(µ)f(x|µ) = w(µ|x)m(x). So to find w(µ|x), let ρ = (τ² + σ²)/(τ²σ²) and complete the square:

(µ−θ)²/τ² + (x−µ)²/σ² = ρ[µ − (1/ρ)(θ/τ² + x/σ²)]² + (θ−x)²/(σ²+τ²).

This gives a useful form for h(µ, x).

Bayes continued

To find w(µ|x), divide h(x, µ) by m(x):

m(x) = ∫ h(x, µ) dµ = (1/√(2π(σ²+τ²))) e^{−(x−θ)²/(2(σ²+τ²))},

i.e., marginally X ∼ N(θ, σ²+τ²). Now w(µ|x) = h(x, µ)/m(x) equals

√(ρ/(2π)) e^{−(ρ/2)(µ − (1/ρ)(θ/τ² + x/σ²))²}.

Thus (µ|x) has a N(µ(x), 1/ρ) distribution, where

µ(x) = (σ²/(σ²+τ²)) θ + (τ²/(σ²+τ²)) x.

We can now get credible sets from N(µ(x), 1/ρ) for any x. If σ is not known, we still get the t_{n−1} distribution... (A numerical check of these formulas follows.)
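A quick numerical check of the formulas above: compute µ(x) and 1/ρ in closed form and compare them with a brute-force grid posterior. All parameter values are invented.

import numpy as np

# Closed form vs grid for the N(theta0, tau^2) prior and one observation x ~ N(mu, sigma^2).
theta0, tau, sigma, x = 1.0, 2.0, 1.5, 4.0    # invented values

rho = (tau**2 + sigma**2) / (tau**2 * sigma**2)
mu_x = (sigma**2 / (sigma**2 + tau**2)) * theta0 + (tau**2 / (sigma**2 + tau**2)) * x
print("closed form mean, variance:", mu_x, 1 / rho)

# Brute-force grid posterior proportional to prior * likelihood.
mu_grid = np.linspace(-20, 20, 40001)
d = mu_grid[1] - mu_grid[0]
joint = np.exp(-0.5 * ((mu_grid - theta0) / tau) ** 2) * \
        np.exp(-0.5 * ((x - mu_grid) / sigma) ** 2)
post = joint / (joint.sum() * d)
mean = (mu_grid * post).sum() * d
var = ((mu_grid - mean) ** 2 * post).sum() * d
print("grid check mean, variance: ", mean, var)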
Bayesian, unknown σ

Let Xi ∼ N(µ, σ²) IID for i = 1, ..., n and assume w(µ, σ²) ∝ 1/σ². It can be shown that w(µ|σ², x^n) is N(x̄, σ²/n). Writing w(µ|x^n), we have

w(µ|x^n) = ∫ w(µ|σ², x^n) w(σ²|x^n) dσ².

It can be shown that w(µ|x^n) is a t_{n−1}(x̄, s²/n); that is, w((µ − x̄)/(s/√n) | x^n) ∼ t_{n−1}. From w(µ, σ²|x^n) ∝ w(µ, σ²) f(x^n|µ, σ²) we can get

w(σ²|x^n) ∝ ∫ (1/σ²)^{n/2+1} e^{−(1/(2σ²))[(n−1)s² + n(x̄−µ)²]} dµ,

i.e., a scaled Inv-χ²(n−1, s²) = Inv-Gamma((n−1)/2, (n−1)s²/2).

Information-theory

The entropy of a RV is H(X) = −∫ p(x) log p(x) dx; the entropy is the minimal noiseless codelength. The normal family can be derived from the following. Consider k functions f1, ..., fk and numbers a1, ..., ak, and suppose E(fj(X)) = aj for j = 1, ..., k. The maximum entropy distribution for X, if it exists, is

p(x) = c e^{Σ_{j=1}^k λj fj(x)},

where c and the λj's are chosen so that ∫ p(x) dx = 1 and the constraints E(fj(X)) = aj hold. Given the first two moments, solving this gives the normal: if k = 2, let f1(x) = x and f2(x) = x². Another coding argument gives estimates for µ and σ by two-stage coding, Barron and Cover (1990).

Predictive

Use x^n to predict X_{n+1} — the analog of estimating µ. Consider the sample mean X̄. If no model is assumed, set µ = E(Xi) and σ² = Var(Xi). Use standard inequalities (Markov, triangle, Cauchy-Schwarz) to obtain, for given τ > 0,

P( |X̄ − X_{n+1}| ≥ σ(1 + 1/√n)/τ ) ≤ (τ/(σ(1 + 1/√n))) [ (E|X̄ − µ|²)^{1/2} + (E|µ − X_{n+1}|²)^{1/2} ] ≤ τ.

For τ < 1, the Frequentist PI with known σ is X̄ ± σ(1 + 1/√n)/τ.

Normal Case

If Xi ∼ N(µ, σ²), then X̄ − X_{n+1} ∼ N(0, σ²(1 + 1/n)) and the prediction interval becomes X̄ ± z_{1−α} σ(1 + 1/n)^{1/2}, where z_{1−α} is the 100(1 − α) percentile of N(0, 1). So, if τ = 1/z_{1−α}, the only difference between the general case and the normal case is (1 + 1/n)^{1/2} versus (1 + 1/√n), which is asymptotically negligible. If σ is unknown, the PIs become

X̄ ± (σ̂/τ) √((n−1)/(n−3)) √(1 + 1/n).

An exact form for the normal can be found in this case too; see Geisser (1995), Chap. 2.

Bayes Prediction

As above, the posterior mean is E(Θ|x^n) = (τ²/(σ²/n + τ²)) x̄ + ((σ²/n)/(σ²/n + τ²)) θ, and the posterior variance is 1/ρ = τ²σ²/(nτ² + σ²) = O(1/n). So m(x_{n+1}|x^n) = N( E(X_{n+1}|X^n = x^n), Var(X_{n+1}|X^n = x^n) ), with E(X_{n+1}|X^n = x^n) = E(Θ|X^n = x^n) and Var(X_{n+1}|X^n = x^n) = σ² + Var(Θ|X^n = x^n). The PI is now

E(Θ|X^n = x^n) ± z_{1−α/2} √Var(X_{n+1}|X^n = x^n).

As before, this can be extended to the case where σ is not known by putting a prior on it. (A comparison of the two prediction intervals is sketched below.)
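A sketch comparing the two prediction intervals just described on one invented data set, with σ treated as known: the Frequentist interval X̄ ± z σ√(1 + 1/n) and the Bayesian interval built from the predictive mean and variance.

import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, tau, theta0, n = 5.0, 2.0, 3.0, 0.0, 30   # all invented
x = rng.normal(mu_true, sigma, size=n)
xbar = x.mean()
z = 1.959964                                              # roughly z_{0.975}

# Frequentist PI for X_{n+1} with sigma known.
half_f = z * sigma * np.sqrt(1 + 1/n)
freq = (xbar - half_f, xbar + half_f)

# Bayesian predictive: posterior for mu is N(post_mean, 1/rho); predictive variance adds sigma^2.
rho = (n * tau**2 + sigma**2) / (tau**2 * sigma**2)
post_mean = (tau**2 / (sigma**2/n + tau**2)) * xbar + ((sigma**2/n) / (sigma**2/n + tau**2)) * theta0
pred_var = sigma**2 + 1/rho
bayes = (post_mean - z * np.sqrt(pred_var), post_mean + z * np.sqrt(pred_var))

print("frequentist PI:", freq)
print("Bayes PI:      ", bayes)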
Sufficiency and Exponential Families

Exponential Families

A parametric family f(x|θ) is of exponential form ⇐⇒ ∃K such that f(x|θ) can be written as

f(x|θ) = h(x) c(θ) e^{Σ_{k=1}^K wk(θ) tk(x)}.

The support of X is independent of θ. The K functions tk are sufficient for θ, even as n increases. Convenient with independence: exponents add. Natural form: replace wk(θ) by ηk. c(θ) is the normalizing constant; h(x) does not depend on θ. Examples: Normal, Poisson, Binomial, Exponential, Gamma, Dirichlet, χ²_k, etc. Non-exponential families: Unif[0, θ], Cauchy, Laplace, t_n, ...

Sufficiency

T(X) is sufficient for θ in f(x|θ) ⇐⇒ the conditional distribution of X given T(X) is independent of θ. Formally, T is sufficient ⇐⇒ P(X = x | T(X) = t, θ) = P(X = x | T(X) = t). Fisher's factorization criterion for sufficiency: T is sufficient for θ ⇐⇒ f(x^n|θ) = g(T(x^n)|θ) h(x^n). T partitions the sample space into sets τ_t = {x^n : T(x^n) = t}, so two x^n's in the same τ_t lead to the same inferences. If T is sufficient, Frequentists like to use it for inference (especially if it is minimal). If T is sufficient, then w(θ|x^n) = w(θ|T(x^n)).

Examples of Sufficiency

If X^n is IID from a full-rank K-parameter natural exponential family, then T(x^n) = (Σ_{i=1}^n t1(xi), ..., Σ_{i=1}^n tK(xi)) is (minimal) sufficient for θ. Sufficient statistics are not unique: any one-one function of a sufficient statistic is also sufficient. A sufficient statistic T is minimal ⇐⇒ for any other sufficient statistic S, ∃g so that T = g(S). Test for minimality: suppose f(x^n|θ)/f(y^n|θ) is constant as a function of θ ⇐⇒ T(x^n) = T(y^n); then T is minimal sufficient.

An Unusual Example

Let Xi ∼ Unif[θ, θ + 1] be IID. The joint pdf is

f(x^n|θ) = 1 if x_(n) − 1 < θ < x_(1), and 0 otherwise.

Check the condition of the theorem: f(x^n|θ)/f(y^n|θ) is constant in θ ⇐⇒ x_(n) = y_(n) and x_(1) = y_(1). So T(X^n) = (X_(1), X_(n)) is minimal sufficient. T is two-dimensional but θ is one-dimensional; f is not an exponential family. In a full-rank exponential family of order K, there are K sufficient statistics.

Sufficiency and Ancillarity

One extreme: the ordered data X_(1), ..., X_(n) are always a sufficient statistic. The other extreme is that exponential families are almost characterized by having a finite-dimensional sufficient statistic. Suppose X^n is IID f(·|θ) with support independent of θ; then there is a K-dimensional sufficient statistic (for every n) ⇐⇒ f(·|θ) is of exponential form. What about functions of the data that are not sufficient?
The question is whether they depend on the parameter... or carry other information about the parameter. A statistic S(X^n) is ancillary ⇐⇒ the distribution of S(X^n) does not depend on θ.

Properties of Ancillarity

Example, location-scale family: suppose we have a RV Z with density f and write X = σZ + µ. Then X has density f(x|µ, σ) = (1/σ) f((x − µ)/σ). Assume IID data from f(x|µ, σ) and let R = X_(n) − X_(1) be the range. The DF of R is

F_{R,µ}(r) = Pµ(R ≤ r) = Pµ(X_(n) − X_(1) ≤ r) = Pµ((X_(n) − µ) − (X_(1) − µ) ≤ r) = P_f(Z_(n) − Z_(1) ≤ r/σ),

which is constant in µ. So R is ancillary for µ. In a scale family, any statistic that is a function of the ratios Xi/Xn will be ancillary. (Think of the distribution function of (1/Xn)(X1, ..., Xn−1).)

Completeness

A family of densities f(t|θ) for a statistic T is complete ⇐⇒ Eθ g(T) = 0 for all θ implies g = 0 a.e. X̄ is complete for N(µ, 1). In a full-rank exponential family, T(X^n) = (Σ_{i=1}^n T1(Xi), ..., Σ_{i=1}^n TK(Xi)) is complete. Basu's theorem: if T is complete and minimal sufficient, then it is independent of every ancillary statistic. Curious fact: the distribution of an ancillary statistic does not depend on θ, but there can be information in the ancillary about θ (the Fraser–Monette example). Basu's theorem eliminates this possibility. A minimal sufficient statistic therefore has all the information in the data about the parameter.

Natural Sufficient Statistic

Let X^n be an IID sample from an exponential family and write Tk(X^n) = Σ_{i=1}^n Tk(Xi). Suppose the set of (w1(θ), ..., wK(θ)) contains an open set in ℝ^K and the range of (T1(X^n), ..., TK(X^n)) contains an open set in ℝ^K. Then the distribution of (T1, ..., TK) as RVs is

f_T(u1, ..., uK|θ) = H(u1, ..., uK) c^n(θ) e^{Σ_{k=1}^K wk(θ) uk}.

The sampling distribution of the complete, minimal sufficient summary statistics of an exponential family is itself of exponential form.

Main Frequentist Estimators

MOM

Suppose X^n is IID f(x|θ1, ..., θK). Equate the first K sample moments to the first K population moments and solve for estimates of the θk's. Write m1 = (1/n) Σ_{i=1}^n Xi, ..., mK = (1/n) Σ_{i=1}^n Xi^K, and µ1(θ1^K) = E(X), ..., µK(θ1^K) = E(X^K). Solve the K equations mk = µk(θ1^K) for the K unknowns θ1^K.

N(µ, σ²): we get µ̂ = X̄ and estimate E(X²) by (1/n) Σ_{i=1}^n Xi², so σ̂² = (1/n) Σ_{i=1}^n Xi² − µ̂².

Xi ∼ Bin(k, p) with k and p both unknown: X̄ estimates E(X) = kp and the second sample moment estimates E(X²) = kp(1 − p) + (kp)². Solving gives p̂ = X̄/k̂ and

k̂ = X̄² / ( X̄ − (1/n) Σ_{i=1}^n (Xi − X̄)² ),   which can be < 0!

Can use CLTs to get asymptotic normality and CIs. (A quick numerical illustration follows.)
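A short illustration of the two MOM examples (data simulated from invented parameter values); note that k̂ in the binomial case can come out negative or absurdly large when the sample variance is close to or exceeds the sample mean.

import numpy as np

rng = np.random.default_rng(2)

# MOM for N(mu, sigma^2): match the first two moments (invented true values).
x = rng.normal(3.0, 2.0, size=200)
mu_hat = x.mean()
sigma2_hat = (x**2).mean() - mu_hat**2
print("normal MOM:  ", round(mu_hat, 3), round(sigma2_hat, 3))

# MOM for Bin(k, p) with both k and p unknown.
y = rng.binomial(10, 0.3, size=50)
ybar, s2 = y.mean(), ((y - y.mean())**2).mean()
k_hat = ybar**2 / (ybar - s2)      # unstable: negative if s2 > ybar
p_hat = ybar / k_hat
print("binomial MOM:", round(k_hat, 3), round(p_hat, 3))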
MLE's

Recall the MLE is θ̂ = arg max_θ f(x^n|θ). Properties:

Not always unique: for X ∼ Unif[θ, θ + 1] with one observation x,

L(θ|x) = 1 if θ < x < θ + 1, and 0 otherwise,

so any θ ∈ (x − 1, x) is an MLE.

Discrete case: Xi ∼ Bin(k, p) for i = 1, ..., n; what is the MLE for k? Then

L(k, p|x^n) = Π_{i=1}^n (k choose xi) p^{xi} (1 − p)^{k−xi}.

Observe L(k|x^n) = 0 for k < max_i xi, so k̂ ≥ max_i xi. We want the least k satisfying L(k|x^n) ≥ L(k − 1|x^n) and L(k + 1|x^n) ≤ L(k|x^n), and then argue uniqueness.

Uniqueness Result for MLE's

One may need tricks to find MLE's; we can't always solve (log L(θ|x))′ = 0. Discrete parameters and boundary problems of the parameter space can cause trouble. It is often helpful to take logs. In exponential families we have MLE's: let

f(x|θ) = e^{Σ_{k=1}^K wk(θ)Tk(x) + log c(θ) + log h(x)}.

If C is the interior of the range of (w1(θ), ..., wK(θ)) ⊂ ℝ^K for θ ∈ Θ, and the equations Eθ Tk(X) = Tk(x^n) have a solution θ̂ = (θ̂1(x^n), ..., θ̂K(x^n)) for which (w1(θ̂), ..., wK(θ̂)) ∈ C, then the MLE is unique.

Invariance

If θ̂ = MLE(θ), then for any τ(θ), MLE(τ(θ)) = τ(θ̂). Proof: if τ is one-one and η = τ(θ), take sups on both sides of L*(η|x^n) = L(τ⁻¹(η)|x^n) = L(θ|x^n). If τ is not one-one, let L*(η|x^n) = sup_{θ: τ(θ)=η} L(θ|x^n) and proceed as before.

Example: suppose Yi = β0 + β1 Xi + ϵi with ϵi ∼ N(0, σ²). For fixed σ, the MLE's for β0 and β1 come from minimizing Σ_{i=1}^n (yi − β0 − β1 xi)². This is the exponent in the normal likelihood, so MLE(β0, β1) is the LSE. Putting β̂0 and β̂1 into ∂L(β̂0, β̂1, σ²|y^n)/∂σ² = 0 gives σ̂² = (1/n) Σ_{i=1}^n (yi − β̂0 − β̂1 xi)².

Asymptotics of the MLE I

A sequence of estimators Wn = Wn(X^n) is consistent for τ(θ) ⇐⇒ ∀θ, Wn → τ(θ) in probability. A sequence of estimators Wn is asymptotically efficient for τ(θ) ⇐⇒ Wn achieves the CRLB in the limit of large n:

lim_{n→∞} Varθ(Wn) / ( τ′(θ)² / (nI(θ)) ) = 1,

where I(θ) = Eθ((∂/∂θ) log f(X|θ))². Asymptotic normality means ∃V > 0 such that √n(θ̂ − θ) →_D N(0, V). MLE's are consistent, asymptotically normal and efficient.

Asymptotics of the MLE II

Under 'regularity' conditions we have that 1) θ̂ →_{pθ} θ and 2) √n(θ̂ − θ) →_D N(0, 1/I(θ)). Regularity conditions: (i) (∂/∂θ)^i f(·|θ) exists for i = 1, 2, 3; (ii) (∂/∂θ)^i f(·|θ) is bounded by integrable functions of x locally in θ, for i = 1, 2, 3; (iii) E[(∂/∂θ) ln f(X|θ)]² < ∞. By the delta method, h(θ̂) is also AN(h(θ), h′(θ)²/(nI(θ))). The same result holds with I(θ) replaced by Î(θ̂) = −(1/n) Σ_{i=1}^n (∂²/∂θ²) ln f(xi|θ̂) →_{pθ} I(θ). There is a multivariate version too: under regularity conditions, (θ̂1, ..., θ̂K) →_{pθ} (θ1, ..., θK) and √n(θ̂ − θ) →_D N(0, I(θ)⁻¹). (A simulation check of the normal limit is sketched below.)
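A simulation sketch of the normal limit for the MLE, using the exponential distribution with rate λ, for which the MLE is 1/X̄ and I(λ) = 1/λ²; the rate, sample size and number of replications are invented.

import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 200, 5000     # true rate, sample size, replications (invented)

mles = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1/lam, size=n)   # Exp with rate lam
    mles[r] = 1 / x.mean()                     # MLE of the rate

z = np.sqrt(n) * (mles - lam)
# Fisher information for the rate is I(lam) = 1/lam^2, so Var(z) should be near lam^2.
print("simulated var of sqrt(n)(mle - lam):", round(z.var(), 3))
print("1 / I(lam) = lam^2:                 ", lam**2)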
Best Unbiased

The MSE of an estimator θ̂ of θ is Eθ(θ̂ − θ)² = E(θ̂ − E θ̂)² + (E θ̂ − θ)², i.e., MSE = variance plus bias squared. Let's search the collection of unbiased estimators for the one with the smallest variance. Definition: θ̂ is the best unbiased estimator of θ ⇐⇒ Eθ θ̂ = θ and, ∀θ and ∀ unbiased θ*, Varθ(θ̂) ≤ Varθ(θ*). In Poisson(λ), λ̂ = X̄ and S² both estimate Varλ(X) = λ; one can show X̄ has a lower variance than S².

For IID data, the smallest variance is given by the CRLB:

Varθ(Wn) ≥ [ (d/dθ) Eθ Wn ]² / (nI(θ)).

If X ∼ fθ is independent of Y ∼ gθ, then I_{X,Y}(θ) = I_X(θ) + I_Y(θ). The Fisher information for Poisson(λ) is I(λ) = 1/λ. Since λ′ = 1 and Varλ(X̄) = λ/n, we see that X̄ attains the CRLB and so is best unbiased, or UMVU. Roughly, if the Xi are IID f(·|θ) and W(X^n) is unbiased for τ(θ), then W attains the CRLB ⇐⇒ ∃a(θ) so that

a(θ)( W(x^n) − τ(θ) ) = (∂/∂θ) ln f(x^n|θ).

Strange fact: the best unbiased estimator for σ² in N(µ, σ²) with µ known is (1/n) Σ(Xi − µ)²; so, if µ is unknown, you can't attain the CRLB.

UMVU and Sufficiency

Rao-Blackwell Theorem: let W be unbiased for τ(θ) and let T be sufficient for θ. Let ϕ(T) = E(W|T). Then (1) Eθ ϕ(T) = τ(θ) and (2) ∀θ we have Varθ(ϕ(T)) ≤ Varθ(W). UMVUE's are unique. Let W be unbiased for τ(θ); then W is UMVU ⇐⇒ W is uncorrelated with every unbiased estimator of 0. Theorem: let T be complete and sufficient for θ and let ϕ(T) be an estimator. Then ϕ(T) is the unique UMVUE for Eθ ϕ(T). Lehmann-Scheffé Theorem: let T be complete and sufficient for θ and let h(X) be unbiased for τ(θ). Then W = ϕ(T) = E(h(X)|T) is UMVUE for τ(θ).

Example

E.g., Xi ∼ Bernoulli(p) IID for i = 1, ..., n. Let τ(p) = Pp(a given trial is a success) = p. Then T(X^n) = Σ Xi is complete and sufficient, but not unbiased for τ(p). Let h(X^n) = χ{X1 = 1}; then E(h(X^n)) = τ(p). The theorem ⇒ ϕ(T) = E(h(X^n)|T) is UMVU for τ(p). Work out what ϕ is: ϕ(t) = E(h(X^n)|T = t) = P(X1 = 1|T = t). Writing out the conditional probability and simplifying gives

ϕ(T) = (n−1 choose T−1) / (n choose T) = T/n.
Neyman-Pearson Framework

Basic hypothesis testing problem: H : θ ∈ ΩH vs K : θ ∈ ΩK, with ΩH ∩ ΩK = ∅. Suppose we base our decision on an outcome of X. A test is defined by a region S: x ∈ S means do not reject H, while x ∈ S^c means reject H. There are 4 cases: θ ∈ ΩH or ΩK, and we choose H or K. We want P(Type I error) = P_H(reject H) low and P(Type II error) = 1 − P_K(reject H) low, i.e., the power P_K(reject H) high. The power function is Pθ(reject H), a function of θ. We want the power low on ΩH and high on ΩK.

NPFL intuition

A test defined by a region S is level α ⇔ Pθ(S^c) ≤ α for all θ ∈ ΩH. Suppose ΩH = {P0} and ΩK = {P1}. Then we want to find points for a set S so that we can reject on S^c, with

Σ_{x ∈ S^c} P0(x) ≤ α   and   Σ_{x ∈ S^c} P1(x) maximal.

The most valuable points have a high value of P1(x)/P0(x). So rank the points in decreasing order of P1(x)/P0(x) and keep adding them until you hit α in terms of P0. This leads to the Most Powerful (MP) test in the simple-vs-simple case. (This greedy construction is written out in the sketch at the end of this subsection.)

NPFL

Let P0 and P1 have pdf's p0 and p1. For testing H : P0 vs K : P1 there is a critical function ϕ and a constant k so that (1) E0(ϕ) = α and (2)

ϕ(x) = 1 if p1(x) > k p0(x), and ϕ(x) = 0 if p1(x) < k p0(x).

If a ϕ satisfies (1) and (2) for some k, it is MP for P0 vs P1. If ϕ is an MP level α test for P0 vs P1, then ∃k so that (2) is satisfied.

MP and Sufficiency

Consider H : θ = θ0 vs K : θ = θ1. Suppose T is sufficient for θ with density g(t|θ). Then any test based on T with rejection region S is an MP level α test if it satisfies

g(t|θ1) > k g(t|θ0) ⇒ t ∈ S   and   g(t|θ1) < k g(t|θ0) ⇒ t ∈ S^c

for some k > 0, where α = Pθ0(T ∈ S). We rarely test point nulls against point nulls, so we want a concept of uniformly most powerful tests, i.e., tests that are good for composite hypotheses.

Uniform NPFL

Consider H : θ ∈ ΩH vs K : θ ∈ ΩK with ΩH = ΩK^c, and suppose we have a test based on a sufficient statistic T with pdf g(t|θ) and rejection region S. If (1) the test is level α, (2) ∃θ0 ∈ ΩH with Pθ0(S) = α, and (3) for the same θ0 as in (2), ∀θ* ∈ ΩK there is a k > 0 so that

g(t|θ*) > k g(t|θ0) ⇒ t ∈ S   and   g(t|θ*) < k g(t|θ0) ⇒ t ∈ S^c,

then this test is UMP level α for H vs K. This result works for many one-sided tests, in particular for one-dimensional exponential families. Also: Monotone Likelihood Ratio, unbiasedness, etc.
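The greedy construction from the NPFL-intuition slide, written out for a simple-vs-simple binomial test (all numbers invented). Points are ranked by the likelihood ratio p1(x)/p0(x) and added to the rejection region until the level α is exhausted; randomization at the boundary point is ignored here.

from math import comb

# Simple-vs-simple test for X ~ Bin(10, p): H0 p = 0.3 vs H1 p = 0.6 (numbers invented).
n, p0, p1, alpha = 10, 0.3, 0.6, 0.05

def pmf(x, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Rank sample points by the likelihood ratio p1(x)/p0(x), largest first.
points = sorted(range(n + 1), key=lambda x: pmf(x, p1) / pmf(x, p0), reverse=True)

rejection, size = [], 0.0
for x in points:
    p0x = pmf(x, p0)
    if size + p0x > alpha:          # stop once the next point would exceed the level
        break
    rejection.append(x)
    size += p0x

power = sum(pmf(x, p1) for x in rejection)
print("rejection region:", sorted(rejection))
print("size:", round(size, 4), "power:", round(power, 4))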
Bayesian Estimation

Examples

We already saw that a normal prior with a normal likelihood gives a normal posterior for estimating the mean. Let P be a class of prior densities and let F be a class of densities from a parametric family. P is conjugate to F ⇐⇒ ∀f ∈ F, ∀w ∈ P, ∀x^n: w(θ|x^n) ∈ P. That is, the posterior is in the same class as the prior. Suppose the Xi are IID Poisson(λ), so f(x^n|λ) = λ^{nx̄} e^{−nλ} / Π_{i=1}^n xi!. Let λ ∼ Gamma(α, β). Then the joint density is

h(x^n, λ) = λ^{nx̄+α−1} e^{−λ(n + 1/β)} / ( Γ(α) β^α Π xi! ).

So w(λ|x^n) is a Gamma(nx̄ + α, (n + 1/β)⁻¹). (This is checked numerically in the sketch below.)

Conjugacy and Exponential Families

Recall the exponential family

f(x^n|θ) = e^{ Σ_{k=1}^K ck(θ) Σ_{i=1}^n Tk(xi) + Σ_{i=1}^n S(xi) + n d(θ) }.

We obtain a conjugate prior by using the sufficient statistics and n. Let tk = Σ_{i=1}^n Tk(xi) for k = 1, ..., K and t_{K+1} = n. Let w(t1, ..., t_{K+1}) = ∫ e^{Σ_{k=1}^K ck(θ)tk + t_{K+1} d(θ)} dθ, and let Ω = {t1^{K+1} : w(t1^{K+1}) < ∞}. The K + 1 parameter exponential family given by

w(θ|t1^{K+1}) = e^{ Σ_{k=1}^K ck(θ)tk + t_{K+1} d(θ) − ln w(t1^{K+1}) }

is conjugate to f(x^n|θ); w(t1^{K+1}) is the normalizing constant, so the data are held fixed.

Basic Setup

If you have a good idea about the various costs of the ways to be wrong, you can make decisions that minimize the average cost of errors. We have a model f(·|θ), a parameter space Θ, and data x from a sample space. When we get data we make a decision about where θ is by using a rule δ. The collection of rules we allow ourselves is A, the action space. We measure cost by a loss function L : Θ × A → ℝ. A can be Θ (estimation) or {accept H, reject H} (testing). L(θ, δ(x)) is the cost of δ(x) when θ is the state of nature. How to choose a good δ? Try averaging: R(θ, δ) = Eθ L(θ, δ(X)), the expected loss.

Bayes Optimality

R(θ, δ) is called the risk of δ. Smaller risk is better, so we can compare the curves gδ(θ) = R(θ, δ) for various δ. If the curve for a δ does not admit a uniform improvement, then δ is called admissible. This is hard to work with. Assume a prior w(θ) and form the Bayes risk:

R(w, δ) = Ew R(Θ, δ) = ∫ w(θ) R(θ, δ) dθ = ∫∫ m(x) w(θ|x) L(θ, δ(x)) dx dθ = Em[ E_{w(·|x)} L(Θ, δ(X)) ].

The posterior risk (the quantity in brackets) gives the Bayes estimator:

δB = arg min_δ R(w, δ) = arg min_δ R(w, δ|x).

The Bayes or posterior risk (they are equivalent) depends on w, not θ. An alternative is the minimax (mM) approach: minimize the maximum risk, i.e., find

δ = arg min_δ max_θ R(θ, δ),   with value R_mM.

Or, maximize the minimum risk (maximin, Mm):

δ = arg max_w min_δ R(w, δ),   with value R_Mm.

Both are global criteria; the Mm risk is based on the Bayes estimator δB. R_Mm ≤ R_mM; the Game Theorem asserts they are equal.
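A numerical check of the Poisson–Gamma conjugacy from the Examples slide above: the closed-form Gamma(nx̄ + α, (n + 1/β)⁻¹) posterior is compared with a brute-force grid posterior built from prior × likelihood. All numbers are invented.

import numpy as np
from scipy.stats import gamma, poisson

# Poisson(lambda) data with a Gamma(alpha, beta) prior (beta is a scale); numbers invented.
alpha, beta = 3.0, 1.0
x = np.array([2, 4, 1, 3, 5, 2])
n, xbar = len(x), x.mean()

post = gamma(a=n * xbar + alpha, scale=1.0 / (n + 1.0 / beta))   # claimed posterior

# Brute-force grid posterior proportional to prior * likelihood.
lam = np.linspace(1e-4, 15, 20000)
d = lam[1] - lam[0]
unnorm = gamma.pdf(lam, a=alpha, scale=beta) * np.prod(poisson.pmf(x[:, None], lam), axis=0)
grid_post = unnorm / (unnorm.sum() * d)

print("closed-form posterior mean:", round(post.mean(), 4))
print("grid posterior mean:       ", round((lam * grid_post).sum() * d, 4))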
Two Loss Functions

Given a loss function we can, in principle, work out the Bayes estimator (and the mM, Mm, admissible rules, etc.). Suppose L(θ, δ) = (θ − δ)². Then

arg min_δ ∫ w(θ|x)(θ − δ(x))² dθ = E(Θ|x).

Suppose L(θ, δ) = |θ − δ|. Then

arg min_δ ∫ w(θ|x)|θ − δ(x)| dθ = median of w(θ|x).

Other loss functions give percentiles (asymmetric absolute value), or m(x) (relative entropy), LINEX loss, etc.

Generalized 0-1 Loss

Bayes testing is the Bayes action under generalized 0-1 loss, i.e., it has the smallest Bayes risk. Consider H : θ ∈ ΩH vs K : θ ∈ ΩK. Let a0 (a1) be the action that we choose H (K). So, given a rule δ, the acceptance region for H is {x : δ(x) = a0}. Define the loss

L(θ, a) = 0 if θ ∈ ΩH, a = a0;  0 if θ ∈ ΩK, a = a1;  c1 if θ ∈ ΩH, a = a1;  c2 if θ ∈ ΩK, a = a0.

c1 (c2) is the cost of a Type I (II) error.

Risk of a Test

Let R = {x : δ(x) = a1} and let β(θ) = Pθ(X ∈ R). Then

R(θ, δ) = c1 β(θ) for θ ∈ ΩH,  and  c2 (1 − β(θ)) for θ ∈ ΩK.

So, if c1 = c2 = 1, we have the usual power function of a Frequentist test. Theorem: under generalized 0-1 loss, any test of the form "reject H : θ ∈ ΩH when W(ΩH|x) < c2/(c1 + c2)" is Bayes optimal. That is,

δB(x) = χ( W(ΩH|x) < c2/(c1 + c2) )

is the Bayes optimal test and achieves min_δ R(w, δ|x).

A Final Example

Suppose X̄ ∼ N(θ, σ²) (any 1/n factor absorbed into σ²) and w is N(µ, τ²), with µ, σ and τ known. Let η = σ²/(σ² + τ²). Then (Θ|x̄) has distribution

N( E(Θ|x̄), Var(Θ|x̄) ) = N( (1 − η)x̄ + ηµ, τ²η ).

Test H : θ ≥ θ0 vs K : θ < θ0. Then

W(H|x̄) = W(Θ ≥ θ0|x̄) = P( N(0, 1) > (θ0 − (1 − η)x̄ − ηµ)/(τ√η) ).

Now

W(H|x̄) ≤ c2/(c1 + c2) = α ∈ (0, 1)
⇐⇒ (θ0 − (1 − η)x̄ − ηµ)/(τ√η) ≥ z_{1−α}
⇐⇒ x̄ ≤ θ0 − ( η(µ − θ0) + z_{1−α} τ√η ) / (1 − η).

Thus, rejecting H for small values of x̄ is Bayes optimal, where the threshold depends on c1, c2 and the prior. (This threshold is computed in the sketch below.)
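A sketch of the final example: with invented values of µ, τ, σ, θ0 and the costs c1, c2, compute the rejection threshold for x̄ and confirm that at the threshold the posterior probability of H equals α = c2/(c1 + c2).

from math import sqrt
from scipy.stats import norm

# Bayes-optimal one-sided test of H: theta >= theta0 (all numbers invented).
mu, tau, sigma = 0.0, 1.0, 1.0       # prior mean/sd and sampling sd of xbar
theta0, c1, c2 = 0.2, 4.0, 1.0       # null cutoff and the two error costs
alpha = c2 / (c1 + c2)

eta = sigma**2 / (sigma**2 + tau**2)

def posterior_prob_H(xbar):
    # W(H | xbar) with Theta | xbar ~ N((1 - eta) xbar + eta mu, tau^2 eta).
    post_mean = (1 - eta) * xbar + eta * mu
    return 1 - norm.cdf(theta0, loc=post_mean, scale=tau * sqrt(eta))

# Closed-form rejection threshold for xbar derived in the slides.
z = norm.ppf(1 - alpha)
threshold = theta0 - (eta * (mu - theta0) + z * tau * sqrt(eta)) / (1 - eta)

print("reject H iff xbar <", round(threshold, 4))
print("W(H | threshold) =", round(posterior_prob_H(threshold), 4), "(should be about", alpha, ")")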