Chapter 1  Basic Concepts

Thomas Bayes (1702-1761): two articles from his pen, published posthumously in 1764 by his friend Richard Price.
Laplace (1774): stated the theorem on inverse probability in general form.
Jeffreys (1939): rediscovered Laplace's work.

Example 1:
$y_i$, $i = 1, 2, \ldots, n$: the lifetimes of batteries. Assume $y_i \sim N(\mu, \sigma^2)$. Then,
$$p(y \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right], \qquad y = (y_1, \ldots, y_n)^t.$$
To obtain information about the values of $\mu$ and $\sigma^2$, two methods are available:
(a) Sampling theory (frequentist): $\mu$ and $\sigma^2$ are the hypothetical true values. We can use
point estimation: finding statistics $\hat{\mu}(y)$ and $\hat{\sigma}^2(y)$ to estimate $\mu$ and $\sigma^2$, for example,
$$\hat{\mu}(y) = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}, \qquad \hat{\sigma}^2(y) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1};$$
interval estimation: finding interval estimates $(\hat{\mu}_1(y), \hat{\mu}_2(y))$ and $(\hat{\sigma}_1^2(y), \hat{\sigma}_2^2(y))$ for $\mu$ and $\sigma^2$, for example, the interval for $\mu$
$$\left(\bar{y} - z_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{y} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right), \qquad P\left(Z \geq z_{\alpha/2}\right) = \frac{\alpha}{2}, \quad Z \sim N(0,1).$$
(b) Bayesian approach: introduce a prior density $\pi(\mu, \sigma^2)$ for $\mu$ and $\sigma^2$. Then, after some manipulations, the posterior density (the conditional density given $y$) $f(\mu, \sigma^2 \mid y)$ can be obtained. Based on the posterior density, inferences about $\mu$ and $\sigma^2$ can be made.

Example 2:
$X \sim b(10, p)$: the number of wins for some gambler in 10 bets, where $p$ is the probability of winning. Then,
$$f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}, \qquad x = 0, 1, 2, \ldots, 10.$$
(a) Sampling theory (frequentist): to estimate the parameter $p$, we can employ the maximum likelihood principle. That is, we try to find the estimate $\hat{p}$ maximizing the likelihood function
$$l(p \mid x) = f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}.$$
For example, as $x = 10$,
$$l(p \mid x) = l(p \mid 10) = \binom{10}{10} p^{10} (1-p)^{0} = p^{10}.$$
Thus, $\hat{p} = 1$. It is a sensible estimate: since we won all the time, the sensible estimate of the probability of winning should be 1. On the other hand, as $x = 0$,
$$l(p \mid x) = l(p \mid 0) = \binom{10}{0} p^{0} (1-p)^{10} = (1-p)^{10}.$$
Thus, $\hat{p} = 0$. Since we lost all the time, the sensible estimate of the probability of winning should be 0. In general, as $x = n$, $n = 0, 1, \ldots, 10$,
$$\hat{p} = \frac{n}{10}$$
maximizes the likelihood function.
(b) Bayesian approach:
$\pi(p)$: prior density for $p$, i.e., prior beliefs in terms of probabilities of various possible values of $p$ being true. Let
$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{a-1}(1-p)^{b-1} \sim \text{Beta}(a, b).$$
Thus, if we know the gambler is a professional gambler, then we can use the beta density
$$\pi(p) = 2p \sim \text{Beta}(2, 1)$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,1) prior density over $p \in (0,1)$.]
Since a professional gambler is likely to win, higher probability is assigned to large values of $p$.
If we know the gambler is a gambler with bad luck, then we can use the beta density
$$\pi(p) = 2(1-p) \sim \text{Beta}(1, 2)$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,2) prior density over $p \in (0,1)$.]
Since a gambler with bad luck is likely to lose, higher probability is assigned to small values of $p$.
If we feel the winning probability is more likely to be around 0.5, then we can use the beta density
$$\pi(p) = 6p(1-p) \sim \text{Beta}(2, 2)$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,2) prior density over $p \in (0,1)$.]
If we don't have any information about the gambler, then we can use the beta density
$$\pi(p) = 1 \sim \text{Beta}(1, 1)$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,1) (uniform) prior density over $p \in (0,1)$.]
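As a quick illustration of these prior choices, the following sketch evaluates the four Beta densities on a grid of $p$ values. This is a minimal example assuming NumPy and SciPy are available; the labels in the dictionary are mine, not part of the notes.

```python
# Sketch: the four Beta priors considered for the gambler's winning probability p.
# Assumes scipy is installed; beta.pdf(p, a, b) is the Beta(a, b) density at p.
import numpy as np
from scipy.stats import beta

priors = {
    "professional gambler": (2, 1),   # Beta(2,1): mass pushed toward large p
    "unlucky gambler":      (1, 2),   # Beta(1,2): mass pushed toward small p
    "p likely near 0.5":    (2, 2),   # Beta(2,2): symmetric, peaked at 0.5
    "no prior information": (1, 1),   # Beta(1,1): uniform on (0,1)
}

p_grid = np.linspace(0.01, 0.99, 5)   # a coarse grid, purely for display
for label, (a, b) in priors.items():
    density = beta.pdf(p_grid, a, b)
    print(f"{label:22s} Beta({a},{b}):",
          " ".join(f"{d:5.2f}" for d in density))
```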
The posterior density of $p$ given $x$ (the conditional density of $p$ given $x$) is
$$f(p \mid x) = \frac{f(x, p) \;(\text{joint density of } x \text{ and } p)}{f(x) \;(\text{marginal density of } x)} = \frac{f(x \mid p)\,\pi(p)}{f(x)} \propto f(x \mid p)\,\pi(p) = l(p \mid x)\,\pi(p).$$
Thus, the posterior density of $p$ given $x$ is
$$f(p \mid x) \propto \pi(p)\, l(p \mid x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{a-1}(1-p)^{b-1} \binom{10}{x} p^{x}(1-p)^{10-x} = c(a, b, x)\, p^{x+a-1}(1-p)^{b+10-x-1}.$$
In fact,
$$f(p \mid x) = \frac{\Gamma(a+b+10)}{\Gamma(x+a)\,\Gamma(b+10-x)}\, p^{x+a-1}(1-p)^{b+10-x-1} \sim \text{Beta}(x+a,\; b+10-x).$$
Then, we can use some statistic based on the posterior density, for example, the posterior mean
$$E(p \mid x) = \int_0^1 p\, f(p \mid x)\, dp = \frac{x+a}{a+b+10}.$$
As $x = n$,
$$\hat{p} = E(p \mid n) = \frac{n+a}{a+b+10}$$
is different from the maximum likelihood estimate $n/10$.
Note:
$$f(p \mid x) \propto \pi(p) \times l(p \mid x),$$
i.e., (the original information about $p$) $\times$ (the information from the data) $=$ (the new information about $p$ given the data).

Properties of Bayesian Analysis:
1. Precise assumptions lead to consequent inferences.
2. Bayesian analysis automatically makes use of all the information in the data.
3. Unacceptable inferences must come from inappropriate assumptions, not from inadequacies of the inferential system.
4. Awkward problems encountered in sampling theory do not arise.
5. Bayesian inference provides a satisfactory way of explicitly introducing and keeping track of assumptions about prior knowledge or ignorance.

1.1 Introduction
Goal: statistical decision theory is concerned with making decisions in the presence of statistical knowledge which sheds light on some of the uncertainties involved in the decision problem.
3 Types of Information:
1. Sample information: the information from observations.
2. Decision information: the information about the possible consequences of the decisions, for example, the loss due to a wrong decision.
3. Prior information: the information about the parameter.

1.2 Basic Elements
$\theta$: parameter.
$\Theta$: parameter space consisting of all possible values of $\theta$.
$a$: decision or action (or some statistic used to estimate $\theta$).
$\mathcal{A}$: the set of all possible actions.
$L(\theta, a): \Theta \times \mathcal{A} \to R$: loss function. $L(\theta_1, a_1)$ is the loss when the parameter value is $\theta_1$ and the action $a_1$ is taken.
$X = (X_1, X_2, \ldots, X_n)$: $X_1, \ldots, X_n$ are independent observations from a common distribution.
$\mathcal{X}$: sample space (all the possible values of $X$, usually a subset of $R^n$).
$$P_\theta(X \in A) = \int_A dF^X(x \mid \theta) = \int_A f(x_1, \ldots, x_n \mid \theta)\, dx_1 \cdots dx_n \quad \left(\text{or } \sum_{x \in A} f(x \mid \theta) \text{ in the discrete case}\right),$$
where $F^X(x \mid \theta)$ is the cumulative distribution of $X$.
$$E_\theta[h(X)] = \int h(x)\, dF^X(x \mid \theta) = \int h(x)\, f(x \mid \theta)\, dx \quad \left(\text{or } \sum_x h(x)\, f(x \mid \theta)\right).$$

Example 2 (continued):
Let $A = \{1, 3, 5, 7, 9\}$. Then,
$$P_p(X \in A) = \sum_{x \in A} \binom{10}{x} p^x (1-p)^{10-x} = \binom{10}{1} p (1-p)^9 + \binom{10}{3} p^3 (1-p)^7 + \binom{10}{5} p^5 (1-p)^5 + \binom{10}{7} p^7 (1-p)^3 + \binom{10}{9} p^9 (1-p).$$
Let $\delta(X) = X/10 = a_1$, the estimate of $p$. Also, let
$$h(X) = L\left(p, \frac{X}{10}\right) = p - \frac{X}{10} \quad (\text{loss function}).$$
Then,
$$E_p[h(X)] = E_p\left[L\left(p, \frac{X}{10}\right)\right] = E_p\left[p - \frac{X}{10}\right] = p - \frac{E_p[X]}{10} = p - \frac{10p}{10} = 0.$$

Example 3:
Let $X \sim \text{Beta}(\theta, 1)$, i.e., $f(x \mid \theta) = \theta x^{\theta-1}$, $0 < x < 1$, and let $h(x) = x^2$. Then,
$$E_\theta[h(X)] = \int h(x)\, f(x \mid \theta)\, dx = \int_0^1 x^2\, \theta x^{\theta-1}\, dx = \left.\frac{\theta\, x^{\theta+2}}{\theta+2}\right|_0^1 = \frac{\theta}{\theta+2}.$$
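The expectations in Example 2 (continued) and Example 3 can be checked numerically. This is a minimal sketch assuming SciPy is available, and assuming the Beta(θ, 1) reading of Example 3; the values of $p$ and $\theta$ used below are arbitrary illustrative choices.

```python
# Sketch: numerical checks of the expectations from Examples 2 (continued) and 3.
import numpy as np
from scipy.stats import binom
from scipy.integrate import quad

# Example 2 (continued): P_p(X in A) for A = {1,3,5,7,9} and E_p[p - X/10] = 0.
A = [1, 3, 5, 7, 9]
for p in (0.2, 0.5, 0.8):                                  # illustrative values of p
    prob_A = sum(binom.pmf(x, 10, p) for x in A)           # P_p(X in A)
    exp_loss = sum((p - x / 10) * binom.pmf(x, 10, p)      # E_p[L(p, X/10)]
                   for x in range(11))
    print(f"p = {p}: P_p(X in A) = {prob_A:.4f}, E_p[p - X/10] = {exp_loss:.1e}")

# Example 3 (assumed Beta(theta, 1) model): E_theta[X^2] with f(x|theta) = theta * x^(theta-1).
theta = 3.0                                                # illustrative value of theta
val, _ = quad(lambda x: x**2 * theta * x**(theta - 1), 0, 1)
print(f"numerical E[X^2] = {val:.4f}, closed form theta/(theta+2) = {theta/(theta+2):.4f}")
```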
Example 4:
$a_1$: sell the stock.  $a_2$: keep the stock.
$\theta_1$: the stock price will go down.  $\theta_2$: the stock price will go up.
Let
$$L(\theta_1, a_1) = -500, \quad L(\theta_1, a_2) = 300, \quad L(\theta_2, a_1) = 1000, \quad L(\theta_2, a_2) = -300.$$
The above loss function can be summarized by

            $\theta_1$    $\theta_2$
   $a_1$      -500          1000
   $a_2$       300          -300

Note that there is no sample information from an associated statistical experiment in this example. We call such a problem a no-data problem.

1.3 Expected Loss, Decision Rules, and Risk
Motivation: in the previous section, we introduced the loss of making a decision (taking an action). In this section, we consider the "expected" loss of making a decision. Two types of expected loss are considered: the Bayesian expected loss and the frequentist risk.

(a) Bayesian Expected Loss:
Definition: the Bayesian expected loss of an action $a$ is
$$\rho(\pi, a) = E^\pi[L(\theta, a)] = \int L(\theta, a)\, dF^\pi(\theta) = \int L(\theta, a)\, \pi(\theta)\, d\theta,$$
where $\pi$ and $F^\pi$ are the prior density and the prior cumulative distribution of $\theta$, respectively.

Example 4 (continued):
Let $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Then,
$$\rho(\pi, a_1) = E^\pi[L(\theta, a_1)] = \pi(\theta_1) L(\theta_1, a_1) + \pi(\theta_2) L(\theta_2, a_1) = 0.99(-500) + 0.01(1000) = -485$$
and
$$\rho(\pi, a_2) = E^\pi[L(\theta, a_2)] = \pi(\theta_1) L(\theta_1, a_2) + \pi(\theta_2) L(\theta_2, a_2) = 0.99(300) + 0.01(-300) = 294.$$

(b) Frequentist Risk:
Definition: a (nonrandomized) decision rule $\delta$ is a function from $\mathcal{X}$ into $\mathcal{A}$. If $X = x_0$ is observed, then $\delta(x_0)$ is the action that will be taken. Two decision rules $\delta_1$ and $\delta_2$ are considered equivalent if $P_\theta(\delta_1(X) = \delta_2(X)) = 1$ for every $\theta$.
Definition: the risk function of a decision rule $\delta$ is defined by
$$R(\theta, \delta) = E_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x))\, dF^X(x \mid \theta) = \int L(\theta, \delta(x))\, f(x \mid \theta)\, dx.$$
Definition: if $R(\theta, \delta_1) \leq R(\theta, \delta_2)$ for all $\theta \in \Theta$, with strict inequality for some $\theta$, then the decision rule $\delta_1$ is R-better than the decision rule $\delta_2$. A decision rule is admissible if there exists no R-better decision rule. On the other hand, a decision rule is inadmissible if there does exist an R-better decision rule.
Note: a rule $\delta_1$ is R-equivalent to $\delta_2$ if $R(\theta, \delta_1) = R(\theta, \delta_2)$ for all $\theta$.

Example 4 (continued):
$$R(\theta_1, a_1) = L(\theta_1, a_1) = -500 < 300 = L(\theta_1, a_2) = R(\theta_1, a_2)$$
and
$$R(\theta_2, a_1) = L(\theta_2, a_1) = 1000 > -300 = L(\theta_2, a_2) = R(\theta_2, a_2).$$
Therefore, both $a_1$ and $a_2$ are admissible.

Example 5:
Let
$$X \sim N(\theta, 1), \qquad L(\theta, a) = (\theta - a)^2, \qquad \delta_1(X) = X, \qquad \delta_2(X) = \frac{X}{2}.$$
Note that $E_\theta[X] = \theta$ and $\mathrm{Var}_\theta(X) = 1$. Then,
$$R(\theta, \delta_1) = E_\theta[L(\theta, \delta_1(X))] = E_\theta[(\theta - X)^2] = \mathrm{Var}_\theta(X) = 1$$
and
$$R(\theta, \delta_2) = E_\theta\left[L\left(\theta, \frac{X}{2}\right)\right] = E_\theta\left[\left(\frac{\theta}{2} + \frac{\theta - X}{2}\right)^2\right] = \frac{\theta^2}{4} + \frac{\mathrm{Var}_\theta(X)}{4} = \frac{1}{4} + \frac{\theta^2}{4},$$
since the cross term $E_\theta\left[\frac{\theta}{2}\cdot\frac{\theta - X}{2}\right] = \frac{\theta}{4} E_\theta[\theta - X] = 0$.

Definition: the Bayes risk of a decision rule $\delta$ with respect to a prior distribution $\pi$ on $\Theta$ is defined as
$$r(\pi, \delta) = E^\pi[R(\theta, \delta)] = \int R(\theta, \delta)\, \pi(\theta)\, d\theta.$$

Example 5 (continued):
Let
$$\theta \sim N(0, \tau^2), \qquad \pi(\theta) = \frac{1}{\sqrt{2\pi}\,\tau}\, e^{-\theta^2/(2\tau^2)}.$$
Then,
$$r(\pi, \delta_1) = E^\pi[R(\theta, \delta_1)] = E^\pi[1] = 1$$
and
$$r(\pi, \delta_2) = E^\pi[R(\theta, \delta_2)] = E^\pi\left[\frac{1}{4} + \frac{\theta^2}{4}\right] = \frac{1}{4} + \frac{E^\pi[\theta^2]}{4} = \frac{1}{4} + \frac{\mathrm{Var}^\pi(\theta)}{4} = \frac{1}{4} + \frac{\tau^2}{4}.$$

1.4 Decision Principles
The principles used to select a sensible decision are:
(a) the Conditional Bayes Decision Principle;
(b) the Frequentist Decision Principles.

(a) Conditional Bayes Decision Principle:
Choose an action $a \in \mathcal{A}$ minimizing $\rho(\pi, a)$. Such an $a$ will be called a Bayes action and will be denoted $a^\pi$.

Example 4 (continued):
Let $\mathcal{A} = \{a_1, a_2\}$, $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Thus,
$$\rho(\pi, a_1) = -485, \qquad \rho(\pi, a_2) = 294.$$
Therefore, $a^\pi = a_1$.

(b) Frequentist Decision Principles:
The 3 most important frequentist decision principles are the Bayes risk principle, the minimax principle, and the invariance principle.

(1) Bayes Risk Principle:
Let $D$ be a class of decision rules. Then, for $\delta_1, \delta_2 \in D$, a decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the Bayes risk principle if
$$r(\pi, \delta_1) < r(\pi, \delta_2).$$
A decision rule minimizing $r(\pi, \delta)$ among all decision rules in the class $D$ is called a Bayes rule and will be denoted $\delta^\pi$. The quantity $r(\pi) = r(\pi, \delta^\pi)$ is called the Bayes risk for $\pi$.

Example 5 (continued):
$$X \sim N(\theta, 1), \qquad \theta \sim N(0, \tau^2), \qquad D = \{\delta_c(x) = cx : c \text{ is any constant}\}.$$
Let $\delta_c(X) = cX$. Then,
$$R(\theta, \delta_c) = E_\theta[(\theta - cX)^2] = E_\theta\left[\left((1-c)\theta + c(\theta - X)\right)^2\right] = (1-c)^2\theta^2 + c^2\,\mathrm{Var}_\theta(X) = c^2 + (1-c)^2\theta^2$$
and
$$r(\pi, \delta_c) = E^\pi[R(\theta, \delta_c)] = E^\pi\left[c^2 + (1-c)^2\theta^2\right] = c^2 + (1-c)^2\tau^2.$$
Note that $r(\pi, \delta_c)$ is a function of $c$. Setting
$$f(c) = c^2 + (1-c)^2\tau^2, \qquad f'(c) = 2c - 2(1-c)\tau^2 = 0 \;\Rightarrow\; c = \frac{\tau^2}{1+\tau^2},$$
$r(\pi, \delta_c)$ attains its minimum at $c = \tau^2/(1+\tau^2)$. Thus,
$$\delta^\pi(X) = \frac{\tau^2}{1+\tau^2}\, X$$
is the Bayes estimator (within the class $D$). In addition,
$$r(\pi) = r(\pi, \delta^\pi) = \left(\frac{\tau^2}{1+\tau^2}\right)^2 + \left(\frac{1}{1+\tau^2}\right)^2 \tau^2 = \frac{\tau^4 + \tau^2}{(1+\tau^2)^2} = \frac{\tau^2}{1+\tau^2}.$$
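The minimizing constant $c = \tau^2/(1+\tau^2)$ can also be checked by simulation. This is a minimal Monte Carlo sketch assuming NumPy; $\tau^2 = 2$ is an arbitrary illustrative value, for which the theoretical answer is $c = 2/3$ and $r(\pi) = 2/3$.

```python
# Sketch: Monte Carlo check that r(pi, delta_c) = c^2 + (1-c)^2 * tau^2
# is minimized at c = tau^2 / (1 + tau^2) in the normal-normal model.
import numpy as np

rng = np.random.default_rng(0)
tau2 = 2.0                                      # illustrative prior variance tau^2
n_sim = 200_000

theta = rng.normal(0.0, np.sqrt(tau2), n_sim)   # theta ~ N(0, tau^2)
x = rng.normal(theta, 1.0)                      # X | theta ~ N(theta, 1)

c_grid = np.linspace(0.0, 1.0, 101)
bayes_risk = [np.mean((theta - c * x) ** 2) for c in c_grid]   # estimates r(pi, delta_c)

c_best = c_grid[int(np.argmin(bayes_risk))]
print("simulated   c* =", round(c_best, 2))
print("theoretical c* =", round(tau2 / (1 + tau2), 2))          # = 2/3
print("theoretical Bayes risk r(pi) =", round(tau2 / (1 + tau2), 3))
```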
Example 4 (continued):
Let $D = \{a_1, a_2\}$, $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Then,
$$r(\pi, a_1) = E^\pi[R(\theta, a_1)] = E^\pi[L(\theta, a_1)] = -485$$
and
$$r(\pi, a_2) = E^\pi[R(\theta, a_2)] = E^\pi[L(\theta, a_2)] = 294.$$
Thus, $a_1$ is the Bayes rule.
Note: in a no-data problem, the Bayes risk (frequentist principle) is equivalent to the Bayesian expected loss (conditional Bayes principle). Further, the Bayes risk principle will give the same answer as the conditional Bayes decision principle.

Definition: let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$, with prior density $\pi(\theta)$ and prior cumulative distribution $F^\pi(\theta)$. Then, the marginal density or distribution of $X = (X_1, X_2, \ldots, X_n)$ is
$$m(x) = m(x_1, x_2, \ldots, x_n) = \int f(x \mid \theta)\, dF^\pi(\theta) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta \quad \left(\text{or } \sum_\theta f(x \mid \theta)\, \pi(\theta)\right).$$
The posterior density or distribution of $\theta$ given $x$ is
$$f(\theta \mid x) = f(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x \mid \theta)\, \pi(\theta)}{m(x)}.$$
The posterior expectation of $g(\theta)$ given $x$ is
$$E^{f(\theta \mid x)}[g(\theta)] = \int g(\theta)\, f(\theta \mid x)\, d\theta = \frac{\int g(\theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{m(x)} \quad \left(\text{or } \frac{\sum_\theta g(\theta)\, f(x \mid \theta)\, \pi(\theta)}{m(x)}\right).$$

Very Important Result:
Let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$, with prior density $\pi(\theta)$ and prior cumulative distribution $F^\pi(\theta)$. Suppose the following two assumptions hold:
(a) there exists an estimator $\delta_0$ with finite Bayes risk;
(b) for almost all $x$, there exists a value $\delta^\pi(x)$ minimizing
$$E^{f(\theta \mid x)}\left[L(\theta, \delta^\pi(x))\right] = \int L(\theta, \delta^\pi(x))\, f(\theta \mid x)\, d\theta.$$
Then:
(a) if $L(\theta, a) = (a - g(\theta))^2$, then
$$\delta^\pi(x) = E^{f(\theta \mid x)}[g(\theta)] = \int g(\theta)\, f(\theta \mid x)\, d\theta,$$
and, more generally, if $L(\theta, a) = w(\theta)(a - g(\theta))^2$, then
$$\delta^\pi(x) = \frac{E^{f(\theta \mid x)}[w(\theta)\, g(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]} = \frac{\int w(\theta)\, g(\theta)\, f(\theta \mid x)\, d\theta}{\int w(\theta)\, f(\theta \mid x)\, d\theta};$$
(b) if $L(\theta, a) = |\theta - a|$, then $\delta^\pi(x)$ is the median of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$. Further, if
$$L(\theta, a) = \begin{cases} k_0\,(\theta - a), & \theta - a \geq 0 \\ k_1\,(a - \theta), & \theta - a < 0, \end{cases}$$
then $\delta^\pi(x)$ is the $k_0/(k_0 + k_1)$ percentile of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$;
(c) if
$$L(\theta, a) = \begin{cases} 0, & |\theta - a| \leq c \\ 1, & |\theta - a| > c, \end{cases}$$
then $\delta^\pi(x)$ is the midpoint of the interval $I$ of length $2c$ which maximizes
$$P(\theta \in I \mid x) = \int_I f(\theta \mid x)\, d\theta.$$

[Outline of proof]
(a) For the weighted squared-error loss,
$$\rho(a) = E^{f(\theta \mid x)}[L(\theta, a)] = E^{f(\theta \mid x)}\left[w(\theta)(a - g(\theta))^2\right] = a^2\, E^{f(\theta \mid x)}[w(\theta)] - 2a\, E^{f(\theta \mid x)}[w(\theta) g(\theta)] + E^{f(\theta \mid x)}[w(\theta) g(\theta)^2].$$
Thus,
$$\frac{\partial \rho(a)}{\partial a} = 2a\, E^{f(\theta \mid x)}[w(\theta)] - 2\, E^{f(\theta \mid x)}[w(\theta) g(\theta)] = 0 \;\Rightarrow\; a = \frac{E^{f(\theta \mid x)}[w(\theta) g(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]}.$$
(b) Without loss of generality, assume $m$ is the median of $f(\theta \mid x)$ and $a > m$. We want to prove
$$\rho(m) - \rho(a) = E^{f(\theta \mid x)}\left[L(\theta, m) - L(\theta, a)\right] \leq 0.$$
Since
$$L(\theta, m) - L(\theta, a) = |\theta - m| - |\theta - a| = \begin{cases} m - a, & \theta \leq m \\ 2\theta - m - a \;\leq\; a - m, & m < \theta \leq a \\ a - m, & \theta > a, \end{cases}$$
then
$$E^{f(\theta \mid x)}\left[L(\theta, m) - L(\theta, a)\right] \leq (m - a)\, P(\theta \leq m \mid x) + (a - m)\, P(\theta > m \mid x) = \frac{m - a}{2} + \frac{a - m}{2} = 0.$$
The case $a < m$ is analogous.

[Intuition of the above proof:]
Consider three points $a_1 < a_2 < a_3$ and look for the point $c$ minimizing $\sum_{i=1}^{3} |c - a_i|$.
As $c = a_2$, $\sum_{i=1}^{3} |c - a_i| = (a_2 - a_1) + 0 + (a_3 - a_2) = a_3 - a_1$.
As $c = c_1 < a_1$, $\sum_{i=1}^{3} |c_1 - a_i| > a_3 - a_1$.
As $c = c_2$ with $a_1 < c_2 < a_3$, $\sum_{i=1}^{3} |c_2 - a_i| = (a_3 - a_1) + |c_2 - a_2| \geq a_3 - a_1$.
As $c = c_3 > a_3$, $\sum_{i=1}^{3} |c_3 - a_i| > a_3 - a_1$.
Therefore, the sum achieves its minimum at $c = a_2$, the median of the three points.
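Parts (a) and (b) of the result can be illustrated numerically with the Beta posterior of Example 2: under squared-error loss the posterior expected loss is minimized at the posterior mean, and under absolute-error loss at the posterior median. This is a minimal sketch assuming SciPy; the prior Beta(1,1) and the observation x = 7 are arbitrary illustrative choices, and the grid search only locates the minimizers approximately.

```python
# Sketch: for the posterior Beta(x + a, b + 10 - x), check that the posterior mean
# minimizes E[(a - theta)^2 | x] and the posterior median minimizes E[|a - theta| | x].
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

a0, b0, x = 1, 1, 7                        # illustrative prior Beta(1,1), observed x = 7 wins
post = beta(x + a0, b0 + 10 - x)           # posterior Beta(x + a, b + 10 - x)

def post_exp_loss(action, loss):
    # E[ loss(action, theta) | x ] under the posterior density
    val, _ = quad(lambda t: loss(action, t) * post.pdf(t), 0, 1)
    return val

grid = np.linspace(0.01, 0.99, 99)
sq = [post_exp_loss(p, lambda p_, t: (p_ - t) ** 2) for p in grid]
ab = [post_exp_loss(p, lambda p_, t: abs(p_ - t)) for p in grid]

print("argmin of squared-error posterior loss :", grid[int(np.argmin(sq))])
print("posterior mean (x + a)/(a + b + 10)    :", post.mean())      # 8/12
print("argmin of absolute-error posterior loss:", grid[int(np.argmin(ab))])
print("posterior median                       :", post.median())
```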
(2) Minimax Principle:
A decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the minimax principle if
$$\sup_\theta R(\theta, \delta_1) < \sup_\theta R(\theta, \delta_2).$$
A decision rule minimizing $\sup_\theta R(\theta, \delta)$ among all decision rules in the class $D$ is called a minimax decision rule $\delta^M$, i.e.,
$$\sup_\theta R(\theta, \delta^M) = \inf_{\delta \in D} \sup_\theta R(\theta, \delta).$$

Example 5 (continued):
$D = \{\delta_c(x) = cx : c \text{ is any constant}\}$ and $R(\theta, \delta_c) = c^2 + (1-c)^2\theta^2$. Thus,
$$\sup_\theta R(\theta, \delta_c) = \sup_\theta \left[c^2 + (1-c)^2\theta^2\right] = \begin{cases} 1, & c = 1 \\ \infty, & c \neq 1. \end{cases}$$
Therefore, $\delta^M(X) = \delta_1(X) = X$ is the minimax decision rule.

Example 4 (continued):
$D = \{a_1, a_2\}$. Then,
$$\sup_\theta R(\theta, a_1) = \sup_\theta L(\theta, a_1) = 1000 \qquad \text{and} \qquad \sup_\theta R(\theta, a_2) = \sup_\theta L(\theta, a_2) = 300.$$
Thus, $\delta^M = a_2$.

(3) Invariance Principle:
If two problems have identical formal structures (i.e., the same sample space, parameter space, density, and loss function), the same decision rule should be obtained based on the invariance principle.

Example 6:
$X$: the decay time of a certain atomic particle (in seconds). Let $X$ be exponentially distributed with mean $\theta$,
$$f(x \mid \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad 0 < x < \infty.$$
Suppose we want to estimate $\theta$. A sensible loss function is
$$L(\theta, a) = \left(1 - \frac{a}{\theta}\right)^2 = \frac{(\theta - a)^2}{\theta^2}.$$
Suppose $Y$ is the decay time of the same atomic particle measured in minutes. Then,
$$Y = \frac{X}{60}, \qquad f(y \mid \beta) = \frac{1}{\beta}\, e^{-y/\beta}, \quad 0 < y < \infty, \qquad \beta = \frac{\theta}{60}.$$
Thus,
$$L^*(\beta, a^*) = \left(1 - \frac{a^*}{\beta}\right)^2 = \left(1 - \frac{60\, a^*}{\theta}\right)^2 = \left(1 - \frac{a}{\theta}\right)^2 = L(\theta, a), \qquad \text{where } a = 60\, a^*.$$
Let $\delta(X)$ be the decision rule used to estimate $\theta$ and $\delta^*(Y)$ the decision rule used to estimate $\beta$. Since the two problems have identical formal structures, the invariance principle requires $\delta^* = \delta$, and therefore
$$\delta(X) = 60\, \delta^*(Y) = 60\, \delta\!\left(\frac{X}{60}\right).$$
The above argument holds for any transformation of the form $Y = cX$, $c > 0$, based on the invariance principle. Then,
$$\delta(X) = \frac{1}{c}\, \delta(cX) \text{ for every } c > 0 \;\Rightarrow\; \delta(X) = X\, \delta(1) = KX, \quad K = \delta(1) \quad \left(\text{take } c = \tfrac{1}{X}\right).$$
Thus, $\delta(X) = KX$ is the form of the decision rule based on the invariance principle.

1.5 Foundations
Several fundamental issues are discussed in this section:
(a) Misuse of Classical Inference Procedures
(b) The Frequentist Perspective
(c) The Conditional Perspective
(d) The Likelihood Principle
(e) Choosing a Decision Principle

(a) Misuse of Classical Inference Procedures:
Example 7:
Let $X_1, X_2, \ldots, X_n \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta \neq 0,$$
the rejection rule is
$$\sqrt{n}\, |\bar{x}| \geq 1.96, \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \text{as } \alpha = 0.05.$$
Assume the true mean is $\theta = 10^{-10}$, $n = 10^{24}$, and $\bar{X} \sim N(10^{-10}, 10^{-24})$. Suppose $\bar{x} = 10^{-11}$. Then,
$$\sqrt{n}\, |\bar{x}| = \sqrt{10^{24}} \times 10^{-11} = 10 \geq 1.96$$
and we reject $H_0$. Intuitively, $\bar{x} = 10^{-11}$ seems to strongly indicate that $H_0$ should be true. However, for a large sample size, even when $\bar{x}$ is very close to 0, the classical inference method still indicates rejection of $H_0$. The above result seems to contradict intuition.
Note: it might be more sensible to test, for example,
$$H_0: |\theta| \leq 10^{-3} \quad \text{v.s.} \quad H_1: |\theta| > 10^{-3}.$$

Example 8:
Let $X_1, X_2, \ldots, X_{100} \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta \leq 0 \quad \text{v.s.} \quad H_1: \theta > 0,$$
the rejection rule is
$$\sqrt{n}\, \bar{x} = 10\, \bar{x} \geq 1.645, \qquad \bar{x} = \frac{\sum_{i=1}^{100} x_i}{100}, \qquad \text{as } \alpha = 0.05.$$
If $\bar{x} = 0.164$, then $10\, \bar{x} = 1.64 < 1.645$ and we do not reject $H_0$. However, as $\alpha = 0.051$,
$$p\text{-value} = P(\bar{X} \geq \bar{x}) = P(Z \geq 1.64) = 0.0505 < 0.051,$$
and we then reject $H_0$.

(b) The Frequentist Perspective:
Example 9:
Let $X_1, X_2, \ldots, X_{100} \sim N(\theta, 1)$. In the classical inference problem
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta = 1,$$
the rejection rule is
$$\sqrt{n}\, \bar{x} = 10\, \bar{x} \geq 1.645, \qquad \bar{x} = \frac{\sum_{i=1}^{100} x_i}{100}, \qquad \text{as } \alpha = 0.05.$$
By employing the above rejection rule, about 5% of all rejections of the null hypothesis will actually be in error when $H_0$ is true. However, suppose the parameter values $\theta = 0$ and $\theta = 1$ occur equally often in repetitive use of the test; then the chance of $H_0$ being true is 0.5. Therefore, correctly speaking, the 5% error rate is only correct for the 50% of repetitive uses in which $H_0$ is true. That is, one cannot make a useful statement about the actual error rate incurred in repetitive use without knowing $R(\theta, \delta)$ for all $\theta$.

(c) The Conditional Perspective:
Example 10:
Frequentist viewpoint: $X_1, X_2$ are independent with identical distribution,
$$P_\theta(X_i = \theta + 1) = P_\theta(X_i = \theta - 1) = \frac{1}{2}.$$
Then,
$$\delta(X_1, X_2) = \begin{cases} \dfrac{X_1 + X_2}{2}, & X_1 \neq X_2 \\ X_1 - 1, & X_1 = X_2 \end{cases}$$
can be used to estimate $\theta$. In addition,
$$P_\theta\left(\delta(X_1, X_2) = \theta\right) = P_\theta(X_1 = \theta+1, X_2 = \theta-1 \text{ or } X_1 = \theta-1, X_2 = \theta+1) + P_\theta(X_1 = \theta+1, X_2 = \theta+1) = \frac{1}{2} + \frac{1}{4} = 0.75.$$
Thus, a frequentist claims a 75% confidence procedure.
Conditional viewpoint: given $X_1 \neq X_2$, $\delta(X_1, X_2)$ is 100% certain to estimate $\theta$ correctly, i.e.,
$$P_\theta\left(\delta(X_1, X_2) = \theta \mid X_1 \neq X_2\right) = 1.$$
Given $X_1 = X_2$, $\delta(X_1, X_2)$ is 50% certain to estimate $\theta$ correctly, i.e.,
$$P_\theta\left(\delta(X_1, X_2) = \theta \mid X_1 = X_2\right) = 0.5.$$

Example 11:

   $x$     $f(x \mid \theta_0)$    $f(x \mid \theta_1)$    $f(x \mid \theta_1)/f(x \mid \theta_0)$
    1            0.005                  0.0051                        1.02
    2            0.005                  0.9849                      196.98
    3            0.99                   0.01                          0.01

$X = 1$: some index (today) indicating the stock (tomorrow) will neither go up nor go down.
$X = 2$: some index (today) indicating the stock (tomorrow) will go up.
$X = 3$: some index (today) indicating the stock (tomorrow) will go down.
$\theta_0$: the stock (tomorrow) will not go up.
$\theta_1$: the stock (tomorrow) will go up.
Frequentist viewpoint: to test
$$H_0: \theta = \theta_0 \quad \text{v.s.} \quad H_1: \theta = \theta_1$$
by the most powerful test with $\alpha = 0.01$, we reject $H_0$ as $X \in \{1, 2\}$, since
$$P(X \in \{1, 2\} \mid \theta_0) = 0.005 + 0.005 = 0.01.$$
Thus, as $X = 1$, we reject $H_0$ and conclude the stock will go up. This conclusion might not be very convincing, since the index does not indicate a rise of the stock.
Conditional viewpoint: as $X = 1$,
$$\frac{f(1 \mid \theta_1)}{f(1 \mid \theta_0)} = 1.02,$$
so $f(1 \mid \theta_1)$ and $f(1 \mid \theta_0)$ are very close to each other. Therefore, based on the conditional viewpoint, there is only about a 50% chance that the stock will go up tomorrow.
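The contrast in Example 11 can be reproduced directly from the table. This is a minimal sketch in plain Python (the variable names are mine): it ranks the sample points by likelihood ratio to recover the level-0.01 most powerful rejection region, and then computes the posterior probability of θ₁ given X = 1 under equal prior weights 1/2 and 1/2, an assumption added here purely for illustration.

```python
# Sketch: Example 11 -- the most powerful test vs. the conditional (posterior) view.
f0 = {1: 0.005, 2: 0.005, 3: 0.99}     # f(x | theta_0)
f1 = {1: 0.0051, 2: 0.9849, 3: 0.01}   # f(x | theta_1)

# Likelihood ratios f(x | theta_1) / f(x | theta_0), as in the table.
for x in (1, 2, 3):
    print(f"x = {x}: likelihood ratio = {f1[x] / f0[x]:.2f}")

# Neyman-Pearson: reject H0 for the largest ratios until level alpha = 0.01 is used up.
alpha, region, size = 0.01, [], 0.0
for x in sorted(f0, key=lambda x: f1[x] / f0[x], reverse=True):
    if size + f0[x] <= alpha + 1e-12:
        region.append(x)
        size += f0[x]
print("MP rejection region at level 0.01:", sorted(region))    # {1, 2}

# Conditional view: P(theta_1 | X = 1) under assumed equal prior weights 1/2, 1/2.
x = 1
post1 = 0.5 * f1[x] / (0.5 * f1[x] + 0.5 * f0[x])
print(f"P(theta_1 | X = 1) = {post1:.3f}")    # about 0.505, i.e. essentially 50-50
```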
Example 12:
Suppose there are two laboratories, one in Kaohsiung and the other in Taichung. We flip a coin to decide the laboratory at which we will perform an experiment:
Head: Kaohsiung;  Tail: Taichung.
Assume the coin comes up tails. Then, the laboratory in Taichung should be used.
Question: do we need to perform another experiment in Kaohsiung in order to develop the report?
Frequentist viewpoint: we have to call for averaging over all possible data, including data that could have been obtained in Kaohsiung.
Conditional viewpoint: we don't need to perform another experiment in Kaohsiung. We can make statistical inferences based on the data we have now.

The Weak Conditionality Principle:
Two experiments $E_1$ or $E_2$ can be performed to draw information about $\theta$. Then, the actual information about $\theta$ should depend only on the experiment $E_j$ ($j = 1$ or $2$) that is actually performed.

(d) The Likelihood Principle:
Definition: for observed data $x$, the function $l(\theta) = f(x \mid \theta)$, considered as a function of $\theta$, is called the likelihood function.
Likelihood Principle: all relevant experimental information is contained in the likelihood function for the observed $x$. Two likelihood functions contain the same information about $\theta$ if they are proportional to each other.

Example 13:
$\theta$: the probability that a coin comes up heads. Suppose we want to know whether the coin is fair, i.e.,
$$H_0: \theta = \frac{1}{2} \quad \text{v.s.} \quad H_1: \theta > \frac{1}{2},$$
with $\alpha = 0.05$. We flip the coin in a series of trials and obtain 9 heads and 3 tails. Let $X$ be the number of heads. Two likelihood functions can be used:
1. Binomial: the number of flips $n = 12$ is fixed in advance, so $X \sim B(12, \theta)$ and
$$l_1(\theta) = f_1(x \mid \theta) = \binom{12}{x} \theta^x (1-\theta)^{12-x}.$$
In this example,
$$l_1(\theta) = f_1(9 \mid \theta) = \binom{12}{9} \theta^9 (1-\theta)^3 = 220\, \theta^9 (1-\theta)^3.$$
2. Negative Binomial: we flip the coin until $n$ tails come up and let $X$ be the number of heads obtained, so $X \sim NB(n, \theta)$ and
$$l_2(\theta) = f_2(x \mid \theta) = \binom{n + x - 1}{x} \theta^x (1-\theta)^{n}.$$
In this example, we flip the coin until 3 tails come up; therefore $n = 3$, $x = 9$ and
$$l_2(\theta) = f_2(9 \mid \theta) = \binom{11}{9} \theta^9 (1-\theta)^3 = 55\, \theta^9 (1-\theta)^3.$$
By the likelihood principle, $l_1$ and $l_2$ contain the same information about $\theta$, since they are proportional to each other. Thus, intuitively, the same conclusion should be reached under either likelihood. However, classical statistical inference leads to different conclusions, which is bizarre from the likelihood point of view (a numerical check of the two p-values is sketched at the end of this section).
1. Binomial: the rejection rule is $X \geq c$ for some constant $c$. Thus, in this example,
$$p\text{-value} = P\left(X \geq 9 \,\Big|\, \theta = \tfrac{1}{2}\right) = f_1\!\left(9 \mid \tfrac{1}{2}\right) + f_1\!\left(10 \mid \tfrac{1}{2}\right) + f_1\!\left(11 \mid \tfrac{1}{2}\right) + f_1\!\left(12 \mid \tfrac{1}{2}\right) \approx 0.073 > 0.05.$$
Thus, we do not reject $H_0$ and conclude the coin is fair.
2. Negative Binomial: the rejection rule is $X \geq c$ for some constant $c$. Thus, in this example,
$$p\text{-value} = P\left(X \geq 9 \,\Big|\, \theta = \tfrac{1}{2}\right) = \sum_{x=9}^{\infty} f_2\!\left(x \mid \tfrac{1}{2}\right) \approx 0.0327 < 0.05.$$
Thus, we reject $H_0$ and conclude the coin is not fair.

1.6 Choosing a Decision Principle
The "robust" Bayesian paradigm, which takes uncertainty in the prior into account, is the fundamentally correct paradigm.
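As the numerical check referred to in Example 13, the following sketch computes the two p-values for the same data (9 heads, 3 tails) under the binomial and negative binomial models. It is a minimal sketch assuming SciPy, and it relies on SciPy's negative binomial parameterization, with heads counted as "failures" before the 3rd tail.

```python
# Sketch: Example 13 -- the same data (9 heads, 3 tails), two different p-values.
from scipy.stats import binom, nbinom

theta0 = 0.5    # the null value theta = 1/2

# Binomial model: X = number of heads in n = 12 tosses; p-value = P(X >= 9 | theta0).
p_binom = binom.sf(8, 12, theta0)           # sf(8) = P(X >= 9)
print(f"binomial p-value          = {p_binom:.4f}")    # about 0.073 > 0.05

# Negative binomial model: toss until 3 tails; X = number of heads before the 3rd tail.
# scipy's nbinom counts failures before the n-th success, so tails are the
# "successes" with probability 1 - theta0, and heads are the counted failures.
p_nbinom = nbinom.sf(8, 3, 1 - theta0)      # P(X >= 9)
print(f"negative binomial p-value = {p_nbinom:.4f}")   # about 0.0327 < 0.05
```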