Likelihood, Bayesian and Decision Theory

History (Kenneth Yu)
• The likelihood principle was first introduced by R. A. Fisher in 1922. The law of likelihood was identified by Ian Hacking.
• "Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves… the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue." (R. A. Fisher)

Likelihood Principle
• All relevant information in the data is contained in the likelihood function L(θ | x) = P(X = x | θ).

Law of Likelihood
• The extent to which the evidence supports one parameter value over another can be measured by the ratio of their likelihoods.
• Together, these two concepts allow us to use the likelihood for inference on θ.

Motivation and Applications
• Likelihood (especially MLE) is used in a range of statistical models (structural equation modeling, confirmatory factor analysis, linear models, etc.) to make inferences about the parameters of a function. Its importance comes from the need to find the "best" parameter value in the presence of error.
• This approach uses only the evidence and disregards the prior probability of the hypothesis. By making inferences about unknown parameters from our past observations, we can estimate the true value of θ for the population.

• The likelihood is a function of the form L(θ | x) ∈ { α P(x | θ) : α > 0 }.
• It represents how "likely" θ is given the observed outcomes x; up to a constant, it equals the probability of x occurring given the parameter θ.
• Likelihood functions are equivalent if they differ only by a positive constant α (i.e., they are proportional). Inferences about the parameter θ are the same when based on equivalent likelihood functions.

Maximum Likelihood Method (Hanchao)

Main topics:
1. Why use the maximum likelihood method?
2. The likelihood function
3. Maximum likelihood estimators
4. How to calculate the MLE
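As a minimal sketch of the definitions above (not from the original slides; the coin-toss data and the grid are invented for illustration), the following Python snippet evaluates a binomial likelihood over a grid of parameter values and checks that rescaling the likelihood by a positive constant α leaves every likelihood ratio, and hence the inference, unchanged:

```python
import math

def likelihood(theta, heads, n):
    """Binomial likelihood L(theta | x): probability of the observed data given theta."""
    return math.comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

# Illustrative data: 7 heads in 10 tosses; compare candidate theta values.
grid = [i / 100 for i in range(1, 100)]
L = [likelihood(t, 7, 10) for t in grid]
best = grid[max(range(len(L)), key=L.__getitem__)]
print(best)  # 0.7, the value best supported by the evidence

# Law of likelihood: support for one theta over another is their ratio.
# Scaling L by any alpha > 0 leaves every ratio (and so every inference) unchanged.
ratio = likelihood(0.7, 7, 10) / likelihood(0.5, 7, 10)
scaled = (3.0 * likelihood(0.7, 7, 10)) / (3.0 * likelihood(0.5, 7, 10))
print(abs(ratio - scaled) < 1e-9)  # True: equivalent likelihood functions
```

This is only a numerical illustration of the likelihood principle; the slides that follow develop the analytic machinery.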
1. Why use the maximum likelihood method?

How does it differ from the method of moments?
• Mostly, they agree! However, the method of maximum likelihood does yield "good" estimators:
1. It is an after-the-fact calculation (it conditions on the observed data).
2. It is a more versatile method for fitting parametric statistical models to data.
3. It is well suited to large data samples.

2. The likelihood function

• Definition: let f(x1, …, xn; θ), θ ∈ Θ ⊆ R^k, be the joint probability (or density) function of the n random variables X1, …, Xn with sample values x1, …, xn. The likelihood function of the sample is given by
L(θ | x1, …, xn) = f(x1, …, xn; θ)

• If X1, …, Xn are discrete iid random variables with probability function p(x, θ), then the likelihood function is given by
L(θ) = P(X1 = x1, …, Xn = xn) = ∏_{i=1}^n P(Xi = xi) = ∏_{i=1}^n p(xi, θ)

• In the continuous case, if the density is f(x, θ), then the likelihood function is given by
L(θ) = ∏_{i=1}^n f(xi, θ)

• Example: let X1, …, Xn be N(μ, σ²) iid random variables. Find the likelihood function.
L(μ, σ²) = ∏_{i=1}^n (1/(σ√(2π))) exp(-(xi - μ)²/(2σ²)) = (2πσ²)^{-n/2} exp(-∑_{i=1}^n (xi - μ)²/(2σ²))

Procedure of one approach to finding the MLE
• 1) Define the likelihood function L(θ).
• 2) Take the natural logarithm ln L(θ).
• 3) Differentiate ln L(θ) with respect to θ, and set the derivative equal to 0.
• 4) Solve for θ to obtain the candidate estimate θ̂.
• 5) Check whether it is a maximum (ideally a global maximum).
• Still confused? Two examples follow.

Ex1: Suppose X1, …, Xn are random samples from a Poisson distribution with parameter λ. Find the MLE λ̂.
We have the pmf
p(x) = λ^x e^{-λ} / x!, x = 0, 1, 2, …; λ > 0
Hence, the likelihood function is
L(λ) = ∏_{i=1}^n λ^{xi} e^{-λ} / xi! = λ^{∑ xi} e^{-nλ} / ∏_{i=1}^n xi!

Differentiating the log-likelihood with respect to λ gives
d ln L(λ)/dλ = (∑_{i=1}^n xi)/λ - n
Setting the result equal to zero,
(∑_{i=1}^n xi)/λ - n = 0  ⇒  λ = (∑_{i=1}^n xi)/n = x̄
Hence, the MLE of λ is λ̂ = X̄.

Ex2: Let X1, …, Xn be N(μ, σ²).
a) If μ is unknown and σ² = σ0² is known, find the MLE of μ.
b) If μ = μ0 is known and σ² is unknown, find the MLE of σ².
c) If μ and σ² are both unknown, find the MLE of (μ, σ²).

• Ans: Let θ = σ², so the likelihood function is
L(μ, θ) = (2πθ)^{-n/2} exp(-∑_{i=1}^n (xi - μ)²/(2θ))

So after taking the natural log we have
ln L(μ, θ) = -(n/2) ln(2π) - (n/2) ln θ - ∑_{i=1}^n (xi - μ)²/(2θ)

a) When θ = σ0² is known, we only need to solve for the unknown parameter μ:
∂/∂μ ln L(μ, σ0²) = ∑_{i=1}^n (xi - μ)/σ0² = 0  ⇒  ∑_{i=1}^n xi - nμ = 0  ⇒  μ̂ = x̄

b) When μ = μ0 is known, we only need to solve for the one parameter θ = σ²:
∂/∂θ ln L(μ0, θ) = -n/(2θ) + ∑_{i=1}^n (xi - μ0)²/(2θ²) = 0  ⇒  σ̂² = (1/n) ∑_{i=1}^n (Xi - μ0)²

c) When both μ and θ are unknown, we differentiate with respect to both parameters and mostly follow the same steps as in parts a) and b).

A real-world example: sound localization
Two microphones (Mic1, Mic2) and an MCU estimate the direction of a sound source ("Robust Sound Localization", IEEE Transactions on Signal Processing, Vol. 53, No. 6, June 2005). In reality the received signals are corrupted by noise and reverberation: for a source at 1 m, angle 60°, frequency 1 kHz, the Fourier transform of the received signals shows noise spread across frequencies.

Algorithm:
1. Signal collection (original signal samples in the time domain):
m1(t) = s(t) + n1(t)
m2(t) = s(t) + n2(t)
2. Cross-correlation (received signals after the DFT, in the frequency domain):
τ̂ = argmax_τ ∫ M1(ω) M2*(ω) e^{jωτ} dω

• However, noise is mixed into the signal, so the weighted cross-correlation algorithm becomes
τ̂ = argmax_τ ∫ W(ω) M1(ω) M2*(ω) e^{jωτ} dω
• where the ML method supplies the weighting function W(ω) that reduces sensitivity to noise and reverberation:
W(ω) = |M1(ω)| |M2(ω)| / ( |N1(ω)|² |M2(ω)|² + |N2(ω)|² |M1(ω)|² )

The disadvantages of MLE
• Complicated calculation (slow): it is often nearly the last approach tried on a problem.
• Approximate results (not exact).

References:
[1] Halupka et al., 2005, "Robust sound localization in 0.18 μm CMOS".
[2] S. Zucker, 2003, "Cross-correlation and maximum-likelihood analysis: a new approach to combining cross-correlation functions".
[3] Tamhane & Dunlop, "Statistics and Data Analysis: From Elementary to Intermediate", Chap. 15.
[4] Kandethody M.
Ramachandran, Chris P. Tsokos, "Mathematical Statistics with Applications", pp. 235-252.

Likelihood Ratio Test (Ji Wang)

Brief introduction
• The likelihood ratio test was first proposed by Neyman and E. Pearson in 1928. This test method is widely used and often has some kind of optimality property.
• In statistics, a likelihood ratio test is used to compare the fit of two models, one of which is nested within the other. This often occurs when testing whether a simplifying assumption for a model is valid, as when two or more model parameters are assumed to be related.

Introduction to most powerful tests
For the hypotheses H0: θ ∈ Θ0 vs H1: θ ∈ Θ1, suppose we have two test functions Y1 and Y2. If Eθ[Y1] ≥ Eθ[Y2] for all θ ∈ Θ1, then we say Y1 is more powerful than Y2. If there is a test function Y* satisfying this inequality against every test function Y, then Y* is called the uniformly most powerful test.

The advantage of the likelihood ratio test over the significance test
• The significance test can only handle hypotheses about specific points, such as
H0: θ = θ0 vs H1: θ = θ1,
but cannot handle the very common hypotheses
H0: θ ∈ Θ0 vs H1: θ ∈ Θ - Θ0,
because the method of significance testing gives no way to find the rejection region.

Definition of the likelihood ratio test statistic
• Let X1, …, Xn be iid samples from the family of distributions F = { f(x, θ) : θ ∈ Θ }. For the test
H0: θ ∈ Θ0 vs H1: θ ∈ Θ - Θ0,
let
λ(x) = max_{θ∈Θ0} l(θ; x1, …, xn) / max_{θ∈Θ} l(θ; x1, …, xn)
We call λ(X) the likelihood ratio of the above hypotheses. It is sometimes also called the generalized likelihood ratio.
• From the definition of the likelihood ratio test statistic, if the value of λ(X) is small, the data are much less probable under the null hypothesis H0: θ ∈ Θ0 than under the alternative H1, so it is reasonable for us to reject the null hypothesis.
Thus, this test rejects H0 if λ(X) ≤ C.

The definition of the likelihood ratio test
• We use λ(X) as the test statistic for the test H0: θ ∈ Θ0 vs H1: θ ∈ Θ - Θ0, with rejection region { λ(X) ≤ C }, where C satisfies P_θ{ λ(X) ≤ C } ≤ α for θ ∈ Θ0. This test is the likelihood ratio test at level α.
• If we do not know the distribution of λ(X) under the null hypothesis, it is very difficult to find the critical value of the LRT. However, if there is a statistic T(X) that is monotone in λ and whose distribution under the null hypothesis is known, we can base a significance test on T(X).

The steps for carrying out a likelihood ratio test
• Step 1: Find the likelihood function of the sample X1, …, Xn.
• Step 2: Find λ(X), the test statistic, or some other statistic that is monotone in λ(X).
• Step 3: Construct the rejection region using the type I error at significance level α.

• Example: X1, …, Xn are random samples with pdf f(x, θ) = e^{-(x-θ)}, x ≥ θ. Derive the rejection region at level α for the hypotheses H0: θ ≤ 0 vs H1: θ > 0.

Solution:
• Step 1: the sample distribution is
f(x, θ) = e^{-∑_{i=1}^n (xi - θ)} I(x(1) ≥ θ),
and it is also the likelihood function; the parameter space is Θ = R with Θ0 = (-∞, 0]. Then we derive
max_{θ∈Θ0} l(θ; x1, …, xn) = e^{-∑ xi},  max_{θ∈Θ} l(θ; x1, …, xn) = e^{-∑ xi + n x(1)}
• Step 2: the likelihood ratio test statistic is
λ(x) = e^{-n x(1)} = e^{-(1/2)(2n x(1))}
We can simply use 2nX(1), because it is monotone in λ(x).
• Step 3: under the null hypothesis (θ = 0), 2nX(1) ~ χ²(2), so the critical value is c = χ²_α(2), obtained from P0{ 2nX(1) ≥ c } = α. That is to say, 2nX(1) is the likelihood ratio test statistic and the rejection region is { 2nX(1) ≥ χ²_α(2) }.

Wald Sequential Probability Ratio Test (Xiao Yu)

So far we have assumed that the sample size is fixed in advance. What if it is not fixed?
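Before turning to sequential testing: the claim in the exponential example above, that 2nX(1) ~ χ²(2) under the null and hence that the region { 2nX(1) ≥ χ²_α(2) } has level α, can be checked by simulation. A minimal Python sketch (the values of n and α here are illustrative choices, not from the slides):

```python
import math
import random

random.seed(0)
n, alpha, trials = 20, 0.05, 50_000
# Upper-alpha point of chi^2(2): P(chi2_2 > c) = exp(-c/2), so c = -2 ln(alpha)
c = -2 * math.log(alpha)

rejections = 0
for _ in range(trials):
    # Draw X_1..X_n from f(x) = e^{-x}, x >= 0, i.e. theta = 0 (H0 true)
    x_min = min(random.expovariate(1.0) for _ in range(n))
    if 2 * n * x_min >= c:  # LRT rejects when 2n X_(1) exceeds chi^2_alpha(2)
        rejections += 1

print(rejections / trials)  # close to the nominal level alpha = 0.05
```

Since X(1) is the minimum of n unit exponentials, 2nX(1) is exponential with mean 2, which is exactly the χ²(2) distribution; the simulated rejection rate should sit near α for any n.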
Abraham Wald (1902-1950) developed the sequential probability ratio test (SPRT) by applying the idea of likelihood ratio testing; it samples sequentially, taking observations one at a time.

Hypotheses: H0: θ = θ0 vs H1: θ = θ1. After n observations, form the likelihood ratio
Λn(x1, …, xn) = Ln(θ1 | x1, …, xn) / Ln(θ0 | x1, …, xn) = ∏_{i=1}^n f(xi | θ1) / ∏_{i=1}^n f(xi | θ0)
• If Λn(x1, …, xn) ≤ A, stop sampling and decide not to reject H0.
• If A < Λn(x1, …, xn) < B, continue sampling.
• If Λn(x1, …, xn) ≥ B, stop sampling and decide to reject H0.

SPRT for a Bernoulli parameter
• An electrical parts manufacturer receives a large lot of fuses from a vendor. The lot is regarded as "satisfactory" if the fraction defective p is no more than 0.1; otherwise it is regarded as "unsatisfactory".
H0: p = p0 = 0.1 vs H1: p = p1 = 0.3
Taking logs of the continuation condition A < Λn < B and letting sn be the number of defectives in the first n fuses, the boundaries become two parallel straight lines in (n, sn). With error rates α = 0.10 and β = 0.20 (so ln A = ln(β/(1-α)) ≈ -1.504 and ln B = ln((1-β)/α) ≈ 2.079), the continuation region works out to approximately
-1.114 + 0.186 n < sn < 1.540 + 0.186 n

Fisher information
I(θ) = E[ (d ln f(X | θ)/dθ)² ] = -E[ d² ln f(X | θ)/dθ² ]
The quantity d ln f(X | θ)/dθ is called the score, and its expectation is zero:
E[ d ln f(X | θ)/dθ ] = 0

Cramér-Rao lower bound
Var(θ̂) ≥ 1 / (n I(θ))

Single-parameter Bernoulli experiment
• The Fisher information contained in n independent Bernoulli trials may be calculated as follows. Let A represent the number of successes and B the number of failures (A + B = n):
I_n(θ) = -E[ ∂²/∂θ² ln f(A; θ) ] = -E[ ∂²/∂θ² ln( ((A+B)!/(A! B!)) θ^A (1-θ)^B ) ]
  = -E[ ∂²/∂θ² ( A ln θ + B ln(1-θ) ) ] = E[ A/θ² + B/(1-θ)² ]
  = nθ/θ² + n(1-θ)/(1-θ)² = n/(θ(1-θ))
We can see this is the reciprocal of the variance of the sample proportion of successes in n Bernoulli trials: the larger that variance, the smaller the Fisher information.

Large-sample inferences based on the MLE
The MLE θ̂ solves d ln L(θ)/dθ = 0, and its large-sample behavior is governed by the observed information -d² ln L(θ)/dθ². Plugging in the Fisher information of Bernoulli trials, we can see the MLE is consistent, as we have learned.
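The Bernoulli calculation above can be sanity-checked numerically. This illustrative Python sketch (the values of θ, n, and the replication count are arbitrary choices, not from the slides) verifies that the sampling variance of the MLE p̂ = A/n matches the Cramér-Rao bound 1/(n I(θ)) = θ(1-θ)/n:

```python
import random

random.seed(1)
theta, n, reps = 0.3, 50, 20_000

# Fisher information of a single Bernoulli trial: I(theta) = 1/(theta(1-theta))
info = 1.0 / (theta * (1 - theta))
crlb = 1.0 / (n * info)  # Cramer-Rao bound = theta(1-theta)/n = 0.0042

# Simulate the MLE theta_hat = A/n many times and compare its variance to the bound
estimates = []
for _ in range(reps):
    successes = sum(random.random() < theta for _ in range(n))
    estimates.append(successes / n)
mean = sum(estimates) / reps
var = sum((e - mean) ** 2 for e in estimates) / reps
print(crlb, var)  # the two numbers agree closely
```

Here the bound is attained exactly because the sample proportion is an unbiased estimator whose variance equals θ(1-θ)/n; for other models the MLE attains the bound only asymptotically.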
For large n, θ̂ is approximately N(θ, 1/(n I(θ))).
• An approximate large-sample (1-α)-level confidence interval (CI) is given by
θ̂ - z_{α/2} √(1/(n I(θ̂))) ≤ θ ≤ θ̂ + z_{α/2} √(1/(n I(θ̂)))

Bayes' Theorem (Jaeheun Kim)

Thomas Bayes (1702-1761)
• English mathematician and a Presbyterian minister born in London.
• A specific case of the theorem (Bayes' theorem) was published after his death, by Richard Price.

Bayesian inference
• Bayesian inference is a method of statistical inference in which some kind of evidence or observations are used to calculate the probability that a hypothesis may be true, or else to update its previously calculated probability.
• "Bayesian" comes from its use of Bayes' theorem in the calculation process.

Bayes' theorem
Bayes' theorem shows the relation between two conditional probabilities:
P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
• We can form an updated probability (the posterior probability) from the initial probability (the prior probability) using new information.
• We call this updating process Bayes' theorem: prior prob. + new info. -> (Bayes' theorem) -> posterior prob.

Monty Hall
Should we switch doors or stay? (http://en.wikipedia.org/wiki/Monty_Hall_problem)
A contestant chose Door 1 and then the host opened one of the other doors (Door 3). Would switching from Door 1 to Door 2 increase the chance of winning the car?
Let Di = {Door i conceals the car} and Oj = {host opens Door j after the contestant chooses Door 1}. Then
P(D1) = P(D2) = P(D3) = 1/3
P(O3 | D1) = 1/2, P(O3 | D2) = 1, P(O3 | D3) = 0
(when you stay)
P(D1 | O3) = P(O3 | D1) P(D1) / [ P(O3 | D1) P(D1) + P(O3 | D2) P(D2) + P(O3 | D3) P(D3) ]
  = (1/2)(1/3) / [ (1/2)(1/3) + (1)(1/3) + (0)(1/3) ] = 1/3
(when you switch)
P(D2 | O3) = P(O3 | D2) P(D2) / [ P(O3 | D1) P(D1) + P(O3 | D2) P(D2) + P(O3 | D3) P(D3) ]
  = (1)(1/3) / [ (1/2)(1/3) + (1)(1/3) + (0)(1/3) ] = 2/3
So switching doubles the chance of winning.

15.3.1 Bayesian Estimation

Premises for doing a Bayesian estimation:
1. Prior knowledge about the unknown parameter θ ("Trust me. I know this θ!").
2.
A probability distribution for θ: π(θ) (the prior distribution).

(Zhenrui & friends)

General equation:
π*(θ) = f(x1, x2, …, xn | θ) π(θ) / ∫ f(x1, x2, …, xn | θ) π(θ) dθ ∝ f(x | θ) π(θ)
• π*(θ): the posterior distribution.
• f(x1, x2, …, xn | θ): the likelihood function of θ based on the observed values x1, x2, …, xn.
• θ: the unknown parameter, from a distribution with pdf/pmf f(x | θ); it is treated as a random variable in Bayesian estimation.
• π(θ): the prior distribution.
• The denominator is the marginal pdf of X1, X2, …, Xn: just a normalizing constant that makes ∫ π*(θ) dθ = 1.
• The mean μ* and variance σ*² of π*(θ) are called the posterior mean and variance, respectively. μ* can be used as a point estimate of θ (the Bayes estimate).

Bayesian estimation, continued
A question: if I only know the possible range of values of θ, but cannot summarize it in the form of a probability distribution, can I still do Bayesian estimation?
No! To apply Bayes' theorem, every term in the equation has to be a probability term: a prior π(θ) works, a bare range of θ does not.

Criticisms of the Bayesian approach:
1. Perceptions of prior knowledge differ from person to person; the method is "subjective".
2. It can be too fuzzy to quantify the prior knowledge of θ in the form of a distribution.

Conjugate priors: since π*(θ) ∝ f(x | θ) π(θ), a conjugate family is a family of prior distributions for which the posterior distribution is of the same form as the prior.
Examples of conjugate priors (from the textbook, Examples 15.25 and 15.26):
• The normal distribution is a conjugate prior on μ of N(μ, σ²) (if σ² is already known).
• The beta distribution is a conjugate prior on p of a binomial distribution Bin(n, p).

15.3.2 Bayesian Testing
Simple vs simple hypothesis test: H0: θ = θ0 vs H1: θ = θ1, with prior probabilities π0 = P(θ = θ0) and π1 = P(θ = θ1) of H0 and H1. Writing a = π0 f(x1, x2, …, xn | θ0) and b = π1 f(x1, x2, …, xn | θ1), the posterior probabilities are
π0* = a/(a + b), π1* = b/(a + b)
A Bayesian test rejects H0 if
π1*/π0* ≥ k
where k > 0 is a suitably chosen critical constant.
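A tiny numerical sketch of this test (the data values and the equal priors are invented for illustration): for simple hypotheses about a normal mean with known variance, the posterior odds π1*/π0* equal the prior odds times the likelihood ratio, and the test rejects H0 when they exceed k.

```python
import math

# Simple vs simple: H0: mu = 0 against H1: mu = 1, sigma = 1 known (illustrative values)
data = [0.8, 1.2, 0.4, 1.5]
pi0, pi1 = 0.5, 0.5  # assumed prior probabilities of H0 and H1

def loglik(mu):
    # log f(x1,...,xn | mu) for iid N(mu, 1) observations
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in data)

# Posterior odds pi1*/pi0* = (pi1/pi0) * likelihood ratio
lr = math.exp(loglik(1.0) - loglik(0.0))
posterior_odds = (pi1 / pi0) * lr

k = 1.0  # critical constant of the Bayesian test
print(posterior_odds > k)  # True: the test rejects H0 for these data
```

With equal priors the posterior odds reduce to the bare likelihood ratio, which previews the connection to the Neyman-Pearson test discussed next.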
A large value of k corresponds to a small value of α.

Bayesian testing, continued
Bayesian test vs the Neyman-Pearson likelihood ratio test (15.18):
• Neyman-Pearson lemma: reject H0 if
L(θ1 | x1, x2, …, xn) / L(θ0 | x1, x2, …, xn) = f(x1, x2, …, xn | θ1) / f(x1, x2, …, xn | θ0) ≥ k
• Bayesian test: reject H0 if
π1*/π0* = π1 f(x1, x2, …, xn | θ1) / ( π0 f(x1, x2, …, xn | θ0) ) ≥ k,
i.e., f(x1, x2, …, xn | θ1) / f(x1, x2, …, xn | θ0) ≥ (π0/π1) k = k*
• The Bayesian test can be considered a specialized Neyman-Pearson likelihood ratio test in which the probabilities of each hypothesis (H0 and H1) being true are known: π0 and π1.
• If π0 = π1 = 1/2, the condition reduces to
f(x1, x2, …, xn | θ1) / f(x1, x2, …, xn | θ0) ≥ k
and the Bayesian test becomes the Neyman-Pearson likelihood ratio test.

Bayesian Inference for One Parameter (Bingqi Cheng)

A biased coin
• A Bernoulli random variable with Prob(Head) = θ, where θ is unknown.

Bayesian statistics: three ingredients
• Prior distribution: an initial guess or prior knowledge about the parameter θ; highly subjective.
• Likelihood function: fits or describes the distribution of the real data (e.g., a sequence of heads and tails when tossing the coin).
• Bayes' theorem: updates the prior distribution with the real data, giving the posterior distribution Prob(θ | data).

Prior distribution
• Beta distributions are conjugate priors to Bernoulli likelihoods: if the prior is Beta and the likelihood function is Bernoulli, then the posterior is Beta.
[Figure: density of the Beta prior on θ over [0, 1]]
[Figure: density of the Beta posterior on θ over [0, 1], sharper than the prior after updating with the data]
From the posterior one can also compute the predictive probability of heads on the next toss.

Bayesian vs MLE with the calculus method (Xiao Yu)
Back to the example of the biased coin: again we have 20 trials and get 13 heads.
f(p) = C(20, 13) p^13 (1 - p)^7
f'(p) = C(20, 13) p^12 (1 - p)^6 (13 - 20p) = 0  ⇒  p̂ = 13/20 = 0.65

Jeffreys prior
p(θ) ∝ √(det I(θ))

• MLE with the calculus method: finds the parameter quickly and directly, when possible.
-> A huge step.
• Bayesian: initial guess + approximation + convergence -> another starting line, small steps, and maybe not the best value.

Why bother with Bayesian methods? With large amounts of data, or with models where direct maximization is intractable, the Bayesian computation can be easier to handle. For example, consider a two-component Gaussian mixture:
log L(p, C | x1, …, xn) = ∑_{j=1}^n log( p1 (1/(√(2π) C1)) e^{-(xj - μ1)²/(2 C1²)} + p2 (1/(√(2π) C2)) e^{-(xj - μ2)²/(2 C2²)} )
• This is a Gaussian mixture; in general the observations are vectors and C is the covariance matrix. Finding the maximum likelihood estimate for a mixture by direct application of calculus is tough.

Bayesian learning
• The more evidence we have, the more we learn. The more flips we do, the more we know about the probability of getting a head, which is the parameter of the binomial distribution. An application: the EM (Expectation-Maximization) algorithm, which can beautifully handle mixture models and some regression problems.

Two coins
Game: suppose now that there are two coins which can be flipped. The probability of heads is p1 for the first coin, and p2 for the second coin. We decide on each flip which of the two coins to flip, and our objective is to maximize the number of heads that occur (p1 and p2 are unknown).

Matlab code for the strategy:

function [] = twocoin2(p1,p2,n)
H1=0; T1=0;
H2=0; T2=0;
syms ps1;
syms ps2;
for k=1:n,
    % Posterior probability that p1 > p2 given the flips so far,
    % with independent uniform (Beta(1,1)) priors on p1 and p2
    temp = int(ps2^H2*(1-ps2)^T2,0,ps1);
    p(k) = double(int(ps1^H1*(1-ps1)^T1*temp,0,1) ...
           /(beta(H1+1,T1+1)*beta(H2+1,T2+1)));
    if rand < p(k),
        guess(k) = 1;
        y(k) = rand < p1;
        H1 = H1 + y(k);
        T1 = T1 + (1 - y(k));
    else
        guess(k) = 2;
        y(k) = rand < p2;
        H2 = H2 + y(k);
        T2 = T2 + (1 - y(k));
    end
end
disp('Guesses: ')
tabulate(guess)
disp('Outcomes: ')
tabulate(y)
figure(2)
plot(p)
end

Example run with p1 = 0.4, p2 = 0.6: the plot of p(k) shows the value of P(p1 > p2 | H1, T1, H2, T2) over time.

Statistical Decision Theory (Kicheon Park)

Abraham Wald (1902-1950)
• Hungarian mathematician.
• Major contributions: geometry, econometrics, statistical sequential analysis, and decision theory.
• Died in an airplane accident in 1950.
(Hans Schneeweiss, "Abraham Wald", Department of
Statistics, University of Munich, Akademiestr. 1, 80799 München, Germany.)

Why is decision theory needed? Limits of classical statistics:
I. Prior information and loss.
II. Initial and final precision.
III. Formulational inadequacy.

Limits of classical statistics
• Prior information and loss: classical statistics ignores relevant effects from past experience and the losses from each possible decision.
• Initial and final precision: precision before and after observation of the sample information, which is the result of a long series of identical experiments.
• Formulational inadequacy: a limited ability to reach a meaningful decision in the majority of problems.

Classical statistics vs decision theory
• Classical statistics: direct use of the sample information.
• Decision theory: combine the sample information with other relevant aspects of the problem to reach the best decision.
-> The goal of decision theory is to make decisions based not only on the available statistical knowledge but also on the uncertainties (θ) involved in the decision problem.

Two types of relevant information
I. Knowledge of the possible consequences of the decisions -> the loss resulting from each possible decision.
II. Prior information -> effects from past experience about similar situations.

Statistical decision theory: elements
• Sample space χ, sample X, unknown parameter θ, and decision space D.
(References: Wolfowitz, "Abraham Wald", Annals of Mathematical Statistics; Tamhane & Dunlop, "Statistics and Data Analysis", Prentice Hall; Berger, "Statistical Decision Theory", Springer-Verlag.)

(Mun Sang Yue)

Statistical decision theory: equations
• Loss function: L(d, θ), the loss incurred by taking decision d when the true parameter is θ.
• Risk function: R(δ, θ) = E[ L(δ(X), θ) ], the expected loss of decision rule δ when the parameter is θ.

Statistical decision theory: decision rules
• Minimax rule: choose δ to minimize the maximum risk, min_δ max_θ R(δ, θ).
• Bayes rule: choose δ to minimize the Bayes (prior-averaged) risk ∫ R(δ, θ) π(θ) dθ.

Statistical decision theory: example
• A retailer must decide whether to purchase a large lot of items containing an unknown fraction p of defectives.
Before making the decision of whether to purchase the lot (decision d1) or not to purchase the lot (decision d2), 2 items are randomly selected from the lot for inspection. The retailer wants to evaluate two decision rules. The prior is π(p) = 2(1 - p).

Example, continued
Let X be the number of defectives observed among the 2 inspected items.

No. of defectives x | Rule δ1 decision | Rule δ2 decision
0                   | d1               | d1
1                   | d2               | d1
2                   | d2               | d2

• Loss functions: L(d1, p) = 8p - 1, and L(d2, p) = 2.
• Risk functions:
R(δ1, p) = L(d1, p) P(δ1 chooses d1 | p) + L(d2, p) P(δ1 chooses d2 | p)
  = (8p - 1) P(X = 0 | p) + 2 P(X = 1 or 2 | p)
R(δ2, p) = L(d1, p) P(δ2 chooses d1 | p) + L(d2, p) P(δ2 chooses d2 | p)
  = (8p - 1) P(X = 0 or 1 | p) + 2 P(X = 2 | p)

[Figure: plots of R(δ1, p) and R(δ2, p) against p; max R(δ1, p) = 2.289 and max R(δ2, p) = 3.329, so δ1 is preferred under the minimax criterion.]

Statistical decision theory: another example
• A shipment of transistors was received by a radio company. A sampling plan was used to check the shipment as a whole, to ensure that the contractual requirement of a 0.05 defect rate was not exceeded. A random sample of n transistors was chosen from the shipment and tested. Based upon X, the number of defective transistors in the sample, the shipment will be accepted or rejected.

Example, continued
• The proportion of defective transistors in the shipment is θ.
• Decision rule: a1, accept the lot if X/n ≤ 0.05; a2, reject the lot if X/n > 0.05.
• Loss functions: L(a1, θ) = 10θ; L(a2, θ) = 1.
• π(θ) can be estimated based on prior experience, and R(δ, θ) can then be calculated.

Summary
• Maximum likelihood estimation selects the estimate of the unknown parameter that maximizes the likelihood function.
• The likelihood ratio test compares the likelihood of the observed outcomes under the null hypothesis to the likelihood under the alternative hypothesis.
• Bayesian methods treat unknown models or variables as random variables with known (prior) distributions, instead of as deterministic quantities that happen to be unknown.

Summary, continued
• Statistical decision theory moves statistics beyond its traditional role of just drawing inferences from incomplete information. The theory focuses on the problem of statistical actions rather than inference.

"Here in the 21st Century … a combination of Bayesian and frequentist ideas will be needed to deal with our increasingly intense scientific environment." Bradley Efron, 164th ASA Presidential Address

THANK YOU! Questions?