Introduction to Frequentist and Bayesian Approaches
Statistical Modeling and Data Analysis
Given a data set, the first question a statistician asks is,
"What is the statistical model for this data?"
We then characterize and analyze the parameters of the model
with an objective in mind.
• Example: SBP of cancer patients vs. normal patients
Cancer: 145, 165, 134, 120, 112, 156, 145, 133, 135, 120
Normal: 138, 120, 112, 110, 128, 134, 128, 109, 138, 140
Objective: Do cancer patients have higher SBP than normal patients?
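As a concrete illustration, here is a minimal sketch in Python (SciPy assumed). The slides do not prescribe a particular test; a one-sided two-sample t-test is one standard choice for this comparison.

```python
# A minimal sketch: compare mean SBP of cancer vs. normal patients
# using a one-sided two-sample t-test (one standard choice; the slides
# do not prescribe a specific test).
from scipy import stats

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
normal = [138, 120, 112, 110, 128, 134, 128, 109, 138, 140]

# H0: mu_cancer = mu_normal  vs.  Ha: mu_cancer > mu_normal
t_stat, p_value = stats.ttest_ind(cancer, normal, alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.3f}")
```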
Population of cancer patients with a probability distribution;
population of normal patients with a probability distribution.

[Figure: two density curves over systolic blood pressure, labeled "normal" (mean $\mu_1$) and "cancer" (mean $\mu_2$).]

Objective is to test the hypothesis $\mu_2 > \mu_1$.
Does the data support this hypothesis?
Assumption: The data are random and are generated from normal distributions.
• Random Variable $X$: $X : S \to \mathbb{R}$, where $S$ is the collection of all subjects. What we observe is one realization $X(s)$.
• Random Sample: $\{X_1, X_2, \ldots, X_n\}$. We collect a sample of subjects $\{s_1, s_2, \ldots, s_n\}$.
Observed Sample: $\{X(s_1), X(s_2), \ldots, X(s_n)\}$
Assumption: $\{s_1, s_2, \ldots, s_n\}$ is a simple random sample
(equally likely as any other sample).
• Multivariate Observations
$\boldsymbol{X} = (X_1, X_2, \ldots, X_k)^\top : S \to \mathbb{R}^k$
An observed vector is one realization of this, i.e., $\boldsymbol{X}(s)$.
Random Sample: {๐‘ฟ1 , ๐‘ฟ2 , โ€ฆ , ๐‘ฟ๐‘› }
Observed sample is a realization of
{๐‘ฟ ๐‘ 1 , ๐‘ฟ ๐‘ 2 , โ€ฆ , ๐‘ฟ ๐‘ ๐‘› }
Note: If the simultaneous inference is to made on its
components, the probability statement should be
viewed in terms of probability of observing
{๐‘ 1 , ๐‘ 2 , โ€ฆ , ๐‘ ๐‘› }
5
Stochastic Process
$\{X(t),\ 0 \le t < \infty\}$
The observed value of this is one realization $\{X(t, s),\ 0 \le t < \infty\}$.
Can we describe a probability distribution of $\{X(t),\ 0 \le t < \infty\}$?
The Kolmogorov consistency theorem says that such a probability distribution can be described.
[Figure: three realizations of the process, each with $X(0) = 0$.]
Discrete time points
$\{X(t),\ t = \ldots, -2, -1, 0, 1, 2, \ldots\}$
If this process is stationary, then a probability model for $X(t)$ can be described in a concise way. For example,
$X(t) = \rho X(t-1) + \epsilon(t) = \sum_{k=0}^{\infty} \rho^k \epsilon(t-k)$,
where $\{\epsilon(t)\}$ is white noise.
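A minimal simulation sketch of this AR(1) model (assuming Gaussian white noise and $\rho = 0.7$; both are illustrative choices, not from the slides):

```python
# Simulate the stationary AR(1) process X(t) = rho * X(t-1) + eps(t).
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.7, 500                    # |rho| < 1 ensures stationarity
eps = rng.normal(0.0, 1.0, size=n)   # Gaussian white noise (assumed)
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + eps[t]

# For large t, x[t] behaves like the moving average sum_k rho^k * eps(t-k),
# whose stationary variance is 1 / (1 - rho^2) for unit-variance noise.
print(x.var(), 1.0 / (1 - rho**2))
```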
Image Process:
$\{X(p),\ p \in Q\}$, where $Q$ is the set of all pixels.
Note that what we observe is a realization of this: $\{X(p, s),\ p \in Q\}$.
The same can be said about a weather map.
Data Analysis
Generally speaking, we perform one or more of the following tasks in data analysis (statistical inference):
• Estimate the model
• Hypothesis testing
• Predictive analysis
Given the sample data, the objective is to make inferences about the population described by the probability model. All inferences are based on the assumed probability model.
Estimation
$\hat{\theta}$ denotes an estimate of $\theta$.
Think of estimating any parameter of a probability model. For example, estimating $\beta_0$ and $\beta_1$ of a regression model
$y = \beta_0 + \beta_1 x + \epsilon$
How good is the estimate $\hat{\theta}$? Well, you might say that if $\hat{\theta} \cong \theta$, it is a good estimate. Not so simple! Note that $\theta$ is unknown.
Frequentist's Interpretation
Note that $\hat{\theta}$ depends on the sample we observe.

[Table: each repeatedly observed sample yields values of two competing estimators $\hat{\theta}$ and $\tilde{\theta}$, together with their errors $|\hat{\theta} - \theta|$ and $|\tilde{\theta} - \theta|$.]

$\hat{\theta}$ is better than $\tilde{\theta}$ if the average of $(\hat{\theta} - \theta)^2$ is smaller than the average of $(\tilde{\theta} - \theta)^2$, i.e.,
$E(\hat{\theta} - \theta)^2 < E(\tilde{\theta} - \theta)^2$ for all $\theta$.
$\hat{\theta}$ is better than $\tilde{\theta}$ if
$E(\hat{\theta} - \theta)^2 < E(\tilde{\theta} - \theta)^2$ for all $\theta$.
A best estimate, in this sense, is of course not possible: if $\hat{\theta}_0 \equiv \theta_0$ irrespective of the observed sample, then
$E(\hat{\theta}_0 - \theta)^2 = 0$ for $\theta = \theta_0$,
so no estimator can beat this constant estimator at $\theta = \theta_0$.
We restrict to a class of estimators, and then try to find the best estimate within this class. For example, we may consider the class of all unbiased estimators.
Theories are well developed for achieving best estimates among the class of unbiased estimates for simple probability models. For complicated models, we can always fall back on maximum likelihood estimates.
Obtain the estimate by maximizing the likelihood function
$L(\theta \mid x_1, x_2, \ldots, x_n) = \Pr(x_1, x_2, \ldots, x_n \mid \theta)$
For small sample size $n$, this may not always yield a good estimate, but for large sample size $n$, this generally yields optimal estimates.
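A sketch of numerical maximum likelihood for an assumed $N(\mu, \sigma^2)$ model (the data values and starting point are illustrative):

```python
# Maximum likelihood: minimize the negative log-likelihood numerically
# for an assumed N(mu, sigma^2) model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

data = np.array([145., 165., 134., 120., 112., 156., 145., 133., 135., 120.])

def neg_log_likelihood(params):
    mu, log_sigma = params            # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[100.0, 1.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# For the normal model the MLEs are the sample mean and the root of the
# mean squared deviation, so these should match the closed forms.
print(mu_hat, sigma_hat)
```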
Asymptotic Optimality of Maximum Likelihood Estimates
$\{\hat{\theta}_n\}$: a sequence of asymptotically normal estimates, i.e.,
$V_n(\theta)^{-1/2}(\hat{\theta}_n - \theta) \to_d N(0, I)$ as $n \to \infty$.
$V_n(\theta)$ can be interpreted as the asymptotic variance of $\{\hat{\theta}_n\}$, and
$V_n(\theta) \ge I_n(\theta)^{-1}$,
where $I_n(\theta)$ is the Fisher information matrix. Under regular probability models, the maximum likelihood estimates $\{\hat{\theta}_{ML}\}$ achieve this lower bound.
Bayesian Interpretation
Prior Distribution: $\pi(\theta)$
Through this we might say that some values of $\theta$ are more likely than other values.
$\hat{\theta}$ is better than $\tilde{\theta}$ if
$\int E(\hat{\theta} - \theta)^2\, \pi(\theta)\, d\theta < \int E(\tilde{\theta} - \theta)^2\, \pi(\theta)\, d\theta$.
A best estimate is now possible; for example,
$\hat{\theta}_B = E(\theta \mid data)$
The RHS is the expectation with respect to the posterior distribution of $\theta$.
Prior Distribution: $\pi(\theta)$
Really? Where did it come from?
You may not believe this, but we are really talking in terms of a statistical philosophy. Can you really believe that the true state of nature $\theta$ is random?

[Figure: the two density curves again, "normal" (mean $\mu_1$) and "cancer" (mean $\mu_2$), over systolic blood pressure.]
๐œ‡1 and ๐œ‡2 are supposed to be fixed mean SBPs of the
normal and cancer populations. Now, we are saying that
they are random.
Bayesian Paradigm
๐œƒ is never a fixed value; under most circumstances some
values of ๐œƒ are more likely than other values.
Before a data is analyzed, we should explore this prior. Then
update it based on the information provided by the data.
Prior: ๐œ‹(๐œƒ)
Data: ๐‘“(๐‘‘๐‘Ž๐‘ก๐‘Ž|๐œƒ)
Posterior: ๐œ‹(๐œƒ|๐‘‘๐‘Ž๐‘ก๐‘Ž)
All information about ๐œƒ is contained in the posterior.
20
Example:
1 in 1,000 in the population carry a particular genetic disorder. Certain tests on a person are performed, and data are collected.
Data: $\{x_1, x_2, \ldots, x_n\}$, with likelihoods $f(data \mid +)$ and $f(data \mid -)$
Prior: $\pi(+) = \frac{1}{1000}$
Posterior: $\pi(+ \mid data) = \dfrac{f(data \mid +)\,\pi(+)}{f(data \mid +)\,\pi(+) + f(data \mid -)\,\pi(-)}$
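A numeric sketch of this posterior calculation; the likelihood values $f(data \mid +)$ and $f(data \mid -)$ below are hypothetical stand-ins, since the slides leave them unspecified:

```python
# Bayes' rule for the genetic-disorder example. The two likelihoods are
# assumed illustrative values, not given in the slides.
prior_pos = 1 / 1000
f_data_pos = 0.98   # assumed f(data | +): carrier
f_data_neg = 0.01   # assumed f(data | -): non-carrier

posterior_pos = (f_data_pos * prior_pos) / (
    f_data_pos * prior_pos + f_data_neg * (1 - prior_pos)
)
print(posterior_pos)  # ~0.089: even a strong test leaves modest posterior probability
```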
The main issues with Bayesian inference are
(1) Appropriateness of the prior
(2) Computation of the posterior distribution
Example: $\{X_1, X_2, \ldots, X_n\}$ is a random sample from $N(\mu, \sigma^2)$.
Prior: $\mu \sim N(\nu_0, \sigma^2 \omega_0)$, $(\sigma^2)^{-1} \sim \chi^2_m$
This is a conjugate prior because the posterior distribution is of the same form as the prior distribution.
Is this prior appropriate?
Prior: $\mu \sim N(\nu_0, \sigma^2 \omega_0)$, $(\sigma^2)^{-1} \sim \chi^2_m$
If nothing is known about $(\mu, \sigma^2)$, take $\omega_0 \approx \infty$, $m = 1$, $\nu_0 = 0$. This gives an almost flat prior for $\mu$ and $\sigma^2$. There are other ways to assign non-informative priors.
Note that if instead
Prior: $\mu \sim N(\nu_0, \tau_0^2)$, $(\sigma^2)^{-1} \sim \chi^2_m$,
then we will have a computational problem in computing the posterior distribution.
Computation of the posterior
There are two popular techniques for computing the posterior distribution:
1. Metropolis-Hastings Algorithm
2. Gibbs Sampler
These techniques can be used effectively for complex probability models and reasonable priors.
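A minimal random-walk Metropolis-Hastings sketch, assuming a simplified setting ($\sigma^2$ known, normal prior on $\mu$; all numbers illustrative) rather than the full conjugate model above:

```python
# Random-walk Metropolis-Hastings for the posterior of mu in a N(mu, sigma^2)
# model with sigma known and a N(nu0, tau0^2) prior on mu (all values assumed).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
sigma, nu0, tau0 = 12.0, 120.0, 20.0

def log_posterior(mu):
    # log prior + log likelihood (up to an additive constant)
    return norm.logpdf(mu, nu0, tau0) + np.sum(norm.logpdf(data, mu, sigma))

samples, mu = [], 120.0
for _ in range(5000):
    proposal = mu + rng.normal(0.0, 2.0)          # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal                              # accept; otherwise keep mu
    samples.append(mu)

print(np.mean(samples[1000:]))                     # posterior mean after burn-in
```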
Frequentist vs. Bayesian

Frequentist:
• All data information is contained in the likelihood function.
• The estimates are viewed in terms of how they behave on the average.
• Estimates are generally obtained by maximizing the likelihood function. Techniques include Newton-Raphson and the EM algorithm.

Bayesian:
• All data information is contained in the likelihood function and the prior.
• Estimates are viewed in terms of where they are located in the posterior.
• Estimates are obtained from the posterior. Techniques include the Gibbs sampler, Metropolis-Hastings, etc.
Mixture Models
Suppose the population is a mixture of two or more populations:
$y_i = \beta_{0i} + \beta_{1i} x_i + \epsilon_i$, with $(\beta_{0i}, \beta_{1i})$ following a mixture of normals.
Bayesians have a better answer for estimating this model than frequentists do.
Hypothesis Testing
Think about how it started in the statistical literature.
Data: $\{X_1, X_2, \ldots, X_n\}$ drawn from a probability model.
$H$: a hypothesis associated with the probability model.
Does the data support this hypothesis?
Bayesians had an answer to this, but they were not popular at the time.
Ans. $P(H \mid data)$
$p$-value (Fisher)
$\{X_1, X_2, \ldots, X_n\}$ drawn from $N(\mu, \sigma^2)$
Hypothesis $H$: $\mu = \mu_0$
Compute $t = \bar{x} - \mu_0$.
$p\text{-value} = \Pr(\text{observing a value } t \text{ or more extreme} \mid H)$
If this $p$-value is very small ($< 0.05$), then the data provide very little evidence in support of the hypothesis.
Conclusion: Reject the hypothesis.
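A sketch of this $p$-value computation; standardizing $\bar{x} - \mu_0$ by $s/\sqrt{n}$ to get the usual $t$ statistic is an assumption the slides gloss over:

```python
# Fisher's p-value for H: mu = mu0, via the usual one-sample t statistic.
import numpy as np
from scipy import stats

data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
mu0 = 120.0

t = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(len(data)))
p_value = 2 * stats.t.sf(abs(t), df=len(data) - 1)   # two-sided "t or more extreme"
print(t, p_value)
```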
Analysis of Variance (ANOVA)
ANOVA is one of the most popular statistical tools for analyzing data.

[Diagram: Factors 1, 2, and 3 each pointing to a response variable Y.]

Does Y (the response) depend on any of the factors?
Example 1: You are doing research on mpg (miles per gallon) for a brand of automobiles.
Question: What affects mpg?

[Diagram: wind speed, air temperature, and air moisture pointing to mpg.]

Do wind speed, air temperature, and air moisture affect mpg?
Example 2:
Research Question: Does blood pressure (BP) depend on weight and gender?

[Diagram: weight and gender pointing to BP.]
There is variation in BP. Some is due to weight, and some is due to gender.

[Figure: scatter plot of BP against weight, with separate markers for female and male subjects.]
Concept:
Variation(BP) = Variation(Weight) + Variation(Gender) + Variation(Error)
These variations can be described by sums of squares:
SS(BP) = SS(Weight) + SS(Gender) + SS(Error)
$df_{BP} = df_w + df_g + df_e$
$df$ is the degrees of freedom, which represents the effective number of terms in the sums of squares.
F-Statistics
Weight: Test statistic
$F_1 = \dfrac{SS(\text{Weight})/df_w}{SS(\text{Error})/df_e} = \dfrac{MS_W}{MS_E}$
Hypothesis $H$: Weight is not a factor in BP.
$p\text{-value} = P(\text{observing a value more extreme than } F_1 \mid H)$
If the $p$-value is small ($< 0.05$), then there is little evidence that weight is not a factor.
Gender: Test statistic
$F_2 = \dfrac{SS(\text{Gender})/df_g}{SS(\text{Error})/df_e} = \dfrac{MS_G}{MS_E}$
The same can be done to see if gender is a factor.
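A sketch of this BP ~ Weight + Gender ANOVA using statsmodels; the synthetic data-generating values are assumptions for illustration:

```python
# Two-factor ANOVA for BP on weight and gender, on synthetic data
# (variable names and generating values are illustrative assumptions).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 60
weight = rng.normal(75, 12, n)
gender = rng.choice(["F", "M"], n)
bp = 90 + 0.4 * weight + 5 * (gender == "M") + rng.normal(0, 8, n)

df = pd.DataFrame({"BP": bp, "Weight": weight, "Gender": gender})
model = smf.ols("BP ~ Weight + C(Gender)", data=df).fit()
print(anova_lm(model))   # SS, df, F statistics, and p-values for each factor
```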
Neyman-Pearson Lemma
Basis for classical hypothesis testing:
$H_0$: Null hypothesis
$H_a$: Alternative hypothesis (research hypothesis)
TS: Test statistic
Decision rule
Conclusion
Type-I Error: False discovery
Type-II Error: False non-discovery
Devise a decision rule so that
$\alpha = \Pr(\text{False Discovery})$
is very small ($= 0.05$, say). Through the Neyman-Pearson Lemma, a most powerful decision rule can be obtained.
$H_0$: $\mu = \mu_0$
The uniformly most powerful unbiased decision rule is: reject when
$|\bar{X} - \mu_0| > k$,
where $k$ is such that
$\Pr(|\bar{X} - \mu_0| > k) = 0.05$.
Note that this is a frequentist method, since the probability statement should be interpreted in a frequentist manner.
Likelihood Approach
The Neyman-Pearson Lemma works only for simple probability models. More generally, use the likelihood ratio test statistic
$-2 \log \lambda = 2(\max \log L - \max_H \log L)$,
where the second maximum is taken under the hypothesis $H$. If the hypothesis $H$ is correct, then $-2 \log \lambda$ should be close to 0. Thus, we reject the hypothesis $H$ if
$-2 \log \lambda > c$.
The cut-off point $c$ can be obtained through the asymptotic distribution of $-2 \log \lambda$, which is usually $\chi^2$.
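A sketch of the likelihood ratio test in the simplest case, assuming $\sigma^2$ known so that $-2 \log \lambda$ has the closed form $n(\bar{x} - \mu_0)^2/\sigma^2$ (an assumed simplification; data and values are illustrative):

```python
# Likelihood ratio test of H: mu = mu0 for N(mu, sigma^2) with sigma known.
# In this special case -2 log(lambda) = n * (xbar - mu0)^2 / sigma^2,
# which is exactly chi-square with 1 df under H.
import numpy as np
from scipy.stats import chi2

data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
mu0, sigma = 120.0, 12.0

neg2_log_lambda = len(data) * (data.mean() - mu0) ** 2 / sigma ** 2
c = chi2.ppf(0.95, df=1)             # cutoff from the chi-square_1 distribution
print(neg2_log_lambda, c, neg2_log_lambda > c)
```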
Model Selection
Suppose you want to choose one model out of several. This is a type of multiple-hypotheses problem.
Regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
Not all predictors $x_1, x_2, \ldots, x_k$ are significant, and you want to select the set of significant predictors. This can be viewed as selecting one of several models $M_j$, $j = 1, 2, \ldots, m$:
$-2 \log \lambda_{M_j} = 2(\max \log L - \max_{M_j} \log L)$
Choose the model that yields the smallest $-2 \log \lambda_{M_j}$.
This yields a biased selection, meaning that a model with a larger number of parameters has a better chance of being selected. The AIC and BIC information criteria penalize model size:
$AIC = 2 \max_{M_j} \log L - 2 \times (\#\text{ of parameters in } M_j)$
$BIC = 2 \max_{M_j} \log L - \log n \times (\#\text{ of parameters in } M_j)$
Select the model with the highest value of AIC (or BIC).
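A sketch of AIC/BIC comparison with statsmodels on synthetic data. Caution on sign conventions: statsmodels reports $AIC = -2 \max \log L + 2k$, so there one selects the smallest reported value, which matches selecting the largest value under the slides' definition.

```python
# Compare nested regression models by AIC/BIC (statsmodels convention:
# smaller reported AIC/BIC is better). Data are synthetic assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)     # x2 is irrelevant by construction
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

for formula in ["y ~ x1", "y ~ x1 + x2"]:
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.aic, fit.bic)        # "y ~ x1" should win
```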
Bayesian Hypothesis Testing
Data: $\{X_1, X_2, \ldots, X_n\}$ drawn from $N(\mu, \sigma^2)$
Hypothesis $H$: $\mu = \mu_0$
Prior: $p_0 = \Pr(\mu = \mu_0)$, $1 - p_0 = \Pr(\mu \ne \mu_0)$
Posterior: $p = \Pr(\mu = \mu_0 \mid Data)$, $1 - p = \Pr(\mu \ne \mu_0 \mid Data)$
Bayes Factor:
$BF = \dfrac{p/(1-p)}{p_0/(1-p_0)}$
If this Bayes factor is large ($BF \ge 20$), the data have sufficient evidence to support the hypothesis $H$: $\mu = \mu_0$.
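A small sketch of the Bayes factor formula; the posterior probability below is assumed for illustration, not computed from data:

```python
# Bayes factor as the ratio of posterior odds to prior odds of H.
p0 = 0.5            # prior Pr(mu = mu0)
p = 0.95            # posterior Pr(mu = mu0 | data), assumed for illustration

bayes_factor = (p / (1 - p)) / (p0 / (1 - p0))
print(bayes_factor)  # 19.0; BF >= 20 would be read as strong support for H
```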
Frequentist vs. Bayesian
Note that both the $p$-value and classical hypothesis tests are frequentist, since their probability statements are interpreted in terms of repeated sampling:
$p\text{-value} = \Pr(\text{observing a value } t \text{ or more extreme} \mid H)$
$\alpha = \Pr(\text{False Discovery}) = \Pr(|\bar{X} - \mu_0| > c)$
The Bayes factor is used in Bayesian tests, which are based on the posterior probability $\Pr(H \mid Data)$.
Multiple Hypotheses:
Consider 1000 independent tests, each at a Type-I error of $\alpha = 0.05$. Then 5% of the true null hypotheses would be falsely rejected. In other words, if 50 of the hypotheses were rejected, there is no guarantee that they were not all falsely rejected.
FWER: with $m$ = # of hypotheses,
$\pi = P(\text{one or more falsely rejected null hypotheses}) = 1 - (1 - \alpha)^m$
$\alpha = 1 - (1 - \pi)^{1/m} \approx \pi/m$ (Bonferroni correction)
If $m$ is large, $\alpha$ would be very small. Thus the power of detecting any true positive would be very small.
Sequential Bonferroni Corrections:
Let $p_{[1]} \le p_{[2]} \le \cdots \le p_{[m]}$ be the $p$-values of independent tests with corresponding null hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(m)}$.
Holm's Method (Holm, 1979, Scand. J. Statist.):
• If $p_{[1]} > \pi/m$, accept all nulls.
• If $p_{[1]} \le \pi/m$, reject $H_{(1)}$; then if $p_{[2]} > \pi/(m-1)$, accept the rest of the nulls.
• Continue until the first $j$ such that $p_{[j]} > \pi/(m - j + 1)$. In that case reject all $H_{(i)}$, $i \le j - 1$, and accept the rest of the nulls.
Simes Method (Biometrika, 1986):
• If $p_{[m]} \le \pi$, reject all nulls.
• If not, but if $p_{[m-1]} \le \pi/2$, reject all $H_{(i)}$, $i = 1, 2, \ldots, m-1$.
• Continue until the first $p_{[i]} \le \pi/(m - i + 1)$. In that case reject all $H_{(j)}$, $j = 1, 2, \ldots, i$.
Note: Both Holm's and Simes' methods are designed to refine the FWER.
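A sketch of these FWER-controlling corrections via statsmodels' multipletests: 'holm' is Holm's step-down method, and 'simes-hochberg' is the Simes-based step-up procedure described above (the $p$-values are illustrative):

```python
# FWER-controlling multiple-testing corrections on illustrative p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
for method in ["bonferroni", "holm", "simes-hochberg"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject.sum(), "rejections")
```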
False Discovery Rate (FDR): Benjamini and Hochberg (1995), JRSS
When the number of hypotheses $m$ is very large (say in the thousands), and each individual hypothesis is not important, the FWER criterion is not very useful, since it yields few discoveries. For example, in a microarray data analysis the objective is to detect potential genes for future exploration; here each individual gene is not important. In such cases, tests with a controlled FWER would yield few discoveries.
FDR = Expected proportion of false rejections.

                    Accept Null   Reject Null   Total
True Null               U             V          m_0
True Alternatives       T             S          m - m_0
Total                 m - R           R           m

$FDR = E\left[\frac{V}{R}\right]$, where $\frac{V}{R} = 0$ if $R = 0$; equivalently,
$FDR = E\left[\frac{V}{R} \mid R > 0\right] P(R > 0)$.
Note that $FWER = P(V > 0)$.
Benjamini and Hochberg proved that the following procedure produces $FDR \le q$:
Let $k$ be the largest integer $i$ such that $p_{[i]} \le \frac{i}{m} q$; then reject all $H_{(j)}$, $j = 1, 2, \ldots, k$.
The result was proved under the assumption of independent test statistics. It was later extended to positively correlated test statistics by Benjamini and Yekutieli (2001, Ann. Stat.).
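A sketch of this Benjamini-Hochberg step-up rule, implemented directly and checked against statsmodels' 'fdr_bh' (illustrative $p$-values):

```python
# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) q and reject
# the k smallest p-values; then cross-check with statsmodels.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                  0.06, 0.074, 0.205, 0.212, 0.36])
q, m = 0.05, len(pvals)

order = np.argsort(pvals)
below = pvals[order] <= (np.arange(1, m + 1) / m) * q
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
print("BH rejections (direct):", k)

reject, _, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
print("BH rejections (statsmodels):", reject.sum())   # should agree
```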
Bayesian Interpretation (Storey, 2003, Ann. Stat.)
$pFDR = E\left[\frac{V}{R} \mid R > 0\right]$
Test $H_0^i: \theta_i = 0$ vs. $H_a^i: \theta_i \ne 0$, $i = 1, 2, \ldots, m$. Let $T_i$ be test statistics that reject $H_0^i$ if $T_i > c$, with $T_i$, $i = 1, 2, \ldots, m$, independently distributed. If $\theta_1, \theta_2, \ldots, \theta_m$ are i.i.d. with $p = P(\theta_i \in H_0^i) > 0$, then
$pFDR = P(H_0^i \mid T_i > c)$
Note: pFDR is a posterior version of the Type-I error.
Directional Hypothesis Problem (three-decision problem):
Suppose $H_0^i: \theta_i = 0$ is rejected, but it is also important to find the direction of $\theta_i$, i.e., $\theta_i < 0$ or $\theta_i > 0$. So the problem is to find subsets $S_-$ and $S_+$ of $\{1, 2, \ldots, m\}$ such that
$S_- = \{i : \theta_i < 0\}$ and $S_+ = \{i : \theta_i > 0\}$.
Example: Gene selection
When genes are altered under an adverse condition, such as cancer, the affected genes show under- or over-expression in a microarray.
$X_i$: expression level, $X_i \sim P(\theta_i, \alpha)$
$H_0^i: \theta_i = 0$ vs. $H_-^i: \theta_i < 0$ or $H_+^i: \theta_i > 0$
The objective is to find the genes with under-expression and the genes with over-expression.
Directional Error (Type III error):
A Type III error is defined as P(selection of the false direction when the null is rejected). The traditional method does not control the directional error: for example, rejecting when $|t| > t_{\alpha/2}$ and declaring a positive direction when $t > t_{\alpha/2}$, an error occurs if $\theta < 0$.
Sarkar and Zhou (2008, JSPI); Finner (1999, AS); Shaffer (2002, Psychological Methods); Lehmann (1952, AMS; 1957, AMS).
The main point of these works is that if the objective is to find the true direction of the alternative after rejecting the null, then the Type III error must be controlled instead of the Type I error.
Bayesian Decision-Theoretic Framework
$H_0^i: \theta_i = \theta_0$ (say, 0), $H_-^i: \theta_i < \theta_0$, $H_+^i: \theta_i > \theta_0$
Suppose $\theta_1, \theta_2, \ldots, \theta_m$ are generated from
$\pi(\theta) = p_- \pi_-(\theta) + p_0 \pi_0(\theta) + p_+ \pi_+(\theta)$,
where
$\pi_-(\theta) = g_-(\theta) I(\theta < 0)$, $\pi_0(\theta) = I(\theta = 0)$, $\pi_+(\theta) = g_+(\theta) I(\theta > 0)$,
$g_-(\cdot)$ is a density with support contained in $(-\infty, 0)$, and $g_+(\cdot)$ is a density with support contained in $(0, \infty)$. $g_-$ and $g_+$ could be truncated versions of a single density on $\theta$.
The skewness in the prior is introduced by $(p_-, p_0, p_+)$:
• $p_- < p_+$ reflects that the right tail is more likely than the left tail.
• $p_- = 0$ (or $p_+ = 0$) would yield a one-tail test.
• $p_- = p_+$, with $g_-$ and $g_+$ as truncations of a symmetric density, would yield a two-tail test.
• $p_-$ and $p_+$ can be assigned based on which tail is more important.
Loss Function
$L(\boldsymbol{\theta}, \boldsymbol{d}) = \sum_{i=1}^{m} L_i(\theta_i, d_i)$,
where $d_i = (d_-^i, d_0^i, d_+^i)$ takes the values
$(1, 0, 0)$ for selecting $H_-^i$,
$(0, 1, 0)$ for selecting $H_0^i$,
$(0, 0, 1)$ for selecting $H_+^i$.
Let $\delta_i = (\delta_-^i, \delta_0^i, \delta_+^i)$ be a randomized rule. The average risk for a decision rule $\delta = (\delta_1, \ldots, \delta_m)$ is given by
$r_\delta(\pi) = p_- r_{-\delta}(\pi_-) + p_0 r_{0\delta}(0) + p_+ r_{+\delta}(\pi_+)$,
where
$r_{-\delta}(\pi_-) = \sum_i \int_{\theta_i < 0} R(\theta_i, \delta_i)\, \pi_-(\theta_i)\, d\theta_i$,
$r_{+\delta}(\pi_+) = \sum_i \int_{\theta_i > 0} R(\theta_i, \delta_i)\, \pi_+(\theta_i)\, d\theta_i$,
$r_{0\delta}(0) = \sum_i R(0, \delta_i)$.
For a fixed prior $\pi$, decision rules can be compared by comparing the space
$S(\pi) = \{(r_{-\delta}(\pi_-), r_{0\delta}(0), r_{+\delta}(\pi_+)) : \delta \in D^*\}$;
consider the class of all rules $\delta$ for which $R(0, \delta)$ is the same.

[Figure: the risk set in the $(r_{-\delta}(\pi_-), r_{+\delta}(\pi_+))$ plane, with the Bayes rule on the lower boundary supported by a line whose slope depends on $p_-$ and $p_+$ (here $p_- > p_+$).]
Remark: This theorem implies that if a priori it is known that $H_+^i$ is more likely than $H_-^i$ ($p_+ > p_-$), then the average risk of the Bayes rule in the positive direction will be smaller than the average risk in the negative direction.
For the "0-1" loss, this would mean that the expected number of falsely detected genes in the positive direction would be less than the expected number of falsely detected genes in the negative direction.