Download BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD …

BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS Andrew C. Heath PRE-HISTORY (STATISTICS 101) Binomial distribution – gives the probabilities of various possible numbers of ‘successful’ outcomes in a fixed number of discrete trials, where all trials have the same probability of success. Probability that X is equal to a particular value x (x= 0, 1, 2…n) is given by  n x P  X  x     p (1  p ) n x  x Useful for genetics (e.g. transmission versus non-transmission of an allele)! LIKELIHOOD Focus on probability (“likelihood”) of the data, as a function of model parameters. E.g. Sham (1998) uses Neel & Schull (1954) data on opalescent dentine in random sample of 112 offspring of an affected parent, found 52 affected, 60 normal. Compatible with hypothesis of rare autosomal dominant? Does the observed population (.464) differ from expected proportion of 0.5? Likelihood function for segregation ratio p is  n r L p     p (1  p)n r r MAXIMUM LIKELIHOOD ESTIMATION = Find the maximum value of the likelihood function in the range 0  p 1. In more difficult problems, usual to maximize the log-likelihood, since computationally more convenient, & this also maximizes the likelihood.  p In this simple case, maximum likelihood estimate (MLE) of p, = r/n. For the opalescent dentine data, ignoring the constant term involving n, r which does not vary as a function of p, log-likelihood function is ln L(p) = 52 ln(p) + 60 ln(1–p) Figure 2.1 Log-likelihood function of the segregation ratio for the opalescent dentine data (from Sham, 1998) p LIKELIHOOD-RATIO STATISTIC Likelihood ratio statistic: twice the difference between the log likelihood of the  data at the MLE (i.e. p ), L1, and the log likelihood of the data at the hypothesized value of 0.5, L0: 2(ln L1 – ln L0). For the segregation ratio example,  2ln L1  ln L0   2 r  ln r  n  r ln nr  n  ln 1 n n 2  = 0.57 for the opalescent dentine example. Likelihood-ratio statistic in this case is distributed as chi-square on one degree of freedom, hence non-significant. MATRIX ALGEBRA BASICS A–1 Inverse of A AT or A' Transpose of A |A| Determinant of A A  B rxc cxr A postmultiplied by B: conformable for multiplication since number of columns of A = number of rows of B. Resulting matrix is r x r. Trace of matrix A Tr (A) HISTORY (MX INTRODUCTORY WORKSHOP) Maximum-likelihood estimation using linear covariance structure models, e.g. fitting models to twin data: Let p be the number of observed variables, the expected covariance matrix be E, and the expected vector of means be , where E and  are functions of q free parameters to be estimated from the data. Let x1, x2…xn denote to observed variables. Assuming that the observed variables follow a multivariate normal distribution, the loglikelihood of the observed data is given by n ln L1   n ln2 E   1  xi  μ T E1 xi  μ  2 2 i 1 This is the formula used for maximum likelihood model-fitting to raw continuous data, assuming a multivariate normal distribution. (Often, 2 ln L, is estimated). Requires that we provide: (a) model for expected covariance matrix – e.g. in terms of additive genetic, shared and non-shared environmental variance components – that will vary as a function of relationship; (b) model for expected means – in simplest applications we might estimate a separate mean that might differ by gender, or possibly by twin pair zygosity. EXAMPLE: MZ Pairs: VA  VC  VE  VA  VC  E=  = [ m , m] T DZ Pairs: E= =  VA  VC  VE  1   2 VA  VC VA  VC  VA  VC  VE   1V A 2    VE    VC VA  VC [ m , m]T Where m is estimate of population mean, VA, VC, VE and are additive genetic, shared environmental and non-shared environmental variances, all estimated jointly from the data. Compare e.g. ln L1 for VA VC VE model ln L0 for VC VE model 2(ln L1 – ln L0) distributed as chi-square on one degree of freedom as before. HISTORY (MX INTRODUCTORY WORKSHOP) - II In practice most applications in the MX Introductory Workshop fitted models to summary covariance matrices. We can simplify (e.g. Sham, p. 238): n (1) x  μ T E 1 xi  μ  ln L1   n ln2 E   1  i 1 i 2 2 (2)   n ln2   ln E  m  μ T E 1 m  μ   tr E 1S  2   Where m is the vector of sample means of the p observed variables, S is the observed covariance matrix. For the simple applications in the Introductory Workshop, we had no predictions for the means structure, so we can saturate that component of the model (i.e. estimate a separate mean for every observed mean), equivalent to deleting the term (m)T E1 (m) in (2). Thus the log-likelihood of the observed data becomes (3) ln L1   n ln2   ln E  tr E 1S 2    Under a saturated model, which equates every element of E to the corresponding element of S (i.e. a perfect fit model) we have for the log-likelihood ln L0     n ln2 2  n ln2 2    ln S  tr S 1S   ln S  p   HISTORY (MX INTRODUCTORY WORKSHOP) - II Thus the likelihood-radio test of the fitted model against the saturated model becomes    2ln L0  ln L1   n ln E  ln S  tr E1S  p For multiple group problem, sum over groups.  (4) Analysis of Australian BMI data-young female MZ twins pairs-MX DYI version Analysis of Australian BMI data-young female MZ twins pairs-MX DIY version ! DA NG=1 NI=2 NO=0 begin matrices; A LO 1 1 FR ! Additive genetic variance C LO 1 1 FR ! Shared environmental variance E LO 1 1 FR ! Non-shared environmental variance M FU 2 2 ! This will be observed MZ covariance matrix D FU 2 2 ! This will be observed DZ covariance matrix g fu 1 1 ! coefficient of 0.5 for DZ pairs n fu 1 1 ! sample size for MZ pairs (female in this illustration) k fu 1 1 ! sample size for DZ pairs (female in this illustration) p fu 1 1 ! order of matrices (i.e. number of variables =2 in this case) end matrices; mat g 0.5 mat p 2 mat m 0.7247 0.5891 0.5891 0.7915 mat n 532 mat d 0.7786 0.2461 0.2461 0.8365 mat k 326 mat a 0.25 mat c 0.25 mat e 0.25 Analysis of Australian BMI data-young female MZ twins pairs-MX DYI version (ctd) BEGIN ALGEBRA; t=n+k; ! total sample size U=(A+C+E | A+C _ A+C | A+C+E); ! Expected MZ covariance matrix V=(A+C+E | g*A+C _ g*A+C | A+C+E); ! Expected DZ covariance matrix H=n*(\ln (\det(U))-\ln (\det(M)) + \tr((U~ *M))-p); ! fit-function for MZ group J=k*(\ln (\det(V))-\ln (\det(D)) + \tr((V~ *D))-p); ! fit-function for DZ group F=h+j; END ALGEBRA; bo 0.01 1.0 e(1,1) bo 0.0 1.0 c(1,1) a(1,1) CO F ; option user df=6 end PRE-HISTORY (STATISTICS 101) - II LINEAR REGRESSION: requires weaker assumptions than linear covariance structure models. Does not assume multivariate normal distribution, only homoscedastic residuals. Flexible for handling selective sampling schemes where we oversample extreme values of predictor variable(s). We can fit linear regression models by maximum likelihood e.g. using MX. HISTORY! MX INTRODUCTORY WORKSHOP - III “Definition variables” – an option in MX when fitting to raw data, which allows us to model effects of some variables as fixed effects, modeling their contribution to expected means. Simple example: controlling for linear or polynomial regression of a quantitative measure on age. Don’t want to model covariance structure with age (which probably has rectangular distribution!) Definition variables = variables whose values may vary from individual to individual, that can be read into matrices Important example: genotypes at a given locus or set of loci.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD …