Maximum Likelihood Estimation for Allele Frequencies
Biostatistics 666, Lecture 7

Last Three Lectures: Introduction to Coalescent Models
- Computationally efficient framework
  • Alternative to forward simulations
- Predictions about sequence variation
  • Number of polymorphisms
  • Frequency of polymorphisms

Coalescent Models: Key Ideas
- Proceed backwards in time
- Genealogies shaped by
  • Population size
  • Population structure
  • Recombination rates
- Given a particular genealogy ...
  • Mutation rate predicts variation

Next Series of Lectures
- Estimating allele and haplotype frequencies from genotype data
  • Maximum likelihood approach
  • Application of an E-M algorithm
- Challenges
  • Using information from related individuals
  • Allowing for non-codominant genotypes
  • Allowing for ambiguity in haplotype assignments

Objective: Parameter Estimation
- Learn about population characteristics
  • E.g. allele frequencies, population size
- Using a specific sample
  • E.g. a set of sequences, unrelated individuals, or even families

Maximum Likelihood
- A general framework for estimating model parameters
- Find the set of parameter values that maximizes the probability of the observed data
- Applicable to many different problems

Example: Allele Frequencies
- Consider ...
  • A sample of n chromosomes
  • X of these are of type "a"
  • The parameter of interest is the allele frequency p
- The likelihood is

  L(p | n, X) = \binom{n}{X} p^X (1 - p)^{n - X}

Evaluate for Various Parameters (n = 10, X = 4)

  p      1 - p    L
  0.0    1.0      0.000
  0.2    0.8      0.088
  0.4    0.6      0.251
  0.6    0.4      0.111
  0.8    0.2      0.006
  1.0    0.0      0.000

[Figure: likelihood plotted against allele frequency for n = 10 and X = 4; the curve peaks at p = 0.4]

In This Case
- The likelihood tells us the data are most probable if p = 0.4
- The likelihood curve allows us to evaluate alternatives ...
  • Is p = 0.8 a possibility?
  • Is p = 0.2 a possibility?

Example: Estimating 4Nµ
- Consider S polymorphisms in a sample of n sequences ...

  L(\theta | n, S) = P_n(S | \theta)

- where P_n is calculated using the Q_n and P_2 functions defined previously

[Figure: likelihood plotted against θ = 4Nµ for n = 5 and S = 10, with the MLE marked]

Maximum Likelihood Estimation
- Two basic steps ...
  a) Write down the likelihood function L(\theta | x) \propto f(x | \theta)
  b) Find the value \hat{\theta} that maximizes L(\theta | x)
- In principle, applicable to any problem where a likelihood function exists

MLEs
- Parameter values that maximize the likelihood
  • The θ under which the observations have maximum probability
- Finding MLEs is an optimization problem
- How do MLEs compare to other estimators?

Comparing Estimators
- How do MLEs rate in terms of ...
  • Unbiasedness
  • Consistency
  • Efficiency
- For a review, see Garthwaite, Jolliffe, and Jones (1995), Statistical Inference, Prentice Hall

Analytical Solutions
- Write out the log-likelihood

  l(\theta | \text{data}) = \ln L(\theta | \text{data})

- Calculate the derivative of the log-likelihood

  \frac{d\, l(\theta | \text{data})}{d\theta}

- Find the zeros of the derivative function

Information
- The second derivative is also extremely useful

  I_\theta = -E\left[\frac{d^2 l(\theta | \text{data})}{d\theta^2}\right], \qquad V_{\hat{\theta}} = \frac{1}{I_\theta}

- It measures how quickly the log-likelihood falls away from its maximum
- It provides an asymptotic variance for estimates

Allele Frequency Estimation ...
- When individual chromosomes are observed, this does not seem tricky ...
- What about with genotypes?
- What about with parent-offspring pairs?

Coming Up ...
- We will walk through allele frequency estimation in three distinct settings:
  • Samples of single chromosomes ...
  • Samples of unrelated individuals ...
  • Samples of parents and offspring ...
- A short numerical sketch of the single-chromosome (binomial) example follows below
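As a concrete companion to the binomial example above, here is a minimal Python sketch (mine, not from the lecture; the function name binomial_likelihood is hypothetical) that evaluates L(p | n, X) on a grid for n = 10 and X = 4, reproducing the likelihood table and the peak of the likelihood plot:

```python
from math import comb

def binomial_likelihood(p, n, x):
    """L(p | n, X): probability of observing x copies of allele 'a'
    among n sampled chromosomes when the allele frequency is p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, x = 10, 4
grid = [i / 100 for i in range(101)]  # p = 0.00, 0.01, ..., 1.00
p_hat = max(grid, key=lambda p: binomial_likelihood(p, n, x))
print(f"grid maximum at p = {p_hat:.2f}")  # 0.40

# Reproduce the table of likelihood values shown earlier
for p in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"L({p:.1f}) = {binomial_likelihood(p, n, x):.3f}")
```

The grid maximum lands exactly on 0.40 here because the analytic MLE X/n happens to fall on the grid.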
I. Single Alleles Observed
- Consider ...
  • A sample of n chromosomes
  • X of these are of type "a"
  • The parameter of interest is the allele frequency p
- The likelihood is

  L(p | n, X) = \binom{n}{X} p^X (1 - p)^{n - X}

Some Notes
- The following two likelihoods are just as good:

  L(p; X, n) = \binom{n}{X} p^X (1 - p)^{n - X}

  L(p; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}

- For ML estimation, constant factors in the likelihood don't matter

Analytic Solution
- The log-likelihood

  \ln L(p | n, X) = \ln \binom{n}{X} + X \ln p + (n - X) \ln(1 - p)

- The derivative

  \frac{d \ln L(p | n, X)}{dp} = \frac{X}{p} - \frac{n - X}{1 - p}

- Setting the derivative to zero gives \hat{p} = X / n

Samples of Individual Chromosomes
- The natural estimator (where we count the proportion of sequences of a particular type) and the MLE give identical solutions
- Maximum likelihood provides a justification for using the "natural" estimator

II. Genotypes Observed

Consider a Set of Genotypes ...
- Use the notation n_ij to denote the number of individuals with genotype A_i A_j
- Sample of n individuals

  Genotype     A1A1   A1A2   A2A2   Total
  Observed     n11    n12    n22    n = n11 + n12 + n22
  Frequency    p11    p12    p22    1.0

Allele Frequencies by Counting ...
- A natural estimate of each allele frequency is the proportion of the 2n sampled alleles that are of that type

  Allele       A1                A2                Total
  Observed     n1 = 2n11 + n12   n2 = 2n22 + n12   2n = n1 + n2
  Frequency    p1 = n1 / 2n      p2 = n2 / 2n      1.0

MLE Using Genotype Data ...
- Consider a sample of genotype counts n11, n12, and n22, as above
- Assuming Hardy-Weinberg proportions, the likelihood as a function of the allele frequency p_1 (with p_2 = 1 - p_1) is

  L(p_1; n) = \frac{n!}{n_{11}! \, n_{12}! \, n_{22}!} \, (p_1^2)^{n_{11}} (2 p_1 p_2)^{n_{12}} (p_2^2)^{n_{22}}

Which Gives ...
- The log-likelihood and its derivative

  l = \ln L = (2 n_{11} + n_{12}) \ln p_1 + (2 n_{22} + n_{12}) \ln(1 - p_1) + C

  \frac{dl}{dp_1} = \frac{2 n_{11} + n_{12}}{p_1} - \frac{2 n_{22} + n_{12}}{1 - p_1}

- Giving the MLE as

  \hat{p}_1 = \frac{2 n_{11} + n_{12}}{2 (n_{11} + n_{12} + n_{22})}

Samples of Unrelated Individuals
- Again, the natural estimator (where we count the proportion of alleles of a particular type) and the MLE give identical solutions
- Maximum likelihood provides a justification for using the "natural" estimator

III. Parent-Offspring Pairs
- Counts for the possible parent-child genotype combinations, with N pairs in total:

  Child \ Parent   A1A1      A1A2           A2A2      Total
  A1A1             a1        a2             0         a1 + a2
  A1A2             a3        a4             a5        a3 + a4 + a5
  A2A2             0         a6             a7        a6 + a7
  Total            a1 + a3   a2 + a4 + a6   a5 + a7   N

Probability for Each Observation

  Child \ Parent   A1A1        A1A2        A2A2       Total
  A1A1             p1^3        p1^2 p2     0          p1^2
  A1A2             p1^2 p2     p1 p2       p1 p2^2    2 p1 p2
  A2A2             0           p1 p2^2     p2^3       p2^2
  Total            p1^2        2 p1 p2     p2^2       1.0

Which Gives ...

  \ln L = a_1 \ln(p_1^3) + (a_2 + a_3) \ln(p_1^2 p_2) + a_4 \ln(p_1 p_2) + (a_5 + a_6) \ln(p_1 p_2^2) + a_7 \ln(p_2^3) + \text{constant}
        = B \ln p_1 + C \ln(1 - p_1)

- where

  p_2 = 1 - p_1
  B = 3 a_1 + 2 (a_2 + a_3) + a_4 + (a_5 + a_6)
  C = (a_2 + a_3) + a_4 + 2 (a_5 + a_6) + 3 a_7

- so that

  \hat{p}_1 = \frac{B}{B + C}

Samples of Parent-Offspring Pairs
- The natural estimator (where we count the proportion of alleles of a particular type) and the MLE no longer give identical solutions
- In this case, we expect the MLE to be more accurate, as the numerical sketch below illustrates
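As a quick numerical check of this claim (a sketch with hypothetical function names and made-up counts, not lecture data), the following compares the MLE B/(B + C) with one reading of the "natural" allele-counting estimator, which treats each pair as four independent alleles and therefore double-counts the transmitted allele:

```python
def parent_offspring_mle(a):
    """MLE of p1 from parent-offspring pair counts a = (a1, ..., a7),
    using ln L = B ln p1 + C ln(1 - p1) as derived above."""
    a1, a2, a3, a4, a5, a6, a7 = a
    b = 3 * a1 + 2 * (a2 + a3) + a4 + (a5 + a6)
    c = (a2 + a3) + a4 + 2 * (a5 + a6) + 3 * a7
    return b / (b + c)

def allele_counting_estimate(a):
    """'Natural' estimator: the proportion of A1 alleles among the four
    alleles in each pair, double-counting the shared transmitted allele."""
    a1, a2, a3, a4, a5, a6, a7 = a
    n_a1 = 4 * a1 + 3 * (a2 + a3) + 2 * a4 + (a5 + a6)
    return n_a1 / (4 * sum(a))

counts = (30, 20, 20, 25, 10, 10, 5)  # made-up example counts a1..a7
print(parent_offspring_mle(counts), allele_counting_estimate(counts))
```

For these counts the two estimates differ (about 0.642 versus 0.646), illustrating that counting alleles in related individuals is no longer equivalent to maximum likelihood.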
Comparing Sampling Strategies
- We can compare sampling strategies by calculating the information for each one

  I_\theta = -E\left[\frac{d^2 l(\theta | \text{data})}{d\theta^2}\right], \qquad V_{\hat{\theta}} = \frac{1}{I_\theta}

- Which one do you expect to be most informative?

How Informative Is Each Setting?
- Single chromosomes: Var(\hat{p}) = pq / N
- Unrelated individuals: Var(\hat{p}) = pq / (2N)
- Parent-offspring pairs: Var(\hat{p}) = pq / (3N - a_4)

Other Likelihoods
- Allele frequencies when individuals are ...
  • Diagnosed for a Mendelian disorder
  • Genotyped at two neighboring loci
  • Phenotyped for the ABO blood groups
- Many other interesting problems ...
- ... but some have no analytical solution

Today's Summary
- Examples of maximum likelihood
- Allele frequency estimation from
  • Allele counts
  • Genotype counts
  • Pairs of individuals

Take-Home Reading
- Excoffier and Slatkin (1996)
- Introduces the E-M algorithm
- Widely used for maximizing likelihoods in genetic problems

Properties of Estimators (For Review)

Unbiasedness
- An estimator is unbiased if E(\hat{\theta}) = \theta
- In general, bias(\hat{\theta}) = E(\hat{\theta}) - \theta
- Multiple unbiased estimators may exist
- Other properties may also be desirable

Consistency
- An estimator is consistent if, for any \varepsilon > 0,

  P(|\hat{\theta} - \theta| > \varepsilon) \to 0 \text{ as } n \to \infty

- The estimate converges to the true value in probability as the sample size increases

Mean Squared Error
- MSE is defined as

  MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta})^2

- If MSE \to 0 as n \to \infty, then the estimator must be consistent
  • The reverse is not true

Efficiency
- The relative efficiency of two estimators is the ratio of their variances:

  \text{if } \frac{\mathrm{var}(\hat{\theta}_2)}{\mathrm{var}(\hat{\theta}_1)} > 1, \text{ then } \hat{\theta}_1 \text{ is more efficient}

- The comparison is only meaningful for estimators with equal biases

Sufficiency
- Consider ...
  • Observations X_1, X_2, ..., X_n
  • A statistic T(X_1, X_2, ..., X_n)
- T is a sufficient statistic if it includes all information about \theta in the sample
  • The distribution of the X_i conditional on T is independent of \theta
  • The posterior distribution of \theta conditional on T is independent of the X_i

Minimal Sufficient Statistic
- There can be many alternative sufficient statistics
- A statistic is a minimal sufficient statistic if it can be expressed as a function of every other sufficient statistic

Typical Properties of MLEs
- Bias: can be biased or unbiased
- Consistency: subject to regularity conditions, MLEs are consistent
- Efficiency: typically, MLEs are asymptotically efficient estimators
- Sufficiency: often, but not always
- See Cox and Hinkley, 1974

Strategies for Likelihood Optimization (For Review)

Generic Approaches
- Suitable for when analytical solutions are impractical
- Bracketing
- The simplex method
- Newton-Raphson

Bracketing
- Find 3 points such that
  • \theta_a < \theta_b < \theta_c
  • L(\theta_b) > L(\theta_a) and L(\theta_b) > L(\theta_c)
- Search for the maximum by
  • Selecting a trial point in the interval
  • Keeping the maximum and its flanking points

[Figure: successive bracketing steps, numbered 1-6]

The Simplex Method
- Calculate likelihoods at the simplex vertices
  • A simplex is a geometric shape with k + 1 corners
  • E.g. a triangle in k = 2 dimensions
- At each step, move the worst vertex in the direction of the better points

The Simplex Method II

[Figure: the original simplex with its high and low vertices, and the candidate moves: reflection, reflection and expansion, contraction, and multiple contraction]

One-Parameter Maximization
- A simple but inefficient approach
- Consider
  • Parameters \theta = (\theta_1, \theta_2, ..., \theta_k)
  • Likelihood function L(\theta; x)
- Maximize L with respect to each \theta_i in turn
  • Cycle through the parameters

[Figure: zig-zag search path in the (\theta_1, \theta_2) plane, illustrating the inefficiency]

Steepest Descent
- Consider
  • Parameters \theta = (\theta_1, \theta_2, ..., \theta_k)
  • Likelihood function L(\theta; x)
- Score vector

  S = \frac{d \ln L}{d\theta} = \left( \frac{d \ln L}{d\theta_1}, \ldots, \frac{d \ln L}{d\theta_k} \right)

- Find the maximum along the line \theta + \delta S
- Still inefficient: consecutive steps are perpendicular!
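To show one of these generic approaches in action (an illustrative setup of mine, not part of the lecture), SciPy's Nelder-Mead implementation of the simplex method can maximize a log-likelihood numerically by minimizing its negative; here it recovers the binomial MLE from the earlier example:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, n=10, x=4):
    """Negative binomial log-likelihood; generic optimizers minimize,
    so we flip the sign of ln L(p | n, X). The constant term
    ln binom(n, x) is omitted, since it does not affect the maximizer."""
    p = theta[0]
    if not 0.0 < p < 1.0:
        return np.inf  # reject trial points outside (0, 1)
    return -(x * np.log(p) + (n - x) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=[0.5], method="Nelder-Mead")
print(result.x[0])  # ~0.4, matching the analytic MLE X / n
```

Nelder-Mead needs only likelihood evaluations, no derivatives, which is why it is a common fallback when the score function is awkward to derive.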
Local Approximations to the Log-Likelihood Function
- In the neighborhood of \theta_i,

  l(\theta) \approx l(\theta_i) + S (\theta - \theta_i) - \frac{1}{2} (\theta - \theta_i)^t I_\theta (\theta - \theta_i)

- where
  • l(\theta) = \ln L(\theta) is the log-likelihood function
  • S = d l(\theta_i) / d\theta is the score vector
  • I_\theta = -d^2 l(\theta_i) / d\theta^2 is the observed information matrix

Newton's Method
- Maximize the approximation by setting its derivative to zero ...

  S - I (\theta - \theta_i) = 0

- ... which gives a new trial point

  \theta_{i+1} = \theta_i + I^{-1} S

Fisher Scoring
- Use the expected information matrix instead of the observed information:

  E\left[-\frac{d^2 l(\theta)}{d\theta^2}\right] \quad \text{instead of} \quad -\frac{d^2 l(\theta | \text{data})}{d\theta^2}

- Compared to Newton-Raphson:
  • Converges faster when estimates are poor
  • Converges slower when close to the MLE
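Tying this back to the allele-frequency example (my sketch, with a hypothetical function name), here is the scoring update \theta_{i+1} = \theta_i + I^{-1} S applied to the binomial likelihood, using the expected information I = n / (p (1 - p)):

```python
def fisher_scoring(n, x, p=0.25, tol=1e-10, max_iter=50):
    """Fisher scoring for the binomial allele-frequency likelihood:
    iterate p <- p + score / information until the step is negligible."""
    for _ in range(max_iter):
        score = x / p - (n - x) / (1 - p)  # S = d ln L / dp
        info = n / (p * (1 - p))           # expected (Fisher) information
        step = score / info
        p += step
        if abs(step) < tol:
            break
    return p

print(fisher_scoring(10, 4))  # 0.4, the analytic MLE X / n
```

For this model the update p + S/I simplifies algebraically to X/n, so scoring converges in a single step; at the MLE, 1/I = p̂(1 - p̂)/n reproduces the Var(p̂) = pq/N quoted earlier for samples of single chromosomes.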