Genome-wide association studies
Usman Roshan

Recap
• Single nucleotide polymorphism (SNP)
• Genome-wide association studies
– Relative risk, and the odds ratio as an approximation to relative risk
– Chi-square test to determine significant SNPs
– Logistic regression model for determining the odds ratio

SNP genotype representation
• The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
is represented as CG CC GG …

SNP genotype encoding
• If a SNP is A/B (alphabetically ordered), then count the number of times we see B.
• The previous example becomes:

SNP:  A/T  C/T  G/T  …
H0:   AA   TT   GG   …  =>  0  2  0  …
H1:   AT   CC   GT   …  =>  1  0  1  …
H2:   AA   CT   GT   …  =>  0  1  1  …

• Now we have the data in numerical format.

Example GWAS

       Case1  Case2  Case3  Con1  Con2  Con3
A/T     AA     AT     AA     TT    TT    TA
C/G     CC     CG     CG     GG    CC    CG
A/G     AA     AA     AA     GG    GG    GG

Encoded data

       Case1  Case2  Case3  Con1  Con2  Con3
A/T      0      1      0     2     2     1
C/G      0      1      1     2     0     1
A/G      0      0      0     2     2     2

Example

         AA   AC   CC
Case     80   15    5
Control  70   15   15

• Odds of AA in cases = (80/100)/(20/100) = 4
• Odds of AA in controls = (70/100)/(30/100) = 7/3
• Odds ratio of AA = 4/(7/3) = 12/7

Chi-square statistic
• Define the statistic

$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$

where
c_i = observed frequency for the ith outcome
e_i = expected frequency for the ith outcome
n = total outcomes
• The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. A proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf
• Great. But how do we use this to get a SNP p-value?

Null hypothesis for case-control contingency table
• We have two random variables:
– D: disease status
– G: allele type
• Null hypothesis: the two variables are independent of each other (unrelated).
• Under independence:
– P(D,G) = P(D)P(G)
– P(D=case) = (c1+c2+c3)/n
– P(G=AA) = (c1+c4)/n
• Expected values:
– E(X1) = P(D=case) P(G=AA) n, where X1 is the count in the first cell (case, AA)

         AA   AC   CC
Case     c1   c2   c3
Control  c4   c5   c6

n = c1+c2+c3+c4+c5+c6

• We can calculate the chi-square statistic for a given SNP and the probability that it is independent of disease status (using the p-value). SNPs with very small probabilities deviate significantly from the independence assumption and are therefore considered important.

Chi-square statistic exercise
• Compute the expected values and the chi-square statistic.
• Compute the p-value by referring to the chi-square distribution (a sketch of this computation follows below).

         AA   AC   CC
Case     80   15    5
Control  60   15   25
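The following is a minimal sketch of this exercise in Python. It computes the expected counts and the chi-square statistic from the formula above, then cross-checks against scipy.stats.chi2_contingency; the use of SciPy is an implementation choice, not something prescribed by the slides. Note that for an r x c contingency table with margins estimated from the data, the test has (r-1)(c-1) degrees of freedom, which is 2 here.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Contingency table from the exercise: rows = case/control, columns = AA/AC/CC
observed = np.array([[80, 15, 5],
                     [60, 15, 25]])

# Expected counts under independence: e_ij = (row total * column total) / n
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-square statistic: sum over cells of (c - e)^2 / e
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)  # (2-1)*(3-1) = 2
pval = chi2.sf(stat, dof)  # upper-tail probability of the chi-square distribution
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {pval:.4g}")

# Cross-check with SciPy's built-in test; correction=False disables the
# Yates continuity correction (which only applies to 2x2 tables anyway)
stat2, pval2, dof2, expected2 = chi2_contingency(observed, correction=False)
assert np.isclose(stat, stat2) and np.isclose(pval, pval2)
```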
Logistic regression
• The odds ratio estimated directly from the contingency table has a skewed sampling distribution.
• A better (discriminative) approach is to model the log likelihood ratio as a linear function:

$\log \frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = w^T G + w_0$

• Why:
– The log likelihood ratio is a powerful statistic.
– Modeling it as a linear function yields a simple algorithm for estimating the parameters.
• G is the number of copies of the risk allele.
• With some manipulation this becomes

$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$

How do we get the odds ratio from logistic regression? (I)
• Using Bayes' rule we have

$\log \frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = w^T G + w_0$

• By exponentiating both sides we get

$\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = e^{w^T G + w_0}$

• And by taking the ratio with G=1 and G=0 we get

$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=1 \mid D=\text{control})}{\Pr(G=0 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{control})} = \frac{e^{w \cdot 1 + w_0}}{e^{w \cdot 0 + w_0}} = e^{w}$

How do we get the odds ratio from logistic regression? (II)
• Continued from the previous slide: by rearranging the terms in the numerator and denominator we get

$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) / \Pr(G=0 \mid D=\text{control})}$

• By symmetry of the odds ratio this is

$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) / \Pr(G=0 \mid D=\text{control})} = \frac{\Pr(D=\text{case} \mid G=1) / \Pr(D=\text{control} \mid G=1)}{\Pr(D=\text{case} \mid G=0) / \Pr(D=\text{control} \mid G=0)} = \text{OR (odds ratio)}$

• Since the original ratio (see previous slide) is equal to $e^w$ and is equal to the odds ratio, we conclude that the odds ratio is given by $e^w$.

How to find w and w0?
• So $e^w$ is our odds ratio. But how do we find w and w0?
– We assume that one's disease status D given genotype G is a Bernoulli random variable.
– Using this we form the sample likelihood.
– Differentiate the likelihood with respect to w and w0.
– Use gradient descent.

Today
• Basic classification problem
• Maximum likelihood
• Logistic regression likelihood
• Algorithm for the logistic regression likelihood
• Support vector machine

Supervised learning for two classes
• We are given n training samples (x_i, y_i) for i = 1..n drawn i.i.d. from a probability distribution P(x,y).
• Each x_i is a d-dimensional vector (x_i ∈ R^d) and y_i is +1 or -1.
• Our problem is to learn a function f(x) for predicting the labels of test samples x_i' ∈ R^d for i = 1..n', also drawn i.i.d. from P(x,y).

Loss function
• Loss function: c(x, y, f(x))
• Maps to [0, ∞)
• Examples:

$c(x,y,f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases}$

$c(x,y,f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ c'(x) & \text{otherwise} \end{cases}$

$c(x,y,f(x)) = (y - f(x))^2$

Test error
• We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes:

$R_{\text{test}}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{j=1}^{2} c(x_i', y_j, f(x_i'))\, P(y_j \mid x_i')$

• We would like to find the f that minimizes this, but we need P(y|x), which we don't have access to.

Expected risk
• Suppose we didn't have test data (x'). Then we average the test error over all possible data points x:

$R[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x))\, P(y_j, x)$

• We want to find the f that minimizes this, but we don't have all data points. We only have training data.

Empirical risk
• Since we only have training data we can't calculate the expected risk (we don't even know P(x,y)).
• Solution: approximate P(x,y) with the empirical distribution

$p_{\text{emp}}(x,y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$

• The delta function $\delta_x(y) = 1$ if x = y and 0 otherwise.

Empirical risk
• We can now define the empirical risk as

$R_{\text{emp}}[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x))\, p_{\text{emp}}(y_j, x) = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))$

• Once the loss function is defined and training data is given, we can find the f that minimizes this.

Example of minimizing empirical risk (least squares)
• Suppose we are given n data points (x_i, y_i) where each x_i ∈ R^d and y_i ∈ R. We want to determine a linear function f(x) = ax + b for predicting test points.
• Loss function: c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2
• What is the empirical risk?

Empirical risk for least squares

$R_{\text{emp}}[f] = \sum_{i=1}^{n} c(x_i, y_i, f(x_i)) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2$

• Finding f has now reduced to finding a and b. Since this function is convex in a and b, there is a global optimum, which is easy to find by setting the first derivatives to 0, as sketched below.
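As a concrete illustration of the least squares example, here is a minimal sketch in Python, assuming one-dimensional x (matching f(x) = ax + b on the slide). It minimizes the empirical risk using the closed form obtained by setting the derivatives with respect to a and b to zero; the function name and toy data are made up for illustration.

```python
import numpy as np

def fit_least_squares(x, y):
    """Minimize R_emp[f] = sum_i (y_i - a*x_i - b)^2 for one-dimensional x.

    Setting dR/da = 0 and dR/db = 0 gives the usual closed form:
    a = cov(x, y) / var(x),  b = mean(y) - a * mean(x).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    a = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()
    b = ybar - a * xbar
    return a, b

# Toy usage with made-up points: recover the line y = 2x + 1 from noisy samples
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)
a, b = fit_least_squares(x, y)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")
```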
Empirical risk for logistic regression
• Recall the logistic regression model:

$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$

• where G is the number of copies of the risk allele.
• In order to use this to predict one's risk of disease, or to determine the odds ratio, we need to know w and w0.
• We use maximum likelihood to find w and w0, but it doesn't yield a simple closed-form solution like least squares.

Maximum likelihood
• We can classify by simply selecting the model M that has the highest P(M|D), where D = data and M = model. Thus classification can also be framed as the problem of finding the M that maximizes P(M|D).
• By Bayes' rule:

$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)} = \frac{P(D \mid M)\, P(M)}{\sum_{M'} P(D \mid M')\, P(M')}$

Maximum likelihood
• Suppose we have k models to consider, each with the same probability. In other words, we have a uniform prior distribution P(M) = 1/k. Then

$P(M \mid D) = \frac{P(D \mid M)\, \frac{1}{k}}{\sum_{M'} P(D \mid M')\, P(M')} \propto P(D \mid M)$

• In this case we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.

Maximum likelihood
• Suppose we have n i.i.d. samples (x_i, y_i) drawn from M. The likelihood P(D|M) is

$P(D \mid M) = P((x_1,y_1),\dots,(x_n,y_n) \mid M) = \prod_{i=1}^{n} P(x_i, y_i \mid M) = \prod_{i=1}^{n} P(y_i \mid x_i, M)\, P(x_i)$

• Consequently the log likelihood is

$\log P(D \mid M) = \sum_{i=1}^{n} \log\big(P(y_i \mid x_i, M)\, P(x_i)\big)$

Maximum likelihood and empirical risk
• Maximizing the likelihood P(D|M) is the same as maximizing log P(D|M), which is the same as minimizing -log P(D|M).
• Set the loss function to

$c(x_i, y_i, f(x_i)) = -\log P(y_i \mid x_i, f)$

• Now minimizing the empirical risk is the same as maximizing the likelihood.

Maximum likelihood example
• Consider a set of coin tosses produced by a coin with P(H) = p (so P(T) = 1-p).
• We want to determine the probability P(H) of the coin that produced HTHHHTHHHTHTH.
• Solution:
– Form the log likelihood.
– Differentiate with respect to p.
– Set the derivative to 0 and solve for p.
• How about the probability P(H) of a coin that produces k heads and n-k tails?

Classification by likelihood
• Suppose we have two classes C1 and C2.
• Compute the likelihoods P(D|C1) and P(D|C2).
• To classify test data D', assign it to class C1 if P(D'|C1) is greater than P(D'|C2), and to C2 otherwise.

Logistic regression (in light of disease risk prediction)
• We assume that the probability of disease given the genotype is

$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$

• where G is the genotype in numeric format.
• Problem: given training data, we want to estimate w and w0 by maximum likelihood.

Maximum likelihood for logistic regression
• Assume that one's disease status given genotype is a Bernoulli random variable with probability P(D|G_i). In other words, training sample i is a case with probability P(D=case|G_i) and a control with probability 1 - P(D=case|G_i).
• Assume we have m cases and n-m controls, ordered so that the cases come first.
• The likelihood is given by

$P(\text{data} \mid w, w_0) = \prod_{i=1}^{m} P(D=\text{case} \mid G_i) \prod_{i=m+1}^{n} \big(1 - P(D=\text{case} \mid G_i)\big)$

Maximum likelihood for logistic regression
• The negative log likelihood is

$-\log P(\text{data} \mid w, w_0) = -\sum_{i=1}^{m} \log P(D=\text{case} \mid G_i) - \sum_{i=m+1}^{n} \log\big(1 - P(D=\text{case} \mid G_i)\big)$

$= -\sum_{i=1}^{m} \log \frac{1}{1 + e^{-(w^T G_i + w_0)}} - \sum_{i=m+1}^{n} \log\Big(1 - \frac{1}{1 + e^{-(w^T G_i + w_0)}}\Big)$

• Set the first derivatives to 0 and solve for w and w0.
• There is no closed form, so we have to use gradient descent.

Gradient descent
• Given a convex function f(x,y), how can we find the x and y that minimize it?
• The solution is gradient descent:
– The gradient vector points in the direction of greatest increase of the function.
– Therefore the solution is to move in small increments in the negative direction of the gradient until we reach the optimum, as in the sketch below.
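Below is a minimal sketch of gradient descent for the logistic regression likelihood above, for a single SNP with genotypes encoded as 0/1/2 copies of the risk allele. The function name, learning rate, iteration count, and toy data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(G, y, lr=0.01, iters=5000):
    """Fit Pr(D=case|G) = 1/(1+exp(-(w*G + w0))) by gradient descent.

    G: genotypes encoded as 0/1/2 (copies of the risk allele)
    y: 1 for case, 0 for control
    The gradient of the negative log likelihood is
      d/dw  = sum_i (p_i - y_i) * G_i
      d/dw0 = sum_i (p_i - y_i)
    """
    w, w0 = 0.0, 0.0
    for _ in range(iters):
        p = sigmoid(w * G + w0)        # current Pr(case | G_i) for each sample
        grad_w = ((p - y) * G).sum()
        grad_w0 = (p - y).sum()
        w -= lr * grad_w / len(G)      # small step against the gradient
        w0 -= lr * grad_w0 / len(G)
    return w, w0

# Toy usage with made-up data; the per-allele odds ratio is then e^w
G = np.array([0, 1, 0, 2, 1, 2, 2, 1, 0, 2, 1, 2])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
w, w0 = fit_logistic(G, y)
print(f"w = {w:.3f}, w0 = {w0:.3f}, odds ratio = {np.exp(w):.3f}")
```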
Disease risk prediction
• How exactly do we predict risk?
– Personal genomics companies: composite odds ratio score
– Academia: composite odds ratio score and, more recently, other classifiers as well
– Still an open problem: results depend upon the classifier and the set of SNPs

Composite odds ratio score
• Recall that we can obtain the odds ratio from the logistic regression model:

$\text{OR} = \frac{\Pr(D=\text{case} \mid G=1) \,/\, \big(1 - \Pr(D=\text{case} \mid G=1)\big)}{\Pr(D=\text{case} \mid G=0) \,/\, \big(1 - \Pr(D=\text{case} \mid G=0)\big)} = \frac{e^{w \cdot 1 + w_0}}{e^{w \cdot 0 + w_0}} = e^{w}$

• Define $\lambda = e^w$, so that $\lambda_i$ denotes the odds ratio of the ith SNP.
• Now we can predict the risk with n alleles (a sketch follows at the end of this section):

$R(G_1, G_2, \dots, G_n) = \lambda_1^{G_1} \, \lambda_2^{G_2} \cdots \lambda_n^{G_n}$

Example of risk prediction study (type 1 diabetes)

Example of risk prediction study (arthritis)
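Here is a minimal sketch of the composite odds ratio score defined above, assuming the multiplicative form $R = \prod_i \lambda_i^{G_i}$. The function name, SNP panel, odds ratios, and genotypes are made-up values for illustration; in practice each $\lambda_i = e^{w_i}$ would come from a per-SNP logistic regression fit as sketched earlier.

```python
import numpy as np

def composite_odds_ratio_score(genotypes, odds_ratios):
    """Composite risk score R(G_1..G_n) = prod_i lambda_i ** G_i.

    genotypes:   risk-allele counts (0, 1, or 2) for one person
    odds_ratios: per-SNP odds ratios lambda_i = e^{w_i} from logistic regression
    Working in log space avoids overflow for large SNP panels.
    """
    genotypes = np.asarray(genotypes, float)
    odds_ratios = np.asarray(odds_ratios, float)
    log_score = np.sum(genotypes * np.log(odds_ratios))
    return np.exp(log_score)

# Hypothetical example: three SNPs with made-up odds ratios
lam = np.array([1.3, 0.9, 2.1])   # illustrative e^{w_i} values per SNP
G_person = np.array([2, 0, 1])    # this person's risk-allele counts
print(composite_odds_ratio_score(G_person, lam))  # 1.3^2 * 0.9^0 * 2.1^1
```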