Applied statistics
Usman Roshan

A few basic stats
• Expected value of a random variable
  – Example: Bernoulli and binomial
• Variance of a random variable
  – Example: Bernoulli and binomial
• Correlation coefficient (same as the Pearson correlation coefficient)
• Formulas:
  – Covariance(X,Y) = E((X - \mu_X)(Y - \mu_Y))
  – Correlation(X,Y) = Covariance(X,Y) / (\sigma_X \sigma_Y)
  – Pearson correlation (illustrated on the next slide)

Correlation between variables
• Measures the correlation between two variables
• The correlation r is between -1 and 1. A value of 1 means perfect positive correlation and -1 means perfect correlation in the other direction.
• The statistic f(r) = r \sqrt{\frac{n-2}{1-r^2}} has a t-distribution with n - 2 degrees of freedom that can be used to obtain a p-value.

Pearson correlation coefficient
[figure courtesy of Wikipedia]

Basic stats in R
• Mean and variance calculation
  – Define a list and compute its mean and variance
• Correlations
  – Define two lists and compute their correlation (as sketched below)
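A minimal R sketch of these exercises; the data values are made up for illustration.

```r
# Hypothetical data standing in for the slide's two "lists"
x <- c(2.1, 3.4, 1.8, 4.0, 2.9)
y <- c(1.9, 3.6, 2.2, 4.3, 2.5)

mean(x)        # sample mean
var(x)         # sample variance (divides by n - 1)
cor(x, y)      # Pearson correlation coefficient (the default method)
cor.test(x, y) # also reports the t-based p-value with n - 2 degrees of freedom
```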
Statistical inference
• P-value
• Bayes rule
• Posterior probability and likelihood
• Bayesian decision theory
• Bayesian inference under a Gaussian distribution
• Chi-square test, Pearson correlation coefficient, t-test

P-values
• What is a p-value?
  – It is the probability of obtaining an estimate at least as extreme as yours, assuming the data come from some null distribution.
  – For example, if your estimate of the mean is 1 and the data are normally distributed with true mean 0, what is the p-value of your estimate?
  – It is the area under the curve of the normal distribution over all values of the mean that are at least as large as your estimate.
• A small p-value means the data are unlikely under the null distribution, and thus the null distribution can be rejected.
• A large p-value supports the null distribution but may also support other distributions.

P-values from Gaussian distributions
P(x \mid C_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}
[figure courtesy of Wikipedia]

P-values from chi-square distributions
[figure courtesy of Wikipedia]

Distributions in R
• Binomial distribution in R
  – dbinom, pbinom
• Gaussian (normal) distribution in R
  – pnorm
• Calculating p-values in R
  – Suppose the true mean is 0 and your estimated mean is 1. What is the p-value of your estimate?
• Example problem
  – Suppose your estimate of the mean age in NJ is 30 from a sample of 100 people, but the true mean and standard deviation are 25 and 20. The central limit theorem says the sample mean is normally distributed. We use this to determine the p-value of our estimate (sketched below).
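A minimal R sketch of the NJ age example, assuming a one-sided p-value; the numbers come from the slide.

```r
# True mean 25, sd 20, sample of n = 100 people, sample mean 30.
# By the central limit theorem the sample mean is approximately
# normal with mean 25 and standard error 20 / sqrt(100) = 2.
n  <- 100
se <- 20 / sqrt(n)
pnorm(30, mean = 25, sd = se, lower.tail = FALSE)  # one-sided p-value, ~0.006
```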
Type 1 and type 2 errors
[figure courtesy of Wikipedia]

Bayes rule
• Fundamental to statistical inference
• Conditional probability
• Posterior = (Likelihood * Prior) / Normalization
P(M \mid x) = \frac{P(x \mid M)\,P(M)}{P(x)} = \frac{P(x \mid M)\,P(M)}{\sum_{M'} P(x \mid M')\,P(M')}

Hypothesis testing
• We can use Bayes rule to help make decisions
• An outcome or action is described by a model
• Given two models we pick the one with the higher probability
• Coin toss example: use the likelihood to determine which coin generated the tosses

Likelihood example
• Consider a set of coin tosses produced by a coin with P(H) = p (and P(T) = 1 - p)
• We are given some tosses (training data): HTHHHTHHHTHTH
• Was the above sequence produced by a fair coin?
  – What is the probability that a fair coin produced the above sequence of tosses?
  – What is the p-value of your sequence of tosses assuming the coin is fair? This is the same as asking for the probability that a fair coin generates 9 or more heads out of 13 tosses. Let's start with exactly 9. Solve it with R (see the sketch at the end of these notes).
• Was the above sequence more likely to be produced by biased coin 1 (p = 0.85) or biased coin 2 (p = 0.75)?
• Solution:
  – Calculate the likelihood (probability) of the data under each coin
• Alternatively, we can ask which coin maximizes the likelihood.

Maximum likelihood example
• Consider a set of coin tosses produced by a coin with P(H) = p (and P(T) = 1 - p)
• We want to determine the probability P(H) of the coin that produces k heads and n - k tails
• We are given some tosses (training data): HTHHHTHHHTHTH
• Solution:
  – Form the log likelihood: \log L(p) = k \log p + (n-k) \log(1-p)
  – Differentiate with respect to p: \frac{d \log L}{dp} = \frac{k}{p} - \frac{n-k}{1-p}
  – Set the derivative to 0 and solve for p, which gives p = k/n

Likelihood inference
• Assume the data are generated by a Gaussian distribution whose mean and variance are unknown:
P(x \mid C_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \qquad P(x \mid C_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}

Gaussian models
• Assume that the class likelihood is represented by a Gaussian distribution with parameters \mu (mean) and \sigma (standard deviation).
• We find the model (in other words, the mean and variance) that maximizes the likelihood (or equivalently the log likelihood). Suppose we are given training points x_1, x_2, ..., x_{n_1} from class C1. Assuming each datapoint is drawn independently from C1, the sample likelihood is
P(x_1, x_2, \ldots, x_{n_1} \mid C_1) = P(x_1 \mid C_1)\,P(x_2 \mid C_1) \cdots P(x_{n_1} \mid C_1) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_1}\right)^{n_1} e^{-\sum_{i=1}^{n_1} \frac{(x_i-\mu_1)^2}{2\sigma_1^2}}

Gaussian models
• The log likelihood is given by
\log P(x_1, \ldots, x_{n_1} \mid C_1) = -\frac{n_1}{2}\log(2\pi) - n_1 \log(\sigma_1) - \sum_{i=1}^{n_1} \frac{(x_i-\mu_1)^2}{2\sigma_1^2}
• Setting the first derivatives with respect to \mu_1 and \sigma_1 to 0 gives the maximum likelihood estimates of \mu_1 and \sigma_1 (denoted m1 and s1 respectively):
m_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i \qquad s_1^2 = \frac{1}{n_1}\sum_{i=1}^{n_1} (x_i - m_1)^2
• Similarly we determine m2 and s2 for class C2.

Gaussian classification example
• Consider one-dimensional data for two classes (SNP genotypes for case and control subjects):
  – Case (class C1): 1, 1, 2, 1, 0, 2
  – Control (class C2): 0, 1, 0, 0, 1, 1
• Under the Gaussian assumption the case and control classes are represented by Gaussian distributions with parameters (\mu_1, \sigma_1) and (\mu_2, \sigma_2) respectively. The maximum likelihood estimates of the means are
m_1 = \frac{1+1+2+1+0+2}{6} = 7/6 \qquad m_2 = \frac{0+1+0+0+1+1}{6} = 3/6

Gaussian classification example
• The estimates of the class variances are
s_1^2 = \frac{(1-7/6)^2 + (1-7/6)^2 + (2-7/6)^2 + (1-7/6)^2 + (0-7/6)^2 + (2-7/6)^2}{6} \approx 0.47
and similarly s_2^2 = 0.25.
• Which class does x = 1 belong to? What about x = 0 and x = 2? Compare the class log likelihoods (an R sketch appears at the end of these notes):
\log P(x \mid C_1) = -\frac{1}{2}\log(2\pi) - \log(s_1) - \frac{(x-m_1)^2}{2 s_1^2}
\log P(x \mid C_2) = -\frac{1}{2}\log(2\pi) - \log(s_2) - \frac{(x-m_2)^2}{2 s_2^2}
• What happens if the class variances are equal?

Multivariate Gaussian classification
• Suppose each datapoint is an m-dimensional vector. In the previous example we would have m SNP genotypes instead of one. The class likelihood is given by
P(x \mid C_1) = \frac{1}{(2\pi)^{m/2} |\Sigma_1|^{1/2}} e^{-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)}
where \Sigma_1 is the class covariance matrix. \Sigma_1 has dimension m x m, and its (i,j)th entry is the covariance of the ith and jth variables.

Multivariate Gaussian classification
• The maximum likelihood estimates of \mu_1 and \Sigma_1 are
m_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i \qquad S_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} (x_i - m_1)(x_i - m_1)^T
• The class log likelihoods with the estimated parameters (ignoring constant terms) are
\log P(x \mid C_1) = -\frac{1}{2}\log|S_1| - \frac{1}{2}(x-m_1)^T S_1^{-1} (x-m_1)
\log P(x \mid C_2) = -\frac{1}{2}\log|S_2| - \frac{1}{2}(x-m_2)^T S_2^{-1} (x-m_2)

Naïve Bayes algorithm
• If we assume that the variables are independent (no interaction between SNPs) then the off-diagonal terms of S are zero and the log likelihood becomes (ignoring constant terms)
\log P(x \mid C_1) = -\sum_{j=1}^{m} \log(s_{1j}) - \frac{1}{2}\sum_{j=1}^{m} \left(\frac{x_j - m_{1j}}{s_{1j}}\right)^2

Multivariate Gaussian classification
• If S1 = S2 = S then the \log|S| terms cancel and the class log likelihoods with estimated parameters (ignoring constant terms) reduce to
\log P(x \mid C_1) = -\frac{1}{2}(x-m_1)^T S^{-1} (x-m_1) \qquad \log P(x \mid C_2) = -\frac{1}{2}(x-m_2)^T S^{-1} (x-m_2)
• Classification then depends only on the (Mahalanobis) distance to the class means.

Nearest means classifier
• If we further assume that all variances s_j are equal to a common s then (ignoring constant terms) we get
\log P(x \mid C_1) = -\frac{1}{2s^2}\sum_{j=1}^{m} (x_j - m_{1j})^2
so each point is assigned to the class with the nearest mean.

Gaussian classification example
• Consider three SNP genotypes for case and control subjects:
  – Case (class C1): (1,2,0), (2,2,0), (2,2,0), (2,1,1), (0,2,1), (2,1,0)
  – Control (class C2): (0,1,2), (1,1,1), (1,0,2), (1,0,0), (0,0,2), (0,1,0)
• Classify (1,2,1) and (0,0,1) with the nearest means classifier (see the R sketch at the end of these notes).

Chi-square test
• We have two random variables:
  – Label (L): 0 or 1
  – Feature (F): categorical (here with values A and B)
• Null hypothesis: the two variables are independent of each other (unrelated)
• Under independence:
  – P(L,F) = P(L)P(F)
  – P(L=0) = (c1+c2)/n
  – P(F=A) = (c1+c3)/n
• Expected values:
  – E(X1) = P(L=0) P(F=A) n
• Contingency table:

            Feature=A                  Feature=B
  Label=0   Observed=c1, Expected=X1   Observed=c2, Expected=X2
  Label=1   Observed=c3, Expected=X3   Observed=c4, Expected=X4

• We can calculate the chi-square statistic for a given feature and the probability that it is independent of the label (using the p-value):
\chi^2 = \sum_i \frac{(c_i - X_i)^2}{X_i}
• We look up the chi-square value in the distribution with (rows - 1)(cols - 1) degrees of freedom (1 for this 2 x 2 table) to get the p-value.
• Features with very small probabilities deviate significantly from the independence assumption and are therefore considered important.
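To close the notes, here are minimal R sketches for the exercises above. First, the coin-toss exercise; the sequence HTHHHTHHHTHTH contains 9 heads and 4 tails.

```r
# Probability of exactly 9 heads in 13 fair tosses, then the p-value
# (probability of 9 or more heads)
dbinom(9, size = 13, prob = 0.5)                      # ~0.087
pbinom(8, size = 13, prob = 0.5, lower.tail = FALSE)  # ~0.133

# Likelihood of the observed sequence, p^9 * (1 - p)^4, under each biased coin
lik <- function(p) p^9 * (1 - p)^4
lik(0.85)  # biased coin 1
lik(0.75)  # biased coin 2: larger likelihood, so the maximum likelihood choice
```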
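For the one-dimensional SNP example, a sketch that follows the slides' maximum likelihood convention of dividing by n rather than n - 1.

```r
case    <- c(1, 1, 2, 1, 0, 2)  # class C1
control <- c(0, 1, 0, 0, 1, 1)  # class C2

# Maximum likelihood estimates of the means and standard deviations
m1 <- mean(case);    s1 <- sqrt(mean((case - m1)^2))     # m1 = 7/6
m2 <- mean(control); s2 <- sqrt(mean((control - m2)^2))  # m2 = 3/6

# Class log likelihood of a single point x
loglik <- function(x, m, s) -0.5 * log(2 * pi) - log(s) - (x - m)^2 / (2 * s^2)
for (x in c(0, 1, 2))
  cat("x =", x, " C1:", loglik(x, m1, s1), " C2:", loglik(x, m2, s2), "\n")
```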
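For the three-SNP exercise, a nearest means sketch; the helper nearest_mean is hypothetical.

```r
C1 <- rbind(c(1,2,0), c(2,2,0), c(2,2,0), c(2,1,1), c(0,2,1), c(2,1,0))  # case
C2 <- rbind(c(0,1,2), c(1,1,1), c(1,0,2), c(1,0,0), c(0,0,2), c(0,1,0))  # control
m1 <- colMeans(C1)
m2 <- colMeans(C2)

# Assign a point to the class whose mean is closest in squared distance
nearest_mean <- function(x)
  if (sum((x - m1)^2) < sum((x - m2)^2)) "C1" else "C2"

nearest_mean(c(1, 2, 1))  # closer to the case mean
nearest_mean(c(0, 0, 1))  # closer to the control mean
```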
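Finally, a sketch of the chi-square test; the counts are made up, since the slide leaves c1 through c4 symbolic.

```r
# Hypothetical 2 x 2 contingency table of observed counts c1..c4
tab <- matrix(c(30, 10,   # Label=0: Feature=A, Feature=B
                15, 25),  # Label=1: Feature=A, Feature=B
              nrow = 2, byrow = TRUE)
chisq.test(tab, correct = FALSE)           # statistic, df = 1, and p-value
chisq.test(tab, correct = FALSE)$expected  # the expected counts X1..X4
```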