Applied statistics
Usman Roshan
A few basic stats
• Expected value of a random variable
– example of Bernoulli and Binomial
• Variance of a random variable
– example of Bernoulli and Binomial
• Correlation coefficient (same as Pearson
correlation coefficient)
• Formulas:
– Covariance(X,Y) = E((X-μX)(Y-μY))
– Correlation(X,Y) = Covariance(X,Y) / (σX σY)
– Pearson correlation
Correlation between variables
• Measures the correlation between two
variables
• The correlation r is between -1 and 1. A value
of 1 means perfect positive correlation and -1
means perfect negative correlation
• The function f(r) has a t-distribution with n-2
df that can be used to obtain a p-value
f(r) = r √((n-2) / (1 - r²))
Pearson correlation coefficient
r = Σ_{i}(xi - x̄)(yi - ȳ) / (√(Σ_{i}(xi - x̄)²) √(Σ_{i}(yi - ȳ)²))   (from Wikipedia)
Basic stats in R
• Mean and variance calculation
– Define list and compute mean and variance
• Correlations
– Define two lists and compute correlation
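A minimal sketch of these computations in R (the data values are made up for illustration):

x <- c(1, 2, 3, 4, 5)       # define a list (vector) and compute its stats
mean(x)                     # sample mean
var(x)                      # sample variance
y <- c(2, 4, 5, 4, 7)       # a second vector
cor(x, y)                   # Pearson correlation coefficient r
cor.test(x, y)              # also reports the t-statistic f(r) with n-2 df and its p-value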
Statistical inference
• P-value
• Bayes rule
• Posterior probability and likelihood
• Bayesian decision theory
• Bayesian inference under Gaussian distribution
• Chi-square test, Pearson correlation coefficient, t-test
P-values
• What is a p-value?
– It is the probability of obtaining an estimate at least as extreme as yours, assuming the data come from some null distribution
– For example, if your estimate of the mean is 1 and the true mean is 0 and normally distributed, what is the p-value of your estimate?
– It is the area under the curve of the normal distribution for all values of the mean that are at least your estimate
• A small p-value means the data are unlikely under the null distribution, and thus the null hypothesis can be rejected.
• A large p-value supports the null distribution but may also support other distributions.
P-values from Gaussian distributions
P(x | C1) = (1 / (√(2π) σ1)) e^(-(x - μ1)² / (2σ1²))
[Figure: normal distribution curve, courtesy of Wikipedia]
P-values from chi-square distributions
[Figure: chi-square distribution curves, courtesy of Wikipedia]
Distributions in R
• Binomial distribution in R – dbinom, pbinom
• Gaussian (normal) distribution in R – pnorm
• Calculating p-values in R
– Suppose true mean is 0
– Your estimated mean is 1
– What is the p-value of your estimate?
• Example problem
– Suppose your estimate of the NJ mean age is 30 from a sample of 100
people, but the true mean and standard deviation are 25 and 20.
The central limit theorem says the sample mean is normally distributed,
with standard deviation 20/√100 = 2. We use this to determine the
p-value of our estimate.
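A sketch of these calculations in R (for the first example the standard deviation is not given on the slide, so unit standard deviation is assumed for illustration):

pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)                 # p-value of estimating mean 1 when the true mean is 0
pnorm(30, mean = 25, sd = 20 / sqrt(100), lower.tail = FALSE)  # NJ age example, ≈ 0.006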
Type 1 and type 2 errors
[Figure: table of type 1 and type 2 errors, courtesy of Wikipedia]
Bayes rule
• Fundamental to statistical inference
• Conditional probability
• Posterior = (Likelihood * Prior) / Normalization
P(M | x) = P(x | M) P(M) / P(x) = P(x | M) P(M) / Σ_M P(x | M) P(M)
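As a small numeric sketch in R (the priors and likelihoods here are made up): suppose model M1 is a fair coin and model M2 has P(H) = 0.85, each with prior probability 0.5, and we observe a single head:

prior <- c(0.5, 0.5)                      # P(M1), P(M2)
lik   <- c(0.5, 0.85)                     # P(x | M1), P(x | M2) for x = one head
post  <- lik * prior / sum(lik * prior)   # Bayes rule
post                                      # ≈ 0.37 for M1 and 0.63 for M2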
Hypothesis testing
• We can use Bayes rule to help make decisions
• An outcome or action is described by a model
• Given two models we pick the one with the
higher probability
• Coin toss example: use likelihood to
determine which coin generated the tosses
Likelihood example
• Consider a set of coin tosses produced by a coin with P(H)=p
(P(T)=1-p)
• We are given some tosses (training data): HTHHHTHHHTHTH.
• Was the above sequence produced by a fair coin?
– What is the probability that a fair coin produced the above sequence
of tosses?
– What is the p-value of your sequence of tosses assuming the coin is
fair? This is the same as asking: what is the probability that a fair coin
generates 9 or more heads out of 13 tosses? Let's start with exactly 9
and solve it with R (see the sketch after this list).
• Was the above sequence more likely to be produced by a
biased coin 1 (p=0.85) or a biased coin 2 (p=0.75)?
• Solution:
– Calculate the likelihood (probability) of the data with each coin
• Alternatively, we can ask: which coin maximizes the likelihood?
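A sketch of these calculations in R (the sequence above has 9 heads and 4 tails):

dbinom(9, size = 13, prob = 0.5)            # probability of exactly 9 heads from a fair coin
sum(dbinom(9:13, size = 13, prob = 0.5))    # probability of 9 or more heads (the p-value), ≈ 0.13
0.85^9 * 0.15^4                             # likelihood of the sequence under coin 1 (p = 0.85)
0.75^9 * 0.25^4                             # likelihood under coin 2 (p = 0.75), the higher of the two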
Maximum likelihood example
• Consider a set of coin tosses produced by a coin with
P(H)=p (P(T)=1-p)
• We want to determine the probability P(H)=p of a
coin that produces k heads and n-k tails
• We are given some tosses (training data):
HTHHHTHHHTHTH.
• Solution:
– Form the log likelihood
– Differentiate w.r.t. p
– Set the derivative to 0 and solve for p
Maximum likelihood example
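A sketch of the derivation for k heads in n tosses:

L(p) = p^k (1-p)^(n-k)
log L(p) = k log(p) + (n-k) log(1-p)
d/dp log L(p) = k/p - (n-k)/(1-p) = 0, which gives p = k/n

For the training data above (9 heads in 13 tosses) this gives p = 9/13 ≈ 0.69.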
Likelihood inference
• Assume data is generated by a Gaussian
distribution whose mean and variance are
unknown
P(x | C1) = (1 / (√(2π) σ1)) e^(-(x - μ1)² / (2σ1²))
P(x | C2) = (1 / (√(2π) σ2)) e^(-(x - μ2)² / (2σ2²))
Gaussian models
• Assume that class likelihood is represented by a Gaussian distribution with
parameters μ (mean) and σ (standard deviation)
P(x | C1) = (1 / (√(2π) σ1)) e^(-(x - μ1)² / (2σ1²))
P(x | C2) = (1 / (√(2π) σ2)) e^(-(x - μ2)² / (2σ2²))
• We find the model (in other words the mean and variance) that maximizes the
likelihood (or equivalently the log likelihood). Suppose we are given
training points x1, x2, …, xn1 from class C1. Assuming that each datapoint is
drawn independently from C1, the sample likelihood is
P(x1, x2, …, xn1 | C1) = P(x1 | C1) P(x2 | C1) … P(xn1 | C1) = (1 / (√(2π) σ1))^{n1} e^(-Σ_{i=1}^{n1} (xi - μ1)² / (2σ1²))
Gaussian models
• The log likelihood is given by
log(P(x1, x2, …, xn1 | C1)) = -(n1/2) log(2π) - n1 log(σ1) - Σ_{i=1}^{n1} (xi - μ1)² / (2σ1²)
• We set the first derivatives dL/dμ1 and dL/dσ1 to 0. This gives us the
maximum likelihood estimates of μ1 and σ1 (denoted as m1 and s1
respectively):
m1 = (1/n1) Σ_{i=1}^{n1} xi
s1² = (1/n1) Σ_{i=1}^{n1} (xi - m1)²
• Similarly we determine m2 and s2 for class C2.
Gaussian classification example
• Consider one dimensional data for two classes (SNP
genotypes for case and control subjects).
– Case (class C1): 1, 1, 2, 1, 0, 2
– Control (class C2): 0, 1, 0, 0, 1, 1
• Under the Gaussian assumption case and control classes are
represented by Gaussian distributions with parameters (μ1,
σ1) and (μ2, σ2) respectively. The maximum likelihood
estimates of means are
m1 = (Σ_{i=1}^{n1} xi) / n1 = (1 + 1 + 2 + 1 + 0 + 2) / 6 = 7/6
m2 = (0 + 1 + 0 + 0 + 1 + 1) / 6 = 3/6
Gaussian classification example
• The estimates of the class variances are
s1² = (1/n1) Σ_{i=1}^{n1} (xi - m1)² = ((1 - 7/6)² + (1 - 7/6)² + (2 - 7/6)² + (1 - 7/6)² + (0 - 7/6)² + (2 - 7/6)²) / 6 ≈ 0.47
• Similarly s2² = 0.25.
• Which class does x=1 belong to? What about x=0 and x=2? (See the sketch after this slide.)
log(P(x | C1)) = -(1/2) log(2π) - log(s1) - (x - m1)² / (2s1²)
log(P(x | C2)) = -(1/2) log(2π) - log(s2) - (x - m2)² / (2s2²)
• What happens if the class variances are equal?
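A sketch of the whole example in R (the MLE variance divides by n1, unlike R's var(), which divides by n1 - 1):

case    <- c(1, 1, 2, 1, 0, 2)                # class C1
control <- c(0, 1, 0, 0, 1, 1)                # class C2
m1 <- mean(case); m2 <- mean(control)
v1 <- mean((case - m1)^2)                     # MLE variance s1^2, ≈ 0.47
v2 <- mean((control - m2)^2)                  # s2^2 = 0.25
loglik <- function(x, m, v) -0.5 * log(2 * pi) - 0.5 * log(v) - (x - m)^2 / (2 * v)
for (x in c(0, 1, 2))                         # prints C2 for x=0, C1 for x=1 and x=2
  cat(x, "->", ifelse(loglik(x, m1, v1) > loglik(x, m2, v2), "C1", "C2"), "\n")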
Multivariate Gaussian classification
• Suppose each datapoint is a d-dimensional vector. In the
previous example we would have d SNP genotypes instead of
one. The class likelihood is given by
P(x | C1) = (1 / ((2π)^{d/2} |Σ1|^{1/2})) e^(-(1/2) (x - μ1)^T Σ1⁻¹ (x - μ1))
• where Σ1 is the class covariance matrix. Σ1 is of dimension d
x d. The (i,j)th entry of Σ1 is the covariance of the ith and jth
variable.
Multivariate Gaussian classification
• The maximum likelihood estimates of μ1 and Σ1 are
m1 = (1/n1) Σ_{i=1}^{n1} xi
S1 = (1/n1) Σ_{i=1}^{n1} (xi - m1)(xi - m1)^T
• The class log likelihoods with estimated parameters (ignoring constant
terms) are
log(P(x | C1)) = -(1/2) log(|S1|) - (1/2) (x - m1)^T S1⁻¹ (x - m1)
log(P(x | C2)) = -(1/2) log(|S2|) - (1/2) (x - m2)^T S2⁻¹ (x - m2)
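A sketch of these estimates in R (the datapoints are borrowed from the SNP example a few slides below):

X  <- rbind(c(1,2,0), c(2,2,0), c(2,2,0), c(2,1,1), c(0,2,1), c(2,1,0))  # rows are datapoints from one class
m1 <- colMeans(X)                        # mean vector m1
Xc <- sweep(X, 2, m1)                    # center each column
S1 <- t(Xc) %*% Xc / nrow(X)             # MLE covariance (divides by n, unlike R's cov())
x  <- c(1, 2, 1)                         # a point to score
drop(-0.5 * log(det(S1)) - 0.5 * t(x - m1) %*% solve(S1) %*% (x - m1))   # class log likelihood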
Naïve Bayes algorithm
• If we assume that variables are independent
(no interaction between SNPs) then the off-diagonal
terms of S are zero and the log
likelihood becomes (ignoring constant terms)
log(P(x | C1)) = -Σ_{j=1}^{m} log(s1j) - (1/2) Σ_{j=1}^{m} ((xj - m1j) / s1j)²
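A sketch of the naive Bayes log likelihood in R (same borrowed datapoints as above, with per-dimension means and standard deviations):

X  <- rbind(c(1,2,0), c(2,2,0), c(2,2,0), c(2,1,1), c(0,2,1), c(2,1,0))
m1 <- colMeans(X)                          # per-dimension means m1j
s1 <- sqrt(colMeans(sweep(X, 2, m1)^2))    # per-dimension MLE standard deviations s1j
x  <- c(1, 2, 1)
-sum(log(s1)) - 0.5 * sum(((x - m1) / s1)^2)   # log likelihood ignoring constants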
Multivariate Gaussian classification
• If S1 = S2 = S then the class log likelihoods with
estimated parameters (ignoring constant terms) are
log(P(x | C1)) = -(1/2) (x - m1)^T S⁻¹ (x - m1)
log(P(x | C2)) = -(1/2) (x - m2)^T S⁻¹ (x - m2)
• The decision depends only on the (Mahalanobis) distance to the class means.
Nearest means classifier
• If we assume all variances sj to be equal then
(ignoring constant terms) we get
log(P(x | C1)) = -(1/(2s²)) Σ_{j=1}^{m} (xj - m1j)²
Gaussian classification example
• Consider three SNP genotypes for case and control subjects.
– Case (class C1): (1,2,0), (2,2,0), (2,2,0), (2,1,1), (0,2,1), (2,1,0)
– Control (class C2): (0,1,2), (1,1,1), (1,0,2), (1,0,0), (0,0,2), (0,1,0)
• Classify (1,2,1) and (0,0,1) with the nearest means classifier
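A sketch of the nearest means classifier in R for this problem:

case    <- rbind(c(1,2,0), c(2,2,0), c(2,2,0), c(2,1,1), c(0,2,1), c(2,1,0))
control <- rbind(c(0,1,2), c(1,1,1), c(1,0,2), c(1,0,0), c(0,0,2), c(0,1,0))
m1 <- colMeans(case); m2 <- colMeans(control)
classify <- function(x)                       # pick the class with the nearer mean
  ifelse(sum((x - m1)^2) < sum((x - m2)^2), "C1", "C2")
classify(c(1, 2, 1))   # "C1"
classify(c(0, 0, 1))   # "C2"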
Chi-square test
• We have two random variables:
– Label (L): 0 or 1
– Feature (F): categorical (values A or B)
• Null hypothesis: the two variables are independent of each other (unrelated)
• Under independence
– P(L,F) = P(L)P(F)
– P(L=0) = (c1+c2)/n
– P(F=A) = (c1+c3)/n
• Expected values
– E(X1) = P(L=0)P(F=A)n
• Contingency table:

           Feature=A                  Feature=B
Label=0    Observed=c1, Expected=X1   Observed=c2, Expected=X2
Label=1    Observed=c3, Expected=X3   Observed=c4, Expected=X4

• We can calculate the chi-square statistic for a given feature and the probability that it is independent of the label (using the p-value):
χ² = Σ_i (ei - ci)² / ei
where ci is the observed count and ei the expected count in cell i.
• We look up the chi-square value in the distribution with degrees of freedom = (cols-1)*(rows-1) to get the p-value (see the sketch below).
• Features with very small probabilities deviate significantly from the independence assumption and are therefore considered important.
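A sketch in R with made-up counts for c1..c4:

tab <- matrix(c(20, 10,     # Label=0: Feature=A, Feature=B
                5,  25),    # Label=1: Feature=A, Feature=B
              nrow = 2, byrow = TRUE)
chisq.test(tab, correct = FALSE)   # chi-square statistic, df = (2-1)*(2-1) = 1, and p-value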