Lecture 3: Statistics Review I
Date: 9/3/02
Distributions
Likelihood
Hypothesis tests
Sources of Variation
 Definition: Sampling variation results
because we only sample a fraction of the full
population (e.g. the mapping population).
 Definition: There is often substantial
experimental error in the laboratory
procedures used to make measurements.
Sometimes this error is systematic.
Parameters vs. Estimates
 Definition: The population is the complete
collection of all individuals or things you
wish to make inferences about. Statistics
calculated on populations are parameters.
 Definition: The sample is a subset of the
population on which you make
measurements. Statistics calculated on
samples are estimates.
Types of Data
 Definition: Usually the data is discrete,
meaning it can take on one of countably
many different values.
 Definition: Many complex and economically
valuable traits are continuous. Such traits are
quantitative and the random variables
associated with them are continuous (QTL).
Random
We are concerned with the outcome of
random experiments.
 production of gametes
 union of gametes (fertilization)
 formation of chiasmata and recombination
events
Set Theory I
Set theory underlies probability.
 Definition: A set is a collection of objects.
 Definition: An element is an object in a set.
 Notation: $s \in S$ means “s is an element of S”
 Definition: If A and B are sets, then A is a
subset of B if and only if $s \in A$ implies $s \in B$.
 Notation: $A \subseteq B$ means “A is a subset of B”
Set Theory II
 Definition: Two sets A and B are equal if
and only if $A \subseteq B$ and $B \subseteq A$. We write A=B.
 Definition: The universal set is the superset
of all other sets, i.e. all other sets are
included within it. Often represented as $\Omega$.
 Definition: The empty set contains no
elements and is denoted as $\emptyset$.
Sample Space & Event
 Definition: The sample space for a random
experiment is the set $\Omega$ that includes all
possible outcomes of the experiment.
 Definition: An event is a set of possible
outcomes of the experiment. An event E is
said to happen if any one of the outcomes in
E occurs.
Example: Mendel I
 Mendel took inbred lines of smooth AA and
wrinkled BB peas and crossed them to make the F1
generation and again to make the F2 generation.
Smooth A is dominant to B.
 The random experiment is the random production
of gametes and fertilization to produce peas.
 The sample space of genotypes for F2 is AA, BB,
AB.
Random Variable
 Definition: A function from set S to set T is a
rule assigning to each $s \in S$ an element $t \in T$.
 Definition: Given a random experiment on
sample space $\Omega$, a function from $\Omega$ to T is a
random variable. We often write X, Y, or Z.
If we were very careful, we’d write X(s).
 Simply, X is a measurement of interest on the
outcome of a random experiment.
Example: Mendel II
 Let X be the number of A alleles in a
randomly chosen genotype. X is a random
variable.
 Sample space is $\Omega = \{0, 1, 2\}$.
Discrete Probability
Distribution
 Suppose X is a random variable with possible
outcomes $\{x_1, x_2, \ldots, x_m\}$. Define the
discrete probability distribution for random
variable X as
$p_X(x) = \begin{cases} P(X = x_i) & x = x_i \\ 0 & x \neq x_i \end{cases}$
with
$\sum_{i=1}^{m} p_X(x_i) = 1$
Example: Mendel III
$p_X(0) = 0.25$
$p_X(1) = 0.50$
$p_X(2) = 0.25$
$p_X(\text{otherwise}) = 0$
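The distribution of X (the number of A alleles) can be checked by enumerating the gametes. The following is a minimal sketch, assuming each F1 parent has genotype AB and passes on A or B with equal probability; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Each F1 parent (genotype AB) is assumed to pass on A or B with probability 1/2.
gametes = ["A", "B"]

# Count A alleles in each of the four equally likely gamete pairs.
counts = Counter((g1 + g2).count("A") for g1, g2 in product(gametes, repeat=2))

# Convert counts to probabilities; this is the discrete distribution p_X.
pmf = {x: Fraction(c, 4) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"p_X({x}) = {p}")
```

The probabilities sum to 1, as required of a discrete probability distribution.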
Cumulative Distribution
 The discrete cumulative distribution function
is defined as
$P(X \le x_i) = F(x_i) = \sum_{x \le x_i} P(X = x)$
 The continuous cumulative distribution
function is defined as
$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$
Continuous Probability
Distribution
 If
$F'(x) = \frac{dF(x)}{dx} = f(x)$
exists, then f(x) is the
continuous probability distribution. As in the
discrete case,
$\int_{-\infty}^{\infty} f(u)\,du = 1$
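Expectation and Variance
$E(X) = \begin{cases} \sum_{x_i} x_i\, p_X(x_i) & \text{for discrete random variable} \\ \int_{-\infty}^{\infty} u\, f(u)\,du & \text{for continuous random variable} \end{cases}$
$\mathrm{Var}(X) = \begin{cases} \sum_{x_i} (x_i - E(X))^2\, p_X(x_i) & \text{for discrete random variable} \\ \int_{-\infty}^{\infty} (u - E(X))^2 f(u)\,du & \text{for continuous random variable} \end{cases}$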
Moments and MGF
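 Definition: The rth moment of X is $E(X^r)$.
 Definition: The moment generating function
is defined as $E(e^{tX})$.
$\mathrm{mgf}(X) = E(e^{tX}) = \begin{cases} \sum_{x_i} e^{t x_i}\, p_X(x_i) & \text{for discrete random variable} \\ \int_{-\infty}^{\infty} e^{tu} f(u)\,du & \text{for continuous random variable} \end{cases}$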
Example: Mendel IV
 Define the random variable Z as follows:
$Z = \begin{cases} 0 & \text{if seed is smooth} \\ 1 & \text{if seed is wrinkled} \end{cases}$
 If we hypothesize that smooth dominates
wrinkled in a single-locus model, then the
corresponding probability model is given by:
Example: Mendel V
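$p_Z(z) = \begin{cases} P(Z = 0) = \frac{3}{4} \\ P(Z = 1) = \frac{1}{4} \end{cases}$
$E(Z) = \frac{3}{4} \cdot 0 + \frac{1}{4} \cdot 1 = \frac{1}{4}$
$\mathrm{Var}(Z) = \left(0 - \frac{1}{4}\right)^2 \frac{3}{4} + \left(1 - \frac{1}{4}\right)^2 \frac{1}{4} = \frac{3}{16}$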
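The expectation and variance of Z can be verified with exact fractions. This is a small sketch that applies the discrete definitions of E(Z) and Var(Z) directly to the 3/4 and 1/4 probability model; the names p_Z, mean_Z and var_Z are illustrative.

```python
from fractions import Fraction

# Probability model for Z under single-locus dominance: 0 = smooth, 1 = wrinkled.
p_Z = {0: Fraction(3, 4), 1: Fraction(1, 4)}

# E(Z) = sum of z * p_Z(z)
mean_Z = sum(z * p for z, p in p_Z.items())

# Var(Z) = sum of (z - E(Z))^2 * p_Z(z)
var_Z = sum((z - mean_Z) ** 2 * p for z, p in p_Z.items())

print("E(Z) =", mean_Z)
print("Var(Z) =", var_Z)
```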
Joint and Marginal Cumulative
Distributions
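 Definition: Let X and Y be two random
variables. Then the joint cumulative
distribution is $F(x, y) = P(X \le x, Y \le y)$
 Definition: The marginal cumulative
distribution is
$F_X(x) = F(x) = \begin{cases} \sum_{y} P(X \le x, Y = y) & \text{for discrete random variables} \\ \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\,dv\,du & \text{for continuous random variables} \end{cases}$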
Joint Distribution
 Definition: The joint distribution is
$p(x, y) = P(X = x, Y = y)$
 As before, the sum or integral over the
sample space equals 1.
Conditional Distribution
 Definition: The conditional distribution of X
given that Y=y is
$p(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{p(x, y)}{p(y)}$
 Lemma: If X and Y are independent, then
p(x|y)=p(x), p(y|x)=p(y), and p(x,y)=p(x)p(y).
Example: Mendel VI
P(homozygous | smooth seed) =
$P(X = 2 \mid Z = 0) = \frac{P(X = 2, Z = 0)}{P(Z = 0)} = \frac{1/4}{3/4} = \frac{1}{3}$
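The conditional probability can also be obtained by counting. The sketch below assumes the four F2 gamete pairs from an AB x AB cross are equally likely and that any genotype carrying A is smooth; the list names are illustrative.

```python
from fractions import Fraction
from itertools import product

# The four equally likely F2 gamete pairs from an AB x AB cross.
genotypes = [g1 + g2 for g1, g2 in product("AB", repeat=2)]

# A is dominant, so every genotype containing at least one A is smooth.
smooth = [g for g in genotypes if "A" in g]

# Among the smooth genotypes, keep the AA homozygotes.
smooth_and_AA = [g for g in smooth if g == "AA"]

# P(homozygous | smooth) = P(homozygous and smooth) / P(smooth)
p_cond = Fraction(len(smooth_and_AA), len(smooth))
print("P(homozygous | smooth) =", p_cond)
```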
Binomial Distribution
 Suppose there is a random experiment with
two possible outcomes, which we call
“success” and “failure”. Suppose there is a
constant probability p of success for each
experiment and that n experiments of this
type are independent. Let X be the random
variable that counts the total number of
successes. Then $X \sim \mathrm{Bin}(n, p)$.
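Properties of Binomial Distribution
$f(x; n, p) = P(X = x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$
$\mathrm{mgf}(X) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n - x} = (1 - p + p e^t)^n$
$E(X) = np$
$\mathrm{Var}(X) = np(1 - p)$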
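The mean and variance formulas can be checked numerically against the probability function. This is a minimal sketch using only the standard library; the values n = 20 and p = 0.25 are illustrative and not taken from the lecture.

```python
from math import comb, isclose

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 20, 0.25  # illustrative values only

pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the discrete definitions.
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print("sum of pmf:", sum(pmf))
print("mean:", mean, "vs np =", n * p)
print("variance:", var, "vs np(1-p) =", n * p * (1 - p))
```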
Examples: Binomial
Distribution
 recombinant fraction $\theta$ between two loci:
count the number of recombinant gametes in
n sampled.
 phenotype in Mendel’s F2 cross: count the
number of smooth peas in F2.
Multinomial Distribution
$M(n, p_1, p_2, \ldots, p_m)$
 Suppose you consider genotype in Mendel’s
F2 cross, or a 3-point cross.
 Definition: Suppose there are m possible
outcomes and the random variables $X_1, X_2, \ldots, X_m$
count the number of times each
outcome is observed. Then,
$P(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = \frac{n!}{x_1!\, x_2! \cdots x_m!}\, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}$
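As a worked instance of the multinomial probability, consider F2 genotypes with probabilities 1/4, 1/2, 1/4 for AA, AB, BB. The sketch below is illustrative: the helper name multinomial_pmf and the sample of 4 peas are assumptions, not part of the lecture.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_m = x_m) for n = sum(counts) independent trials."""
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_m!)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    return coef * prod(p**x for p, x in zip(probs, counts))

# F2 genotype probabilities for (AA, AB, BB) under the single-locus model.
probs = (0.25, 0.5, 0.25)

# Hypothetical sample of 4 peas: 1 AA, 2 AB, 1 BB.
value = multinomial_pmf((1, 2, 1), probs)
print(value)
```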
Poisson Distribution
 Consider the Binomial distribution when p is
small and n is large, but $np = \lambda$ is constant.
Then,
$\binom{n}{x} p^x (1 - p)^{n - x} \to \frac{e^{-\lambda} \lambda^x}{x!}$
 The distribution obtained is the Poisson
Distribution.
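The limit can be seen numerically by holding np fixed and letting n grow. This is a rough sketch; the value 2.0 for np and the range of x compared are arbitrary choices for illustration.

```python
from math import comb, exp, factorial

lam = 2.0  # illustrative value of n * p, held constant

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

# Largest absolute difference between the two distributions over x = 0..10.
gaps = []
for n in (10, 100, 1000):
    p = lam / n
    gap = max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))
    gaps.append(gap)
    print(f"n = {n:5d}, p = {p:.4f}, max difference = {gap:.6f}")
```

The difference shrinks as n increases, which is the sense in which the Binomial approaches the Poisson.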
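Properties of Poisson Distribution
$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}$
$\mathrm{mgf}(X) = \sum_{x=0}^{\infty} e^{tx}\, \frac{e^{-\lambda} \lambda^x}{x!} = e^{\lambda (e^t - 1)}$
$E(X) = \lambda$
$\mathrm{Var}(X) = \lambda$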
Normal Distribution
 Confidence intervals for recombinant
fraction can be estimated using the Normal
distribution.
$N(\mu, \sigma^2)$
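The lecture does not say which interval is meant. One common choice is the normal-approximation (Wald) interval for a binomial proportion, sketched below under that assumption. The counts (30 recombinants among 200 gametes) are hypothetical.

```python
from math import sqrt

# Hypothetical data: r recombinant gametes out of n sampled.
n, r = 200, 30

theta_hat = r / n                            # estimated recombinant fraction
se = sqrt(theta_hat * (1 - theta_hat) / n)   # binomial standard error

z = 1.96  # approximate 97.5th percentile of the standard normal
ci = (theta_hat - z * se, theta_hat + z * se)

print(f"theta_hat = {theta_hat:.3f}")
print(f"approximate 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

This approximation is usually considered reasonable only when n is large and the estimate is not close to 0.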
Properties of Normal
Distribution
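$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
$\mathrm{mgf}(X) = e^{\mu t + \frac{\sigma^2 t^2}{2}}$
$E(X) = \mu$
$\mathrm{Var}(X) = \sigma^2$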
Chi-Square Distribution
 Many hypothesis tests in statistical genetics
use the chi-square distribution.
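$f(x) = \frac{1}{\Gamma(k/2)} \left(\frac{1}{2}\right)^{k/2} x^{k/2 - 1} e^{-x/2}$
$\mathrm{mgf}(X) = \left(\frac{1}{1 - 2t}\right)^{k/2}$ for $t < 0.5$
$E(X) = k$
$\mathrm{Var}(X) = 2k$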
Likelihood I
 Likelihoods are used frequently with genetic
data because they handle the complexities of
genetic models well.
 Let $\theta$ be a parameter or vector of parameters
that affect the random variable X, e.g.
$\theta = (\mu, \sigma)$ for the normal distribution.
Likelihood II
Then, we can write a likelihood
$L(\theta) = L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \theta)$
where we have observed an independent sample of
size n, namely $x_1, x_2, \ldots, x_n$, and conditioned on the
parameter $\theta$.
Normally, $\theta$ is not known to us. To find the $\theta$ that
best fits the data, we maximize $L(\theta)$ over all $\theta$.
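Example: Likelihood of Binomial
$L(\theta) = L(n, p) = P(X = x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$
$l(\theta) = \log L(\theta) = \log \binom{n}{x} + x \log p + (n - x) \log(1 - p)$
$\frac{\partial l(\theta)}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p}$
$\hat{p} = \frac{x}{n}$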
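The closed-form estimate x/n can be compared with a direct numerical maximization of the binomial log likelihood. This is a simple grid-search sketch; the data (x = 12 successes in n = 50 trials) and the grid spacing are hypothetical choices.

```python
from math import comb, log

def log_lik(p: float, n: int, x: int) -> float:
    """Binomial log likelihood l(p) = log C(n,x) + x log p + (n-x) log(1-p)."""
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

n, x = 50, 12  # hypothetical data

# Evaluate the log likelihood on a grid over the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: log_lik(p, n, x))

print("grid maximizer:", p_hat_grid)
print("closed form x/n:", x / n)
```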
The Score
 Definition: The first derivative of the log
likelihood with respect to the parameter is
the score.
 For example, the score for the binomial
parameter p is
$\frac{\partial l(\theta)}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p}$
Information Content
 Definition: The information content is
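$I(\theta) = E\left[\left(\frac{\partial}{\partial \theta}\, l(\theta \mid x)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2}\, l(\theta \mid x)\right]$
 If evaluated at maximum likelihood estimate $\hat{\theta}$,
then it is called expected information.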
Hypothesis Testing
 Most experiments begin with a hypothesis.
This hypothesis must be converted into a
statistical hypothesis.
 Statistical hypotheses consist of null
hypothesis H0 and alternative hypothesis HA.
 Statistics are used to reject H0 and accept HA.
Sometimes we cannot reject H0 and accept it
instead.
Rejection Region I
 Definition: Given a cumulative probability
distribution function for the test statistic X,
F(X), the critical region for a hypothesis test
is the region of rejection, the area under the
probability distribution where the observed
test statistic X is unlikely to fall if H0 is true.
 The rejection region may or may not be
symmetric.
Rejection Region II
[Figure: distribution of the test statistic under H0, with rejection regions labeled 1-F(xc) and 1-F(xl) or 1-F(xu).]
Acceptance Region: region where H0 cannot be rejected.
One-Tailed vs. Two-Tailed
 Use a one-tailed test when the H0 is
unidirectional, e.g. H0: $\theta \ge 0.5$.
 Use a two-tailed test when the H0 is
bidirectional, e.g. H0: $\theta = 0.5$.
Critical Values
 Definition: Critical values are those values
corresponding to the cut-off point between
rejection and acceptance regions.
P-Value
 Definition: The p-value is the probability of
observing a sample outcome at least as extreme as
the one observed, assuming H0 is true.
$\text{p-value} = 1 - F(\hat{x})$
 Reject H0 when the p-value $\le \alpha$.
 The significance value of the test is $\alpha$.
Chi-Square Test: Goodness-of-Fit
$\chi^2 = \sum_{i=1}^{a} \frac{(o_i - e_i)^2}{e_i}$
 Calculate $e_i$ under H0.
 $\chi^2$ is distributed as Chi-Square with a-1
degrees of freedom. When expected values
depend on k unknown parameters, then df = a-1-k.
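As an illustration, the goodness-of-fit statistic can be computed for a 3:1 smooth to wrinkled hypothesis. The counts below are hypothetical. With a = 2 classes and no estimated parameters, df = 1, and for 1 degree of freedom the upper tail of the chi-square distribution equals erfc(sqrt(x/2)), which avoids needing a statistics library.

```python
from math import erfc, sqrt

# Hypothetical F2 counts: smooth, wrinkled.
observed = [290, 110]

# H0: single-locus dominance, so the class probabilities are 3/4 and 1/4.
probs = [0.75, 0.25]

n = sum(observed)
expected = [n * p for p in probs]

# Pearson chi-square statistic: sum of (o_i - e_i)^2 / e_i
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability; this shortcut is valid for 1 degree of freedom only.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-square = {chi2:.4f}, df = 1, p-value = {p_value:.4f}")
```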
Chi-Square Test: Test of
Independence
$\chi^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
 $e_{ij} = n\, p_{0i}\, p_{0j}$
 degrees of freedom = (a-1)(b-1)
 Example: test for linkage
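A sketch of the independence test on a 2 x 2 table of gamete counts follows. The counts are hypothetical, and the marginal probabilities are estimated from the row and column totals, which is an assumption about how the expected counts are obtained.

```python
# Hypothetical gamete counts: rows are alleles at locus 1, columns at locus 2.
table = [[45, 5],
         [7, 43]]

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Expected counts under independence: n * (row proportion) * (column proportion).
expected = [[n * (r / n) * (c / n) for c in col_tot] for r in row_tot]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(table, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(row_tot) - 1) * (len(col_tot) - 1)

print(f"chi-square = {chi2:.3f}, df = {df}")
```

A large statistic relative to the chi-square distribution with 1 degree of freedom suggests the two loci do not assort independently.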
Likelihood Ratio Test
$LR = \frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)}$
 $G = 2\log(LR)$
 $G \sim \chi^2$ with degrees of freedom equal to the
difference in number of parameters.
LR: goodness-of-fit &
independence test
 goodness-of-fit
$G = 2 \sum_{i=1}^{a} o_i \log \frac{o_i}{e_i}$
 independence test
$G = 2 \sum_{i=1}^{a} \sum_{j=1}^{b} o_{ij} \log \frac{o_{ij}}{e_{ij}}$
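The goodness-of-fit G statistic can be computed alongside the Pearson chi-square statistic for comparison. The sketch below uses hypothetical counts and a 3:1 hypothesis; natural logarithms are assumed.

```python
from math import log

# Hypothetical F2 counts: smooth, wrinkled, tested against a 3:1 ratio.
observed = [290, 110]
probs = [0.75, 0.25]

n = sum(observed)
expected = [n * p for p in probs]

# Likelihood ratio statistic: G = 2 * sum of o_i * log(o_i / e_i)
G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))

# Pearson statistic on the same data, for comparison.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"G = {G:.4f}, chi-square = {chi2:.4f}")
```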
Compare $\chi^2$ and Likelihood Ratio
 Both give similar results.
 LR is more powerful when there are
unknown parameters involved.
LOD Score
 LOD stands for log of odds.
 It is commonly denoted by Z.
$Z = \log_{10} \left[ \frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)} \right]$
 The interpretation is that HA is $10^Z$ times more
likely than H0. The p-values obtained by the LR
statistic for LOD score Z are approximately $10^{-Z}$.
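A LOD score for linkage can be sketched with a binomial likelihood for the recombinant fraction, taking 0.5 (no linkage) as the null value. The counts are hypothetical, and the binomial coefficient is omitted because it cancels in the likelihood ratio.

```python
from math import log10

# Hypothetical data: r recombinant gametes out of n sampled.
n, r = 100, 20

theta_hat = r / n   # maximum likelihood estimate of the recombinant fraction
theta_0 = 0.5       # null value: no linkage

def log10_lik(theta: float) -> float:
    """log10 of theta^r * (1 - theta)^(n - r); the constant C(n, r) cancels."""
    return r * log10(theta) + (n - r) * log10(1 - theta)

lod = log10_lik(theta_hat) - log10_lik(theta_0)
print(f"theta_hat = {theta_hat}, LOD = {lod:.3f}")
```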
Nonparametric Hypothesis
Testing
 What do you do when the test statistic does
not follow some standard probability
distribution?
 Use an empirical distribution. Assume H0
and resample (bootstrap or jackknife or
permutation) to generate:
$\text{empirical CDF}(X) = P(X \le x)$
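A permutation test is one way to build such an empirical distribution. The sketch below is illustrative only: the two groups of measurements are hypothetical, the test statistic is the difference in group means, and the number of permutations is an arbitrary choice.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical trait measurements for two genotype groups.
group_a = [5.1, 4.9, 6.2, 5.8, 6.0]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]

def mean(values):
    return sum(values) / len(values)

observed = mean(group_a) - mean(group_b)

# Under H0 the group labels are exchangeable, so shuffle them repeatedly.
pooled = group_a + group_b
n_perm = 10000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):])
    # Small tolerance guards against floating-point rounding in ties.
    if diff >= observed - 1e-12:
        count += 1

# One-sided empirical p-value: fraction of shuffles at least as extreme.
p_value = count / n_perm
print(f"observed difference = {observed:.3f}, empirical p-value = {p_value:.4f}")
```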