Lecture 9.
• Law of large numbers.
• Central limit theorem.
• Confidence interval.
• Hypothesis testing. Types of errors.
Some practice with the Normal distribution N(μ, σ)
(please open Lect9_ClassPractice)
1. Math aptitude scores are distributed as N(500, 100).
Find:
a. the probability that an individual score exceeds 600, and
b. the probability that an individual score exceeds 600 given that it
exceeds 500.
2. What mathematics SAT score can be expected to occur with
probability 0.90 or greater?
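The class practice file is a Mathematica notebook; as an alternative sketch, the same two problems can be checked with Python's standard-library `statistics.NormalDist` (the reading of question 2 as "the score below which 90% of results fall" is an assumption):

```python
from statistics import NormalDist

# N(500, 100): mean 500, standard deviation 100
scores = NormalDist(mu=500, sigma=100)

# 1a. P(score > 600)
p_above_600 = 1 - scores.cdf(600)

# 1b. P(score > 600 | score > 500) = P(score > 600) / P(score > 500)
p_conditional = p_above_600 / (1 - scores.cdf(500))

# 2. One reading of the question: the score below which 90% of results fall
q90 = scores.inv_cdf(0.90)

print(round(p_above_600, 4))    # ≈ 0.1587
print(round(p_conditional, 4))  # ≈ 0.3173
print(round(q90, 1))            # ≈ 628.2
```

Note that the conditional probability in 1b is just the unconditional one divided by P(score > 500) = 1/2, so it doubles.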
The (weak) law of large numbers
(The following result is also known as Bernoulli's theorem.)
Let X1, X2, ..., Xn be a sequence of independent and identically distributed
(iid) random variables, each having mean E[Xi] = μ and standard deviation σ.
Define a new variable, the "sample mean":

X̄_n = (X1 + X2 + ... + Xn)/n    (9.1)

One should keep in mind that X̄_n is also a random variable: it changes
randomly between different samples of the same size n, and those changes
are sensitive to the sample size n.
Example (Lect9_LargeNumbers.nb).
Using the properties of expectations it is easy to prove that

E[X̄_n] = (1/n)(E[X1] + E[X2] + ... + E[Xn]) = μ    (9.2)
Using the properties of variances (see Lect. 8) we can also prove that

Var(X̄_n) = Var((X1 + X2 + ... + Xn)/n)
         = (1/n²) Var(X1 + X2 + ... + Xn)
         = (1/n²) {Var(X1) + Var(X2) + ... + Var(Xn)}
         = σ²/n    (9.3)
Then, using Chebyshev's inequality (which we use without proof), one can find

P(|X̄_n − μ| ≥ ε) ≤ Var(X̄_n)/ε² = σ²/(ε² n) → 0 as n → ∞    (9.4)
where ε is any (but typically small) positive number. Let us discuss this
result. The left side contains the probability that the deviation between the
sample average X̄_n and the mean value μ of the individual random variable
exceeds ε. The right-hand side shows that this probability decreases as
1/n for large n.
Thus, one can say that "X̄_n converges to μ in probability".
(Weak) Law of Large Numbers.
Suppose X1, X2, ..., Xn are i.i.d. with finite E[Xi] = μ. Then, as n → ∞,
X̄_n converges to μ in probability.    (9.5)
(9.5) could be called the fundamental theorem of statistics because it says
that the sample mean is close to the mean μ of the underlying population when
the sample is large.
It implies that we do not have to measure the weights of all American
30-year-old males in order to find their average weight. We can use the
weights of 1000 individuals chosen at random.
Later on we will discuss how close the mean of a sample of this size will be
to the mean of the underlying population.
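The convergence described by the law of large numbers can be watched numerically. A minimal sketch in Python, standing in for the Lect9_LargeNumbers.nb Mathematica experiment (fair-die rolls, population mean μ = 3.5):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Average of n independent rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The sample mean is itself random, but its fluctuations around
# mu = 3.5 shrink as the sample size n grows.
for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```

Re-running without the fixed seed shows the point of (9.1)-(9.3): each run gives a different X̄_n, but the spread around 3.5 narrows like σ/√n.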
[Figure: Dice average]
[Figure: Coins average (relative to ½)]
[Figure: Coins average (2)]
Experiments with dice and coins show that the rate of convergence to the
average value is quite slow.
Using (9.4) we can understand why this is so.
Suppose that we flip a coin n = 1000 times and let Xi be 1 if the i-th toss
is Heads and 0 otherwise. The Xi are Bernoulli random variables with
p = P(Xi = 1) = 0.5, so E[Xi] = p = 0.5 and

Var(Xi) = (1 − p)²p + p²(1 − p) = p(1 − p)[(1 − p) + p] = p(1 − p) = 1/4.

Taking ε = 0.05 (which is 10% of the expected value), we find:

P(|X̄_n − 1/2| ≥ 0.05) ≤ (1/4)/((0.05)² · 10³) = 0.1

We can see that the probability of a comparatively large fluctuation (10% of
the mean value) is not too small. This explains the large irregularities that
we can find in the pictures.
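The Chebyshev bound above can be compared against a direct simulation; a sketch (assuming Python rather than the lecture's notebooks). Keep in mind that Chebyshev is only an upper bound, so the simulated frequency should come out well below 0.1:

```python
import random

random.seed(1)
n, eps, trials = 1000, 0.05, 2000

# Chebyshev bound from (9.4): p(1-p) / (eps^2 * n) with p = 1/2
bound = 0.25 / (eps**2 * n)

# Simulate 2000 experiments of n = 1000 coin tosses each and count how
# often the sample mean deviates from 1/2 by at least eps.
hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= eps
    for _ in range(trials)
)

print(bound)          # 0.1
print(hits / trials)  # far below the bound
```

The gap is no accident: Chebyshev uses only the variance, while the actual tail of X̄_n is nearly normal (the subject of the next section) and decays much faster.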
Central Limit Theorem.
The Law of Large Numbers indicated that the sample average converges in
probability to the population average. Now our concern is the probability
distributions. Suppose that we know the distribution of the i.i.d. Xi. We
wonder how the distribution of the sum of the Xi is related to the individual
distributions.
Reiteration: assuming we know pdf(Xi), what can we tell about pdf(X̄_n)
for large n?
Central Limit Theorem
(First formulation)
Suppose X1, X2, ..., Xn are i.i.d. with E[Xi] = μ and finite Var(Xi) = σ².
Then, as n → ∞,

P((X̄_n − μ)/(σ/√n) ≤ x) → P(Z ≤ x)    (9.5)

where Z denotes a random variable with the standard normal distribution.
Another formulation (which is essentially the same, but provides some
additional insight).
Let Sn = X1 + X2 + ... + Xn be the sum of n discrete i.i.d. variables with
E[Xi] = μ and Var(Xi) = σ². Then, for any x,

P((Sn − nμ)/(σ√n) ≤ x) → (1/√(2π)) ∫_{−∞}^{x} Exp[−z²/2] dz    (9.6) *

We already know from before that

P(X ≤ x) = ∫_{−∞}^{x} pdf(z) dz    (A) *

is the probability function for a continuous distribution
(see also "cumulative distribution function").
Let us now compare equations (9.6) and (A).
*Please notice that z in these equations is a "dummy" variable: its name can
be changed to any other letter without affecting the outcome of the
integration. In my experience this naming issue often causes confusion.
Now we are well equipped for the final step.

P((Sn − nμ)/(σ√n) ≤ x) → (1/√(2π)) ∫_{−∞}^{x} Exp[−z²/2] dz    (9.6)

P(X ≤ x) = Integrate[pdf[z], {z, -Infinity, x}]    (A)

Comparing (9.6) and (A) we discover that the distribution function of the
standardized random variable

(Sn − nμ)/(σ√n)

asymptotically approaches the standard normal distribution.
Comment: Sn can be a dimensional variable: height, weight, gas pressure, etc.
Try proving that Z = (Sn − nμ)/(σ√n) is always dimensionless.
Examples
Example 1: Suppose we flip a coin 900 times. What is the
probability of getting at least 465 heads?

P(a ≤ (Sn − nμ)/(σ√n)) ≈ (1/√(2π)) ∫_{a}^{∞} Exp[−x²/2] dx

Here μ = 0.5, nμ = 450, σ² = p(1 − p) = 0.25, σ = 1/2, and σ√n = 15.
(Check it!)
Sn ≥ 465 leads to

(Sn − nμ)/(σ√n) ≥ (465 − 450)/15 = 1;  P(Z ≥ 1) ≈ 0.159
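The arithmetic of Example 1 can be checked with a few lines of Python (a sketch using `statistics.NormalDist` for the standard normal tail):

```python
from statistics import NormalDist

# n = 900 fair coin tosses: n*mu = 450, sigma*sqrt(n) = 15
n, p = 900, 0.5
mu_sum = n * p                         # 450
sigma_sum = (n * p * (1 - p)) ** 0.5   # sqrt(225) = 15

# Standardize the threshold S_n >= 465 and use the normal approximation
z = (465 - mu_sum) / sigma_sum
p_approx = 1 - NormalDist().cdf(z)

print(z)                   # 1.0
print(round(p_approx, 3))  # 0.159
```

A more careful treatment would apply a continuity correction (using 464.5 instead of 465), which the lecture's rough estimate omits.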
As a matter of fact, the CLT indicates that such an assumption is not
completely unreasonable. According to the CLT, "the sums of independent
random variables tend to look normal no matter what crazy distribution
the individual variables have" (G&S, p. 362).
Question for practice (G&S, 9.3.7)
Extra credit
Choose independently 25 numbers from the interval [0,1] with the
probability density f(x) given below and compute their sum s25. Repeat
this experiment 100 times and make a bar graph of the result. How well
does the normal density fit your bar graph in each case?
(a) f(x) = 1
(b) f(x) = 3x²
(c) f(x) = 2 − 4|x − 1/2|
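A possible Python sketch of this experiment (minus the bar graphs). The sampling methods are my own choices: for (b) the CDF is x³, so U^(1/3) has density 3x²; the tent density (c) is the triangular distribution on [0,1] with mode 1/2, which the standard library provides directly:

```python
import random

random.seed(2)

# One draw from each of the three densities on [0, 1]
samplers = {
    "(a) f(x)=1":          lambda: random.random(),
    "(b) f(x)=3x^2":       lambda: random.random() ** (1 / 3),
    "(c) f(x)=2-4|x-1/2|": lambda: random.triangular(0, 1, 0.5),
}

# For each density: 100 experiments, each summing 25 independent draws
for name, draw in samplers.items():
    sums = [sum(draw() for _ in range(25)) for _ in range(100)]
    mean = sum(sums) / len(sums)
    print(name, round(mean, 2))  # near 25*E[X]: 12.5, 18.75, 12.5
```

Histogramming `sums` for each case should show roughly normal shapes centered at 25·E[X], illustrating the CLT regardless of the shape of f(x).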
CLT in statistics
Confidence Interval
In statistics we always deal with limited samples of a population.
Usually, the goal is to draw inferences about a population from a sample.
This approach is called "inferential statistics".
Suppose that we are interested in the mean number of words that can be
remembered by a high school student.
First, we have to select a random sample from the population.
Suppose that a group of 15 students is chosen.
We can use the sample mean as an unbiased estimate of μ, the population
mean value. However, given the small size of the sample, it will certainly
not be a perfect estimate. By chance it is bound to be at least a little
bit too high or a little bit too low.
For the estimate of μ to be of value, one must have some idea of how
precise it is. That is, how close to μ is the estimate likely to be?
The CLT helps answer this question. It tells us that the sample mean is a
random variable normally distributed near the population mean. This can
help us derive a probabilistic estimate of μ based on the sample mean.
An excellent way to specify the precision is to construct a confidence
interval.
4.1 Confidence interval, σ is known
Let's assume for simplicity that we have to estimate μ while σ is known
(this case is not very realistic but can give us an idea).
It can be shown (check it with Mathematica) that for the standard normal
distribution

P(−2 ≤ Z ≤ 2) ≈ 0.95 *

Noticing now that the standardized variable is

Z = (X̄_n − μ)/(σ/√n)

we have

P(−2σ/√n ≤ X̄_n − μ ≤ 2σ/√n) ≈ 0.95

*More precisely, the 95% range is (−1.96, 1.96), but using (−2, 2) is good
enough for all practical purposes.
In other words, with probability ~0.95, μ belongs to the interval

[X̄_n − 2σ/√n, X̄_n + 2σ/√n]

We say that this is a 95% confidence interval for μ.
Example 1.
Suppose that we have a random sample of size 100 from a population with
standard deviation 3, and we observe a sample mean X̄_100 = 66.32. What
can we tell about μ?
In this case

2σ/√n = 6/10 = 0.6

As a result,

μ = 66.32 ± 0.6
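The confidence-interval arithmetic of Example 1 is a one-liner; a sketch in Python:

```python
from math import sqrt

# Example 1: sample of size 100, population sigma = 3, sample mean 66.32
n, sigma, x_bar = 100, 3.0, 66.32

# ~95% confidence interval: x_bar +/- 2*sigma/sqrt(n)
half_width = 2 * sigma / sqrt(n)

print(half_width)                       # 0.6
print(round(x_bar - half_width, 2),
      round(x_bar + half_width, 2))     # 65.72 66.92
```

The same formula answers Example 2 below: requiring 2σ/√n ≤ 2 with σ = 12 and solving for n gives n ≥ 144.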
Example 2.
How large a sample must be selected from a normal distribution with
standard deviation 12 in order to estimate μ to within 2 units?

2σ/√n ≤ 2  ⟹  n ≥ 144.
In the previous example it was assumed that the variance is known. It is
much more common that neither the mean nor the variance is known when the
sampling is done. As a rough estimate, one can use the sample variance:

σ̂² = [(x1 − X̄_n)² + (x2 − X̄_n)² + ... + (xn − X̄_n)²]/n

This will lead to a normalized variable

Tn = (Sn − nμ)/(σ̂√n)

which is a function of two random variables.
It was shown by W.S. Gosset that the density function for Tn is not normal
but rather a t-density with n degrees of freedom. For large n the t-density
approaches the normal distribution. In general, the t-density is closely
related to the Chi-squared distribution

f(x) = e^{−x/2} x^{r/2 − 1} / (2^{r/2} Γ(r/2)),  x ≥ 0

This is called the Chi-squared distribution with r degrees of freedom.

Γ(r) = ∫_0^∞ e^{−x} x^{r−1} dx;  Γ(r) = (r − 1)! if r is a positive integer.
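The factorial identity for the Gamma function quoted above can be checked numerically with Python's standard-library `math.gamma` (a quick sketch, not part of the lecture):

```python
from math import gamma, factorial

# Gamma(r) = (r-1)! for positive integers r
for r in range(1, 7):
    assert abs(gamma(r) - factorial(r - 1)) < 1e-6

print([gamma(r) for r in range(1, 7)])  # [1.0, 1.0, 2.0, 6.0, 24.0, 120.0]
```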
This distribution describes the sum of squares of r independent standard
normal random variables.
It is very important for comparing experimental data with a theoretical
discrete distribution, to see whether the data support the theoretical
model. Please read Chapter 7.2 of G&S for details. Here we consider only
simple examples of hypothesis testing.
Hypothesis testing
We now consider two types of hypothesis testing: testing the mean, and
testing the difference between two means.
Example 1. Suppose we run a casino and we wonder if our roulette wheel is
biased. To rephrase our question in statistical terms, let p be the
probability that red comes up, and introduce two hypotheses:

H0: p = 18/38   (null hypothesis)
H1: p ≠ 18/38   (alternative hypothesis)

To test whether the null hypothesis is true, we spin the roulette wheel n
times and let Xi = 1 if red comes up on the i-th trial and 0 otherwise, so
that X̄_n (the observed mean value) is the fraction of times red came up
in the first n trials.
How do we decide if H0 is correct?
It is reasonable to suggest that large deviations of X̄_n from the "fair"
value 18/38 would indicate that H0 failed. But how do we decide which
deviations should be considered "large"?
The concept of the confidence interval can help answer this question. As
we know, with probability ~95%, X̄_n should belong to the confidence
interval

{μ − 2σ/√n, μ + 2σ/√n} = {18/38 − 2σ/√n, 18/38 + 2σ/√n}

Reminder: n is the sample size, i.e. the number of iid (independent and
identically distributed) variables added together; μ and σ are their
individual mean and standard deviation.
Rejecting H0 when it is true is called a type I error. Accepting H0 when
the alternative hypothesis is true is called a type II error. In this test
we have set the type I error to be 5%.
Thus, we can say that if X̄_n falls outside the chosen interval, then H0
can be rejected with possible error less than 5%.
Let us now find the individual standard deviation. The variable Xi takes
only two values, 0 and 1, with P(1) = 18/38 and P(0) = 20/38. Using this,
we can find

⟨Xi²⟩ = 1²·(18/38) + 0²·(20/38) = 18/38

(I use ⟨...⟩ notation for the mean.)
 2  Xi2    Xi 2 18  (18)2  18 20
38 38
38 38
18 20

~ 0.5
38 38
Thus, 2 σ ~ 1 and the test can now be formulated as:
reject H
18
1
if | X n  | 
0
38
n
or in terms of the total number of reds
Sn = S1 +S2+ … Sn,
reject H
18n
if | Sn 
|
0
38
n
32
Suppose that we spin the wheel 3800 times and get red 1869 times. Is the
wheel biased? The expected number of reds is (18/38)n = 1800. Does the
deviation from this value,

|Sn − 1800| = 69,

indicate that the wheel is biased? Given the large number of trials, this
deviation might not seem large. However, it must be compared with
√n = √3800 ≈ 61.6.
Given that 69 > 61.6 we have to reject H0, with the understanding that if
H0 were correct then we would see an observation this far from the mean
less than 5% of the time. Obviously, this does not prove the wheel is
biased, but it shows that this is very likely to be true.
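The roulette test reduces to a single comparison; a sketch of the computation in Python, following the rule derived above (with 2σ ≈ 1):

```python
from math import sqrt

# n = 3800 spins, 1869 reds observed; expected count is n*18/38 = 1800
n, reds = 3800, 1869
expected = n * 18 / 38

# Decision rule: reject H0 when |S_n - 18n/38| > sqrt(n)
deviation = abs(reds - expected)
threshold = sqrt(n)

print(deviation)              # 69.0
print(round(threshold, 1))    # 61.6
print(deviation > threshold)  # True -> reject H0 at the ~5% level
```

Changing `reds` to, say, 1850 flips the verdict, which shows how close to the boundary this particular observation sits.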
Example 2.
Do married college students with children do less well because they have
less time to study, or do they do better because they are more serious?
The average GPA for all students was 2.48. To answer the question, we
formulate two hypotheses:

H0: μ = 2.48   (null hypothesis)
H1: μ ≠ 2.48   (alternative hypothesis)
The records of 25 married college students with children indicate that
their GPA average was 2.36 with a standard deviation of 0.5.
Following the last example, we see that to draw a conclusion with a type I
error of 5% we should reject H0 if

|X̄_n − 2.48| > 2σ/√n = 2·0.5/√25 = 0.2

We can see that the observed deviation |2.36 − 2.48| = 0.12 < 0.2. This
means that we cannot reject the null hypothesis. In other words, we cannot
be 95% certain that married students with children do either worse or
better than students without children.
Testing the difference of two means
Suppose that we want to compare the test results for two independent
random samples of size n1 and n2 taken from two different populations. A
typical example would be testing a drug, with one sample taken from the
population taking the drug and another from the reference group.
Suppose now that the population means are unknown. For example, we do not
know what percentage of the population would be infected by a certain
disease if the drug is not taken.
Notice that in the previous case we were able to calculate the mean for
the null hypothesis. Now we cannot.
H0: μ1 = μ2   (null hypothesis)
H1: μ1 ≠ μ2   (alternative hypothesis)

From the CLT:

X̄_1 ~ N(μ1, σ1²/n1)
X̄_2 ~ N(μ2, σ2²/n2)

If H0 is correct, then

X̄_1 − X̄_2 ~ N(0, σ1²/n1 + σ2²/n2)
Based on the last result, if we want a test with a type I error of 5% then
we should reject H0 if

|X̄_1 − X̄_2| > 2 √(σ1²/n1 + σ2²/n2)
Example: a study of passive smoking reported in the NEJM.
The size of lung airways was measured for 200 female nonsmokers who were
in a smoky environment and for 200 who were not. For the first group the
average was 2.72 with σ = 0.71, while for the second group the
corresponding values were 3.17 and 0.74.
Based on these data we find

2 √(σ1²/n1 + σ2²/n2) = 2 √(0.71²/200 + 0.74²/200) ≈ 0.145

while |X̄_1 − X̄_2| = 0.45 noticeably exceeds 0.145.
This means that the conclusion is convincing (should be taken seriously).
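The two-means comparison above can be reproduced in a few lines of Python (a sketch of the arithmetic, using the sample standard deviations in place of the unknown population σ's, as the lecture does):

```python
from math import sqrt

# Passive-smoking study: two groups of 200 female nonsmokers
n1 = n2 = 200
x1, s1 = 2.72, 0.71   # smoky environment: mean airway size and std dev
x2, s2 = 3.17, 0.74   # smoke-free environment

# Reject H0 (equal means) when |X1_bar - X2_bar| > 2*sqrt(s1^2/n1 + s2^2/n2)
threshold = 2 * sqrt(s1**2 / n1 + s2**2 / n2)
diff = abs(x1 - x2)

print(round(threshold, 5))  # 0.14503
print(round(diff, 2))       # 0.45
print(diff > threshold)     # True -> the difference is significant
```

Since 0.45 is roughly three times the threshold, the observed difference corresponds to far more than two standard errors, which is why the conclusion is called convincing.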