Chapter 5
Sampling Distributions
Introduction
• Distribution of a Sample Statistic: The probability
distribution of a sample statistic obtained from a
random sample or a randomized experiment
– What values can a sample mean (or proportion)
take on and how likely are ranges of values?
• Population Distribution: Set of values for a
variable for a population of individuals.
Conceptually equivalent to a probability distribution
in the sense of selecting an individual at random and
observing their value of the variable of interest
Sampling Distributions for Counts and Proportions
• Binary outcomes: Each individual or realization can be
classified as a “Success” or “Failure”
(Presence/Absence of Characteristic of interest)
• Random Variable X is the count of the number of
successes in n “trials”
• Sample proportion: Proportion of successes in the
sample
• Population proportion: Proportion of successes in the
population
Sample proportion: $\hat{p} = \dfrac{X}{n}$
Population proportion: $p$
Binomial Distribution for Sample Counts
• Binomial “Experiment”
– Consists of n trials or observations
– Trials/observations are independent of one another
– Each trial/observation can end in one of two possible
outcomes often labelled “Success” and “Failure”
– The probability of success, p, is constant across
trials/observations
– Random variable, X, is the number of successes observed in
the n trials/observations.
• Binomial Distributions: Family of distributions for X,
indexed by Success probability (p) and number of
trials/observations (n). Notation: X~B(n,p)
Binomial Distributions and Sampling
• Problem when sampling (without replacement) from a finite
population: the probability of Success changes from one draw to
the next, depending on which individuals were observed earlier.
• When the population is much larger than the sample
(say at least 20 times as large), the effect is minimal and
we say X is approximately binomial
• Obtaining probabilities:
n k
P( X  k )    p (1  p) n k
k 
n
n!
  
k  0,1,, n
 k  k!(n  k )!
Table C gives probabilities for various n and p. Note that for p
> 0.5, look up 1-p and read the entry for n-k: P(X=k) under B(n,p)
equals P(X=n-k) under B(n,1-p)
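The formula above, and the claim that X is approximately binomial when the population is much larger than the sample, can be checked numerically. The sketch below is illustrative only: the function names and the example values (n = 10, 30% successes, population sizes 20, 200 and 2000) are assumptions, not part of the slides. It compares the binomial probabilities with the exact without-replacement (hypergeometric) probabilities.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def hypergeometric_pmf(k, n, successes, population):
    """P(X = k) when n individuals are drawn without replacement from a
    finite population that contains `successes` successes."""
    return comb(successes, k) * comb(population - successes, n - k) / comb(population, n)


def largest_gap(n, p, population):
    """Largest difference between the two sets of probabilities over k = 0..n."""
    successes = round(p * population)
    return max(
        abs(binomial_pmf(k, n, p) - hypergeometric_pmf(k, n, successes, population))
        for k in range(n + 1)
    )


# Hypothetical values: sample of n = 10, population with 30% successes.
n, p = 10, 0.3
for population in (20, 200, 2000):  # 2, 20 and 200 times the sample size
    print(f"N = {population:4d}: largest difference in P(X = k) = {largest_gap(n, p, population):.4f}")
```

The difference shrinks as the population grows relative to the sample, which is the sense in which the binomial model is "approximately" right.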
Example - Diagnostic Test
• Test claims to have a sensitivity of 90% (Among people
with condition, probability of testing positive is .90)
• 10 people who are known to have condition are
identified, X is the number that correctly test positive
10  k 10k
P( X  k )   (.9) (.1)
k
k
P(k)
0
1E-10
1
9E-09
10 
10!
  
k  0,1,,10
 k  k!(10  k )!
2
3
4
5
6
7
8
9
10
3.64E-07 8.75E-06 0.000138 0.001488 0.01116 0.057396 0.19371 0.38742 0.348678
• Compare with Table C, n=10, p=.10
• Table obtained in EXCEL with function: BINOMDIST(k,n,p,FALSE)
(TRUE option gives the cumulative distribution function: P(X≤k))
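For readers without a spreadsheet, the same probabilities can be reproduced in a few lines. This is a minimal sketch: the helper `binomdist` is a hypothetical stand-in written here to mirror the argument order of the spreadsheet call, not a library function.

```python
from math import comb


def binomdist(k, n, p, cumulative):
    """Hypothetical helper mirroring BINOMDIST(k, n, p, cumulative):
    P(X = k) when cumulative is False, P(X <= k) when it is True."""
    def pmf(j):
        return comb(n, j) * p**j * (1 - p) ** (n - j)

    if cumulative:
        return sum(pmf(j) for j in range(k + 1))
    return pmf(k)


n, p = 10, 0.9  # diagnostic test: 10 people, sensitivity 0.90
print(" k      P(X = k)     P(X <= k)")
for k in range(n + 1):
    print(f"{k:2d}  {binomdist(k, n, p, False):12.6g}  {binomdist(k, n, p, True):12.6g}")

# Table C style lookup for p > 0.5: k successes at p = 0.90 corresponds to
# n - k successes at p = 0.10.
print(binomdist(8, n, 0.9, False), binomdist(n - 8, n, 0.1, False))
```

The last line prints the same probability twice, which is why the slide suggests comparing against the Table C column for p = .10.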
Binomial Mean & Standard Deviation
• Let $S_i = 1$ if the ith individual was a success, 0 otherwise
• Then $P(S_i = 1) = p$ and $P(S_i = 0) = 1 - p$
• Then $E(S_i) = \mu_S = 1(p) + 0(1-p) = p$
• Note that $X = S_1 + \cdots + S_n$ and that trials are independent
• Then $E(X) = \mu_X = n\mu_S = np$
• $V(S_i) = E(S_i^2) - \mu_S^2 = p - p^2 = p(1-p)$
• Then $V(X) = \sigma_X^2 = np(1-p)$
$X \sim B(n, p): \quad E(X) = \mu_X = np, \qquad \sigma_X = \sqrt{np(1-p)}$
For the diagnostic test: $\mu = 10(0.9) = 9.0, \qquad \sigma = \sqrt{10(0.9)(0.1)} = 0.95$
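As a sanity check on these formulas, the mean and standard deviation can also be computed directly from the definition, by summing over the binomial probabilities. The sketch below does this for the diagnostic test; the function name is an assumption made for illustration.

```python
from math import comb, sqrt


def binomial_moments(n, p):
    """Return (mean, sd) of X ~ B(n, p), computed from the probabilities
    P(X = k) rather than from the shortcut formulas."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * prob for k, prob in enumerate(pmf))
    variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))
    return mean, sqrt(variance)


mean, sd = binomial_moments(10, 0.9)
print(f"from the probabilities: mean = {mean:.2f}, sd = {sd:.2f}")
print(f"from the formulas:      mean = {10 * 0.9:.2f}, sd = {sqrt(10 * 0.9 * 0.1):.2f}")
```

Both lines should agree with the values 9.0 and 0.95 quoted for the diagnostic test.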
Sample Proportions
• Counts of Successes (X) rarely reported due to
dependency on sample size (n)
• More common is to report the sample proportion of
successes:
$\hat{p} = \dfrac{\text{\# of successes in sample}}{\text{sample size}} = \dfrac{X}{n}$
$E(\hat{p}) = \mu_{\hat{p}} = p$
$V(\hat{p}) = \sigma_{\hat{p}}^2 = \dfrac{p(1-p)}{n}$
$\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
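The step from the count to the proportion is not spelled out on the slide. One way to fill it in, assuming only the binomial results E(X) = np and V(X) = np(1-p) and the standard rules E(aX) = aE(X) and V(aX) = a²V(X) with a = 1/n:

\[
E(\hat{p}) = E\!\left(\frac{X}{n}\right) = \frac{1}{n}\,E(X) = \frac{np}{n} = p,
\qquad
V(\hat{p}) = V\!\left(\frac{X}{n}\right) = \frac{1}{n^2}\,V(X) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}
\]

For the diagnostic test (n = 10, p = 0.9) this gives a standard deviation of the sample proportion of $\sqrt{(0.9)(0.1)/10} \approx 0.095$, which is the count's standard deviation 0.95 divided by n = 10.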
Sampling Distributions for Counts & Proportions
• For samples of size n, counts (and thus proportions)
can take on only n+1 distinct possible values (0, 1, …, n)
• As the sample size n gets large, so does the number of
possible values, and the sampling distribution begins to
approximate a normal distribution. Common rule of
thumb: np ≥ 10 and n(1-p) ≥ 10 to use the normal
approximation
$X \sim N\left(np,\ \sqrt{np(1-p)}\right)$ (approximately)
$\hat{p} \sim N\left(p,\ \sqrt{\dfrac{p(1-p)}{n}}\right)$ (approximately)
(the second argument of N is the standard deviation)
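To see how the rule of thumb behaves in practice, the exact binomial probability can be compared with its normal approximation. The sketch below is illustrative: the function names and the cutoff of 210 successes are assumptions, and no continuity correction is applied.

```python
from math import comb, sqrt
from statistics import NormalDist


def rule_of_thumb_ok(n, p):
    """Common rule of thumb: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


def exact_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def normal_cdf(k, n, p):
    """Approximate P(X <= k) using N(np, sqrt(np(1 - p))),
    without a continuity correction."""
    return NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(k)


# (n, p, k): a large-sample case, then the small diagnostic-test case.
for n, p, k in ((1000, 0.2, 210), (10, 0.9, 8)):
    print(
        f"n = {n}, p = {p}: rule of thumb met = {rule_of_thumb_ok(n, p)}, "
        f"exact P(X <= {k}) = {exact_cdf(k, n, p):.4f}, "
        f"normal approx = {normal_cdf(k, n, p):.4f}"
    )
```

When the rule of thumb fails, as in the second case where n(1-p) = 1, the two numbers can differ noticeably.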
Sampling Distribution for X~B(n=1000,p=0.2)
[Figure: Sampling Distribution of X (n=1000, p=0.2). Vertical axis: Probability (0 to 0.035); horizontal axis: # Successes.]
$\mu_X = np = 1000(.20) = 200, \qquad \sigma_X = \sqrt{np(1-p)} = \sqrt{1000(.2)(.8)} = 12.65$
Using Z-Table for Approximate Probabilities
• To find probabilities of certain ranges of counts or proportions,
can make use of fact that the sample counts and proportions are
approximately normally distributed for large sample sizes.
– Define range of interest
– Obtain mean of the sampling distribution
– Obtain standard deviation of sampling distribution
– Transform range of interest to range of Z-values
– Obtain (approximate) Probabilities from Z-table
^

Coin Tossing(He ads) : P p  0.51 | n  1000 tosses 


^
Range : p  0.51 Mean : p  0.50 SD :
(0.5)(0.5)
 .0158
1000
^
z
p m ^
p
s
^
p

0.51  0.50
 0.63
.0158
P ( Z  0.63)  1  P ( Z  0.63)  1  .7357  .2643
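The five steps can be sketched in code for the coin-tossing example. This is a minimal illustration, with a function name chosen here for convenience; the standard normal CDF from Python's `statistics.NormalDist` plays the role of the Z-table, and z is rounded to two decimals to mimic a printed table.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_for_proportion(p_hat, p, n):
    """Return (sd, z, P(Z >= z)) for the event 'sample proportion >= p_hat'."""
    sd = sqrt(p * (1 - p) / n)        # SD of the sampling distribution
    z = round((p_hat - p) / sd, 2)    # Z-value, rounded as for a printed table
    return sd, z, 1 - NormalDist().cdf(z)


sd, z, prob = upper_tail_for_proportion(p_hat=0.51, p=0.50, n=1000)
print(f"SD = {sd:.4f}, z = {z:.2f}, P(Z >= z) = {prob:.4f}")
```

Without the rounding of z the probability changes slightly in the third decimal place, which is the usual price of using a two-decimal table.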
Sampling Distribution of a Sample Mean
• Obtain a sample of n independent measurements of a
quantitative variable: X1,…,Xn from a population with
mean μ and standard deviation σ
– Averages will be less variable than the individual
measurements
– Sampling distributions of averages will become more like a
normal distribution as n increases (regardless of the shape of
the population of individual measurements)
$E(\bar{X}) = E\left(\dfrac{1}{n}\sum X_i\right) = \dfrac{1}{n}\, n\mu = \mu \quad\Rightarrow\quad \mu_{\bar{X}} = \mu$
$V(\bar{X}) = V\left(\dfrac{1}{n}\sum X_i\right) = \left(\dfrac{1}{n}\right)^2 n\sigma^2 = \dfrac{\sigma^2}{n} \quad\Rightarrow\quad \sigma_{\bar{X}}^2 = \dfrac{\sigma^2}{n}$
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
Central Limit Theorem
• When random samples of size n are selected from any
population with mean μ and finite standard deviation σ,
the sampling distribution of the sample mean will be
approximately normally distributed for large n:
$\bar{X} \sim N\left(\mu,\ \dfrac{\sigma}{\sqrt{n}}\right)$ approximately, for large n
Z-table can be used to approximate probabilities of ranges of
values for sample means, as well as percentiles of their sampling
distribution
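In place of a printed Z-table, the same probabilities and percentiles can be obtained in code. The sketch below assumes a hypothetical population (mean 100, standard deviation 15) and sample size (n = 36) chosen purely for illustration; none of these numbers come from the slides.

```python
from math import sqrt
from statistics import NormalDist


def sampling_dist_of_mean(mu, sigma, n):
    """Approximate sampling distribution of the sample mean for large n:
    normal with mean mu and standard deviation sigma / sqrt(n)."""
    return NormalDist(mu=mu, sigma=sigma / sqrt(n))


# Hypothetical population and sample size, for illustration only.
xbar = sampling_dist_of_mean(mu=100, sigma=15, n=36)
print(f"SD of the sample mean:              {xbar.stdev:.2f}")
print(f"P(sample mean > 103):               {1 - xbar.cdf(103):.4f}")
print(f"95th percentile of the sample mean: {xbar.inv_cdf(0.95):.2f}")
```

The probability uses the range-to-Z transformation implicitly: 103 is (103 - 100) / 2.5 = 1.2 standard deviations above the mean of the sampling distribution.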
Exponential Distribution
• Often used to model times: survival of components, to
complete tasks, between customer arrivals at a checkout
line, etc. Density is highly skewed:
[Figure: two density plots (y against x, for x from 0 to 5). Panel titles: "Individual Measurements (μ=1, σ=1)" and "Sample means of size 10 (μ=1, σ=1/√10=0.32)".]
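The two panels can be imitated by simulation. The sketch below draws from an exponential population with mean 1 and compares individual measurements with means of samples of size 10. The number of replicates and the seed are arbitrary choices made here, and the printed values are simulation estimates rather than exact results.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(n, replicates, seed=0):
    """Draw `replicates` samples of size n from an exponential population
    with mean 1 and return the list of their sample means."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(replicates)]


individual = simulate_sample_means(n=1, replicates=10_000)
means_of_10 = simulate_sample_means(n=10, replicates=10_000)

for label, values in (("individual measurements", individual),
                      ("sample means of size 10", means_of_10)):
    print(f"{label}: mean = {mean(values):.3f}, sd = {stdev(values):.3f}")
```

The simulated standard deviation of the means should land near 1/√10 ≈ 0.32, while the individual measurements keep a standard deviation near 1.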
Miscellaneous Topics
• Normal Approximation for sample counts and
proportions is example of CLT (X=S1+…+Sn)
• Any linear function of independent normal
random variables is normal (use rules on means
and variances to get parameters of distribution)
• Generalizations of CLT apply to cases where
random variables are correlated (to an extent) and
have different distributions (within reason)
– Variables made up of many small random influences will
tend to be approximately normal
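The second bullet, on linear functions of independent normal random variables, can be made concrete with the rules for means and variances. The sketch below is an assumption-laden illustration: the function name and the example distributions X ~ N(10, 3) and Y ~ N(4, 4), with the second argument read as a standard deviation, are hypothetical.

```python
from math import sqrt


def linear_combination(a, mu_x, sd_x, b, mu_y, sd_y, c=0.0):
    """Mean and SD of W = a*X + b*Y + c for independent normal X and Y.
    W is itself normal with these parameters."""
    mean_w = a * mu_x + b * mu_y + c                # means combine linearly
    sd_w = sqrt(a**2 * sd_x**2 + b**2 * sd_y**2)    # variances add, scaled by a^2 and b^2
    return mean_w, sd_w


# Hypothetical example: X ~ N(10, 3), Y ~ N(4, 4), W = X - Y.
mean_w, sd_w = linear_combination(1, 10, 3, -1, 4, 4)
print(f"W = X - Y is normal with mean {mean_w} and sd {sd_w}")
```

Note that the variances add even though W is a difference: the coefficient -1 is squared, so the standard deviation of W is √(9 + 16) = 5.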