SAMPLING DISTRIBUTION OF MEANS & PROPORTIONS
SAMPLING AND SAMPLING VARIATION
 Sample – e.g., the knowledge of students, the number of red blood cells in a person, the lifetime of electric bulbs
 Population – e.g., a population census covers the whole population
 If we repeat the same study under exactly similar conditions, we will not necessarily get identical results.
 Example: In a clinical trial of 200 patients
we find that the efficacy of a particular drug
is 75%.
If we repeat the study using the same drug
in another group of 200 similar patients, we
will not get the same efficacy of 75%. It
could be 78% or 71%.
“Different results from different trials,
though all of them are conducted under the
same conditions.”
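This trial-to-trial variation can be simulated. The sketch below (a minimal illustration; the seed, trial count, and the assumption that each patient is cured independently with probability 0.75 are mine, not from the slides) runs several simulated trials of 200 patients and prints the observed efficacy of each:

```python
import random

random.seed(0)

TRUE_EFFICACY = 0.75  # assumed true cure probability
N_PATIENTS = 200

def run_trial():
    """Simulate one trial of 200 patients; return the observed efficacy."""
    cured = sum(1 for _ in range(N_PATIENTS) if random.random() < TRUE_EFFICACY)
    return cured / N_PATIENTS

# Five repetitions of the "same" trial give five different efficacies.
results = [run_trial() for _ in range(5)]
print([f"{r:.0%}" for r in results])
```

Each run hovers around 75% but rarely hits it exactly, which is the sampling variation the slide describes.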
Example: If two drugs have the same efficacy,
then the difference between the cure rates of
these two drugs should be zero.
But in practice we may not get a difference of
zero.
If we find the difference is small, say 2%, 3%,
or 5%, we may accept the hypothesis that the
two drugs are equally effective.
On the other hand, if we find the difference to
be large, say 25%, we would conclude that the
drugs are not equally efficacious.
Example: Suppose we are testing the claim of a
pharmaceutical company that the
efficacy of a particular drug is 80%.
We may accept the company’s claim if we
observe the efficacy in the trial to be 78%,
81%, 83%, or 77%.
But if the efficacy in the trial happens to be 50%,
we would have good cause to feel that the true
efficacy cannot be 80%, because the chance of
such a result would be very low if the claim were
true. We would then tend to dismiss the claim
that the efficacy of the drug is 80%.
 THEREFORE
“WHILE TAKING DECISIONS BASED
ON EXPERIMENTAL DATA WE MUST
GIVE SOME ALLOWANCE FOR
SAMPLING VARIATION “.
“VARIATION BETWEEN ONE SAMPLE
AND ANOTHER SAMPLE IS KNOWN AS
SAMPLING VARIATION”.
 Inference – extension of results obtained from an
experiment (sample) to the general population
 use of sample data to draw conclusions about entire
population
 Parameter – number that describes a population
 Value is not usually known
 We are unable to examine population
 Statistic – number computed from sample data
 Computed to estimate unknown parameters
 E.g., mean, standard deviation, proportion
Notations
μ = population mean
x̄ = sample mean
SAMPLING
DISTRIBUTION
The sampling distribution
is the distribution of all
possible sample means
that could be drawn
from the population.
SAMPLING DISTRIBUTIONS
What would happen if we took many samples of
10 subjects from the population?

Steps:
1. Take a large number of samples of size 10 from the population
2. Calculate the sample mean for each sample
3. Make a histogram of the mean values
4. Examine the distribution displayed in the histogram for shape,
center, and spread, as well as outliers and other deviations
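The steps above can be sketched in a few lines of Python. The population here is hypothetical (the values, sizes, and seed are illustrative assumptions, not from the slides); the histogram step is replaced by numeric summaries of center and spread:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 10,000 scores with mean ~70, sd ~10.
population = [random.gauss(70, 10) for _ in range(10_000)]

# Steps 1-2: take many samples of size 10, record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 10))
    for _ in range(1_000)
]

# Steps 3-4: a histogram would show the shape; here we inspect
# the center and spread of the sampling distribution numerically.
print("population mean:", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("sd of sample means:", round(statistics.stdev(sample_means), 2))
```

The mean of the sample means lands close to the population mean, and their spread is much smaller than the spread of individuals, as the later slides explain.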
 How can experimental results be trusted? If x̄ is
rarely exactly right and varies from sample to
sample, how can it be a reasonable estimate of the
population mean μ?
 How can we describe the behavior of the statistics
from different samples, e.g. the mean value?
Very rarely do sample values coincide
with the population value (parameter).
The discrepancy between the sample value
and the parameter is known as sampling
error, when this discrepancy is the result
of random sampling.
Fortunately, these errors behave
systematically and have a characteristic
distribution.
A sample of 3 students from a class –
a population of 6 students – and we measure
each student’s GPA:

Student   GPA
Susan     2.1
Karen     2.6
Bill      2.3
Calvin    1.2
Rose      3.0
David     2.4
Draw each possible sample from this
‘population’ of six students.
With samples of n = 3
from this population of
N = 6 there are 20
different sample
possibilities:
C(N, n) = N! / (n!(N − n)!) = 6! / (3! · 3!) = 720 / 36 = 20
Note that every different sample
would produce a different mean
and s.d.
ONE SAMPLE = (Susan + Karen + Bill) / 3
= (2.1 + 2.6 + 2.3) / 3
x̄ = 7.0 / 3 ≈ 2.33, rounded here to 2.3
Standard deviation (dividing by n, using the rounded mean 2.3):
(2.1 − 2.3)² = (−0.2)² = 0.04
(2.6 − 2.3)² = (0.3)² = 0.09
(2.3 − 2.3)² = 0² = 0
s² = 0.13 / 3 ≈ 0.043 and s = √0.043 ≈ 0.21
So this one sample of 3 has a mean of about 2.3 and an s.d. of about 0.21
What about other
samples?
 A SECOND SAMPLE
= (Susan + Karen + Calvin) / 3
= (2.1 + 2.6 + 1.2) / 3
x̄ ≈ 1.97, SD ≈ 0.58
 20th SAMPLE
= (Karen + Rose + David) / 3
= (2.6 + 3.0 + 2.4) / 3
x̄ ≈ 2.67, SD ≈ 0.25
 Assume the true mean of the
population is known; in this simple
case of 6 people it can be
calculated as 13.6/6 = μ = 2.27.
 The mean of the sampling
distribution (i.e., the mean of all 20
sample means) is likewise 2.27 – it
equals the population mean.
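The full enumeration is small enough to check directly. This sketch lists all 20 possible samples of 3 students and confirms that the average of the 20 sample means matches the population mean of 2.27 (any discrepancy in hand calculations comes from rounding the individual sample means):

```python
from itertools import combinations
from statistics import mean

gpas = {"Susan": 2.1, "Karen": 2.6, "Bill": 2.3,
        "Calvin": 1.2, "Rose": 3.0, "David": 2.4}

# All possible samples of n = 3 from the N = 6 students.
samples = list(combinations(gpas.values(), 3))
print(len(samples))  # C(6, 3) = 20

sample_means = [mean(s) for s in samples]
print(round(mean(gpas.values()), 2))   # population mean: 2.27
print(round(mean(sample_means), 2))    # mean of the 20 sample means: 2.27
```

The exact agreement is not a coincidence: the mean of the sampling distribution is an unbiased estimator of the population mean, as a later slide states.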
 The sample mean is a random variable.
 If the sample was randomly drawn, then any
difference between the obtained sample mean and
the true population mean is due to sampling error.
 Any difference between x̄ and μ is due to the fact
that different people show up in different samples.
 If x̄ is not equal to μ, the difference is due to
sampling error.
 “Sampling error” is normal; it is the
to-be-expected variability of samples.
What is a Sampling
Distribution?
 A distribution made up of every
conceivable sample drawn from a
population.
 A sampling distribution is almost always
a hypothetical distribution because
typically you do not have and cannot
calculate every conceivable sample
mean.
 The mean of the sampling distribution is
an unbiased estimator of the population
mean, with a computable standard error.
LAW OF LARGE NUMBERS
If we keep taking larger and larger samples, the
statistic is guaranteed to get closer and closer
to the parameter value.
[Figure: simulated sampling distributions for sample sizes N = 1, 2, 10, and 25]
ILLUSTRATION OF
SAMPLING
DISTRIBUTIONS
Draw 500 different SRSs.
What happens to the shape of the sampling
distribution as the size of the sample increases?
[Figures: histograms of 500 sample means each for n = 2, 4, 6, 10, and 20]
Key Observations
 As the sample size increases the mean
of the sampling distribution comes to
more closely approximate the true
population mean, here known to be 
= 3.5
 AND – this is critical – the standard error, that
is, the standard deviation of the sampling
distribution, gets systematically
narrower.
Three main points about
sampling distributions
 Probabilistically, as the sample size gets bigger
the sampling distribution better approximates a
normal distribution.
 The mean of the sampling distribution will
more closely estimate the population
parameter as the sample size increases.
 The standard error (SE) gets narrower and
narrower as the sample size increases. Thus,
we will be able to make more precise estimates
of the whereabouts of the unknown population
mean.
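The third point is easy to see numerically. Using the SE formula from the later slides, SE = s / √n, this sketch (the standard deviation of 10 is an illustrative assumption) shows the standard error shrinking as n grows:

```python
import math

s = 10.0  # assumed sample standard deviation (illustrative)
ses = []
for n in (4, 25, 100, 400):
    se = s / math.sqrt(n)  # SE = s / sqrt(n)
    ses.append(se)
    print(f"n = {n:>3}: SE = {se:.2f}")
```

Quadrupling the sample size halves the standard error, since n sits under a square root.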
THE MEAN OF THE
SAMPLING DISTRIBUTION
μx̄ = μ
The sampling distribution of the mean is made up of the
means of all possible SRSs of the same size as your sample.
Its mean (μx̄) equals the mean of the population from which
the samples were drawn.
The distribution of sample means will be normally
distributed, centered at the population mean  with a
standard deviation of the sampling distribution, called the
standard error (SE).
 Don’t get confused by the terms
STANDARD DEVIATION
and
STANDARD ERROR
Quantifying Uncertainty
 Standard deviation: measures the
variation of a variable in the sample.
 Technically,
s = √[ Σ(xᵢ − x̄)² / (N − 1) ], summing over i = 1, …, N
 Standard error of the mean is calculated by:
sx̄ = sem = s / √n
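Both formulas can be checked on the six GPAs from the earlier example (reusing those values here is my choice; the slides do not connect the two):

```python
import math

data = [2.1, 2.6, 2.3, 1.2, 3.0, 2.4]  # the six GPAs from the example
n = len(data)
xbar = sum(data) / n

# Standard deviation with the N - 1 denominator, as in the formula above.
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# Standard error of the mean: sem = s / sqrt(n).
sem = s / math.sqrt(n)

print(round(s, 3), round(sem, 3))
```

The sem is smaller than s by the factor √n, which is the next slide's point about precision.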
Standard deviation versus
standard error
 The standard deviation (s) describes
variability between individuals in a
sample.
 The standard error describes variation of a
sample statistic.
 The standard deviation describes how
individuals differ.
 The standard error of the mean
describes the precision with which we
can make inferences about the true
mean.
Standard error of the
mean
 Standard error of the mean (sem):
sx̄ = sem = s / √n
 Comments:
 n = sample size
 even for large s, if n is large, we can get
good precision for sem
 always smaller than standard deviation (s)
Proportions
A proportion or percentage is a mean: it is a
mean of a variable that takes on the values 0
and 1. The event of interest is coded 1.
The CLT then applies to proportions as it
does to means. For a 0/1 variable, the
population is necessarily not normally
distributed, but the CLT says that for a
proportion calculated from a large sample
the sampling distribution will be normally
distributed.
Notation
p = population proportion
p̂ = sample proportion
n = sample size
The CLT suggests:
μp̂ = mean of the sampling distribution of the proportion p̂
σp̂ = standard deviation of the sampling distribution of the proportion p̂
For a 0/1 variable, the standard
deviation simplifies to a simple function
of the proportion of ones in the population:
σ = √( p(1 − p) )
The standard deviation of the sampling
distribution then simplifies as follows:
σp̂ = σ / √n = √( p(1 − p) ) / √n = √( p(1 − p) / n )
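Plugging in the drug-efficacy example from earlier (using p = 0.75 and n = 200 here is my choice of illustration), the formula gives:

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# Drug-efficacy example: true proportion 0.75, trial of 200 patients.
se = se_proportion(0.75, 200)
print(round(se, 4))
```

An SE of about 0.03 means observed efficacies of 78% or 71% in repeated trials of 200 patients are unsurprising, matching the earlier example.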
Normality of Sampling
Distributions
In small samples, the sampling distribution
of a proportion will not be normally shaped,
because the population of a 0/1 variable is
not normal.
Rule of thumb: the sampling distribution is
close enough to normal to use the normal
table if
np ≥ 10 and n(1 − p) ≥ 10
Otherwise, we cannot do the problem with
the normal table.
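The rule of thumb is a one-line check. The example inputs below are my own illustrations:

```python
def normal_ok(n, p):
    """Rule of thumb: the normal approximation to the sampling
    distribution of a proportion is reasonable if both
    np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(normal_ok(200, 0.75))  # np = 150, n(1-p) = 50: both >= 10
print(normal_ok(20, 0.05))   # np = 1: too few expected "successes"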
SLOGAN TO REMEMBER
 Sample Mean
+ Sampling Error
= The Population Mean
 Some Sample Characteristic
+ Sampling Error
= The Population
Characteristic
Two Steps in Statistical
Inferencing Process
1. Calculation of “confidence intervals” from
the sample mean and sample standard
deviation, within which we can place the
unknown population mean with some stated
degree of confidence.
2. Computation of “tests of statistical significance”
(risk statements), which assess how
probable a sample result as extreme as the one
observed would be if a hypothesized value of
the population mean were true.
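Step 1 can be sketched on the six GPAs from the earlier example (reusing those values, and the normal critical value 1.96, are my assumptions; for a sample this small a t critical value would normally be preferred):

```python
import math
from statistics import mean, stdev

data = [2.1, 2.6, 2.3, 1.2, 3.0, 2.4]  # the six GPAs from the example
n = len(data)
xbar = mean(data)
sem = stdev(data) / math.sqrt(n)  # sem = s / sqrt(n)

# Approximate 95% confidence interval: x-bar +/- 1.96 * sem.
lower = xbar - 1.96 * sem
upper = xbar + 1.96 * sem
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval brackets the sample mean and, if the sampling was random, is constructed so that intervals built this way capture the true population mean about 95% of the time.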