Statistics
Sampling Theory
Henry Mesa
We hear statistical results on the news
constantly:
“Bifar has been clinically shown to work better than Pliaff at removing warts,”
“The chance of dying from heart disease if you smoke is
blank times higher.”
But how do we know that these results are
correct? How are these results obtained?
First, we can never be 100% certain of anything. But, we would like to be
as close to 100% as possible.
I want to demonstrate with a simple example what this theory is about.
My friend has a die that he claims is fair. I am, however, skeptical, but since I have
no proof I will assume it is fair and continue to look for evidence that it may not
be fair.
Now the game begins. I keep track of how many 3’s show up (I could keep track of
other values as well, but I want to keep this as simple as possible).
Let’s say that out of 100 throws I witness twenty-five 3’s appearing. Should I be
alarmed? Is this evidence that the die is not fair? Is 25 appearances an unusual
number?
The game continues and on the next 100 throws twenty-nine 3’s appear. Again,
the same question is asked. Is this evidence that the die is not fair? Is 29
appearances an unusual number?
What we need to understand is what the distribution of counts of 3’s is for a
sample of 100 throws. It turns out that the mathematical model is well
understood; this scenario is characterized by a binomial distribution (do not
worry if you do not know how to construct a binomial distribution).
To the left is the binomial distribution describing this event (throwing a die 100
times and counting the number of 3’s); this is one event. To create this graph I
had to assume the die was fair. The x-axis counts the number of threes that
appear out of 100 throws.
I can see from the graph that the chance of witnessing twenty-five threes out of
100 throws is unusual, and twenty-nine is extremely unusual if the die is fair.
I now realize that I am witnessing very unusual results if the die is fair; after
all, if the die is fair I was expecting (1/6)(100), about 16 or 17 threes, to
appear. Twenty-five and twenty-nine are very unusual according to the model.
The model
This is what we are going to
concentrate on. The model
allowed us to interpret the results
from our data. Without models it
is very difficult to say anything
meaningful about statistical results.
But we have already worked with models. The normal density curve is an example
of a model, as is the uniform distribution. All the idealized functions we use in
statistics are examples of models. What we are going to do now is come up with a
model for the summaries, the sample mean and the sample proportion, that arise from the
action of sampling from some defined population.
I will go back to the die example and finish that off, but let me say one more
thing about models to convince you of the importance of the idea of models.
In this example I will make some assumptions. You are an American with no
knowledge of the indigenous people of Brazil.
You are all of a sudden dropped off in the middle of the Amazon.
A man approaches you and says that he is to marry a woman from the other village,
whom he has never met. He tells you (just as in 1960s outer-space sci-fi movies,
everyone knows English), “I mainly want her to be beautiful. I cannot visit or see her
until after we are married. Can you please go to the next village and see her for me?”
So, you trek over to the other village and you meet the
woman.
Now you come to realize something very important. What
does beautiful mean to this man? What does beautiful
mean in this society?
You know, for example, that “Hollywood beauty” would mean nothing in this society;
that is, the model of what beautiful is in “Hollywood” would not apply here.
Thus, you have data, what the person in front of you
looks like, but no way to gauge what that means.
Your only hope is to visit the man’s village, start
pointing to women in his village he finds attractive,
and hope to reconstruct a good enough model of
“beautiful” so that you can make a decision.
This idea of the model is key to statistics. Without it you are just floating and
guessing. You may have lots of data, but no means to interpret the results.
So now we are back to our die problem and the concept of a mathematical model.
Equations come in families and types. For example, the equations y = 5x - 1,
y = -3x, and y = 0.5x + 9 are, I hope you recognize, all linear.
A linear equation is of the form y = mx + b, where x and y are variables and m and b
are constants, as you learned in algebra. Now, a specific line is determined by the
values m and b, the slope and the y-coordinate of the y-intercept. The values m and b
are called parameters for linear equations. Those values determine specific lines.
For the die problem, our model was the binomial distribution; a binomial
distribution is an example of a model. What are the parameters for the binomial
distribution? It has two: the sample size n and p, the probability of the event in
question occurring.
For the die problem, n = 100 and p = 1/6.
Change the parameters, and you change the distribution.
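To make this concrete, here is a minimal sketch (my own illustration, not part of the lecture) that uses Python with scipy.stats to evaluate the binomial model with n = 100 and p = 1/6 and to ask how unusual 25 or 29 threes would be if the die really were fair.

```python
# Sketch: how unusual are 25 (or 29) threes in 100 throws of a fair die?
# Assumes Python with scipy installed; the lecture itself uses no software.
from scipy.stats import binom

n, p = 100, 1/6               # parameters of the binomial model for a fair die
expected = n * p              # expected count of threes, about 16.7

p_at_least_25 = binom.sf(24, n, p)   # P(X >= 25)
p_at_least_29 = binom.sf(28, n, p)   # P(X >= 29)

print(f"Expected number of threes: {expected:.1f}")
print(f"P(X >= 25) = {p_at_least_25:.4f}")
print(f"P(X >= 29) = {p_at_least_29:.4f}")
```

The probabilities printed here are small, consistent with the slide’s reading that 25 threes is unusual and 29 is extremely unusual under the fair-die model.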
Ok, I see that models are important, and that mathematical models
require parameters to make them specific. What does that have to do
with sampling? Statistics?
The die example was unique in that I could assume the die was fair and
then by making that assumption I knew p = 1/6. And the sample size
was determined by the number of observations.
Often, we do not know the value of the parameter, but we wish to
estimate it. Let’s assume that a die is altered so that it is “loaded.” The
probabilities associated with a fair die do not apply. How could I
discover the probabilities associated with the outcomes of
that die? The only thing we can do is toss the die and see what
numbers show up, and how often. But now we are back to the
“beautiful” woman example. If I toss the die 100 times I get one set of
numbers. If I toss the die 100 times again, I get another set of
numbers. If we all toss the die 100 times we all get different sets of
numbers. Who is right? Hopefully, the values are close together and
we can get a sense of the true values associated with the loaded die.
We need a general theory to deal with these discrepancies. We know that
the probabilities are long-run values, so how big should I make my sample
so that I feel I have a good estimate of the unknown values?
For drug tests: how many people would I need to test the drug on so I
can be certain, to some degree, whether the drug works or does not work? And
by “work” we mean that a certain proportion of the population, p, can be
helped by the drug.
General Theory
To understand how things work, we will assume, for the moment
(chapter 5), we know our parameters, and the distribution types, and
then show you what occurs when you sample and then calculate an
estimate of some parameter (a statistic).
Chapter 5 will use the two parameters µ, the population mean, and p, the
population proportion, as examples of what occurs to our estimates
(called statistics) when we sample.
A parameter (often called the true value) is a number that
describes some characteristic of a population. We think of this
number as being fixed (does not change). Now keep in mind that
if the characteristics that the parameter is describing in the
population change, so does the parameter. If the characteristic
does not change, neither does the parameter.
A statistic is an estimate of a parameter. The way a statistic is
gathered is by taking a sample from the population you are
interested in and then calculating the desired measurement.
For example, let us say we want to know the exact mean weight of men 18 years
old or older in the United States at this exact instant. By exact, I mean
the parameter µ. The only way to calculate that number is to measure every
male in the United States that meets the criteria; I need a census.
If I gathered a sample of 1000 men that fit the description, weighed
them, and then calculated the mean of the sample, this would be the sample mean (x̄), which is a statistic.
So a parameter is calculated from a census (the whole
population/sample space is measured), while a statistic is calculated
from a sample (subset of the population).
Here are some examples of parameters with their corresponding
statistics.
Parameter                            Statistic
µ – population mean                  x̄ – sample mean
σ – population standard deviation    s – sample standard deviation
p – population proportion            p̂ – sample proportion
To explain a sampling distribution one could use means or
proportions; both can convey the general theory we are looking to
explain. I will choose the population mean as my example.
Let the random variable X measure the time required for a person to go
from home to work in a large city. Suppose that the mean time
required for a person to go from home to work is 45.3 minutes (µ =
45.3 minutes; this is the population mean), with a standard deviation of
7.3 minutes (σ = 7.3 minutes is the population standard deviation).
Notice, I have not said what the distribution above is like; I have merely
stated what the mean and standard deviation of this population are. How
did I get that parameter? At the moment we are pretending we know all the
parameters, so that we can understand what occurs when we sample.
Suppose we sample four individual drivers at random from this city.
[Figure: four values x1, x2, x3, x4 sampled from this population]
I then take those four numbers and average them. What I have
just calculated is the sample average, x-bar.
x̄1 = (x1 + x2 + x3 + x4) / 4
This is one possible value of the sample mean.
Now, that one sample mean is one number out of infinitely many
possible numbers. If we repeated the sampling we would get
different numbers.
[Figure: two clouds of values. One is the population of some characteristic of a
population that I am interested in knowing more about (the individual x values).
The other is the population of a statistic (the sample means x̄1, x̄2, x̄3, ...) that
estimates some parameter and allows me to characterize some aspect of the
population I am studying.]
This creates a distribution of sample means (The sampling distribution of
the mean). In general terms we have created a distribution of some
statistic. This is the key to understanding and interpreting data results.
When you sample from a population you only generate one sample mean. But
that one sample mean is one number out of infinitely many possible numbers.
If we repeated the sampling we would get different numbers.
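Here is a minimal simulation sketch of that idea (my own illustration; NumPy and a normal population shape with µ = 45.3 and σ = 7.3 are assumptions, since the lecture does not specify the shape of the commute-time distribution):

```python
# Sketch: repeatedly draw samples of size 4 and compute x-bar each time.
# The population shape is assumed normal purely for illustration;
# the lecture only gives mu = 45.3 and sigma = 7.3.
import numpy as np

rng = np.random.default_rng(seed=1)
mu, sigma, n = 45.3, 7.3, 4

for trial in range(5):
    sample = rng.normal(mu, sigma, size=n)   # four commute times
    xbar = sample.mean()                     # one value of the sample mean
    print(f"Sample {trial + 1}: x-bar = {xbar:.2f}")
```

Every pass through the loop produces a different x-bar, which is the point: one sample gives one number out of the many possible values of the sample mean.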
And then we ask the next question. If my one number is part of an
infinite set of sample means, we understand that this creates a
distribution of sample means. The next logical question is “what is
the mean of the sample averages?”
µ_x̄ = ?
Now, this may seem strange, but the answer is not at all set in
stone. The mean of the sampling distribution will depend on how we
sample.
A better question to ask is, what would we want it to be?
Think about it for a second. This is a very important question. If
you could make the mean of the sample averages any number you
want, what would be that number?
µ_x̄ = 45.3
Whenever I have posed this question, eventually someone says
they would like the mean of the population we are sampling
from to equal the mean of the sampling distribution of the means.
This is exactly what we want! And here is the reason. Suppose
that it was not this way. What would be the ramifications?
[Figure: a sampling distribution of x-bar whose center is µ_x̄ = 37.9 rather than 45.3]
If the mean of the sample means were less, for example, the individual
sample means would consistently underestimate the actual
population mean of 45.3 minutes. This is not good.
µ_x̄ = 45.3
Now, equality of the means of the two distributions does not occur
by chance. It must be engineered to occur this way. How, you ask?
This is the purpose of the sections in chapter three’s
introduction, section 1 and section 2, which concern sampling
methods. Our sampling method should consistently produce
representative samples with no or very little bias.
µ_x̄ = µ = 45.3
If our sampling method creates sample means whose distribution
mean is not equal to the population mean we sampled from, then
the method of sampling is said to be biased.
This is the purpose of proper randomization! A properly
randomized sampling method should, in the long run, create an
unbiased sampling distribution mean:

µ_x̄ = µ = 45.3
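As an illustration of a biased method (my own hypothetical scenario, not from the lecture), the sketch below samples only commuters whose times fall below the population median; its sample means consistently land below µ = 45.3, while proper random sampling centers on it.

```python
# Sketch: a deliberately biased sampling method underestimates mu,
# while simple random sampling does not. Scenario is illustrative only.
import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma, n, reps = 45.3, 7.3, 4, 50_000

population = rng.normal(mu, sigma, size=1_000_000)            # stand-in population
short_only = population[population < np.median(population)]  # biased sampling frame

biased_xbars = rng.choice(short_only, size=(reps, n)).mean(axis=1)
random_xbars = rng.choice(population, size=(reps, n)).mean(axis=1)

print(f"Biased sampling: mean of x-bars = {biased_xbars.mean():.1f}")
print(f"Random sampling: mean of x-bars = {random_xbars.mean():.1f}  (mu = {mu})")
```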
What about the standard deviations of each group? How are they
related?
Well, in this case we cannot wish it to be true; it will be true because
of the physics of the situation, assuming that our sampling method is not
biased. But let’s ask the question anyway.
What would we like the relationship between σ_X and σ_x̄ to be:
σ_x̄ < σ_X, σ_x̄ = σ_X, or σ_x̄ > σ_X?

We would like σ_X > σ_x̄. Why? If σ_x̄ is small, then this means the x-bar values
will not vary by much; all possible x-bar values will be very close together. But if
that is the case, then we will get a good idea of what µ really is.

The exact relationship is given by

σ_x̄ = σ_X / √n
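A quick way to see both relationships, µ_x̄ = µ and σ_x̄ = σ_X/√n, is to simulate a large number of sample means and compare their mean and standard deviation to the theoretical values. As before, the normal population shape is my assumption for illustration.

```python
# Sketch: check mu_xbar = mu and sigma_xbar = sigma / sqrt(n) by simulation.
import numpy as np

rng = np.random.default_rng(seed=2)
mu, sigma, n, reps = 45.3, 7.3, 4, 100_000

# Draw many samples of size n; compute the mean of each one.
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(f"Mean of the sample means: {xbars.mean():.3f}   (mu = {mu})")
print(f"SD of the sample means:   {xbars.std():.3f}   (sigma/sqrt(n) = {sigma / n**0.5:.3f})")
```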
Now we have relationships established for the mean and
the standard deviations of both populations. You have
noticed that I have not talked about the distribution itself.
Is the distribution left skewed, right skewed, symmetric, or
some other distribution?
We now embark on one of the most important theoretical
facts about sampling distributions concerning means.
Let’s say that the distribution of times is uniformly distributed, with σ_X = 8.66.

I sample two values at a time randomly. Then I calculate the average. What does
the distribution for all possible averages of two numbers look like if we sampled
from a uniform distribution?

µ_x̄ = 45, and σ_x̄ = 8.66/√2 ≈ 6.12.
Still sampling from the same uniform distribution with σ_X = 8.66, I now sample
four values at a time randomly. Then I calculate the average. What does the
distribution for all possible averages of four numbers look like?

µ_x̄ = 45, and σ_x̄ = 8.66/√4 = 4.33.
As we continue to increase the sample size, the distribution
of sample means continues to take on the shape of a normal
distribution.

Suppose I sample 15 values at a time (again with σ_X = 8.66) and calculate the
sample mean. What does the distribution for all possible averages of fifteen
numbers look like if we sampled from a uniform distribution? The distribution is
approximately normal, with µ_x̄ = 45 and σ_x̄ = 8.66/√15 ≈ 2.24.
The Central Limit Theorem
Suppose we sample from a distribution that is not normally distributed,
and we calculate the sample mean. The sample mean belongs to a new
population (sample space) consisting of all possible sample means
for that particular sample size.
The Central Limit Theorem says that if the sample size is large enough, then
the distribution of the sample mean is approximately normal.
Sample n values (n has to be large enough) and calculate x-bar. Now, this one
value belongs to a population of all possible means (all having the same sample
size). The distribution of all possible x-bars is approximately normal.
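The uniform-distribution example above can be reproduced with a short simulation. In the sketch below I assume the commute times are uniform on the interval [30, 60] minutes; that interval is my choice, made so that µ = 45 and σ_X = (60 - 30)/√12 ≈ 8.66 match the numbers on the slides.

```python
# Sketch: the sampling distribution of x-bar when the population is uniform(30, 60).
# Population mean is 45 and population sd is (60 - 30)/sqrt(12), about 8.66.
import numpy as np

rng = np.random.default_rng(seed=3)
reps = 100_000
sigma_x = 30 / 12**0.5                    # about 8.66

for n in (2, 4, 15):
    xbars = rng.uniform(30, 60, size=(reps, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of x-bars = {xbars.mean():.2f}, "
          f"sd = {xbars.std():.2f} (theory: {sigma_x / n**0.5:.2f})")
```

A histogram of the x-bars for n = 15 would already look close to a bell curve, which is the Central Limit Theorem at work.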
Summary
Let the random variable X measure the time required for a person to go from home to work in a large city.
Suppose that the mean time required for a person to go from home to work is 45.3 minutes (µ = 45.3
minutes this is the population mean), with a standard deviation of 7.3 minutes (σ = 7.3 minutes is the
population standard deviation).
Population of interest: the variable X represents the measurement of interest.
The parameter of interest is the mean of this population, 45.3 minutes; notice
the symbol used to represent this number, µ_X.
To estimate µ_X, a sample of size n is gathered and x-bar is calculated. This one
value belongs to a distribution of all possible values of x-bar created by
averaging n values at a time. Notice the symbol change, µ_x̄, for the sampling
distribution of the mean.
For a large enough sample size,
even if the distribution I sample from is
not normal, the distribution of sample
averages will be approximately
normal.
If our sampling method is not biased, then

µ_x̄ = µ_X

Equality is not automatic; the sampling method makes this happen.
The variation of X is measured by σ_X. The variation of x-bar is measured by

σ_x̄ = σ_X / √n
What size sample would you want to estimate a mean value: a sample size of 50
(n = 50) or a sample size of n = 500? Assume both samples have no bias. Most
people would say 500. Why? They recognize it should be more accurate! What does
accurate mean? Less variability! Think about it. An accurate ball thrower can get the
ball to the target consistently.

The equation that defines the variation relationship between the two variables
suggests that as n increases, the variation of the distribution of x-bar
decreases.
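This decrease in variation can be checked directly: simulate many sample means with n = 50 and with n = 500 and compare how much they vary. The normal population with µ = 45.3 and σ = 7.3 is again my illustrative assumption.

```python
# Sketch: larger samples give less variable (more "accurate") sample means.
import numpy as np

rng = np.random.default_rng(seed=4)
mu, sigma, reps = 45.3, 7.3, 10_000

for n in (50, 500):
    xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    print(f"n = {n:3d}: sd of x-bar = {xbars.std():.3f} "
          f"(theory: {sigma / n**0.5:.3f})")
```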
The Central Limit Theorem makes a statement about the idealized
distribution (model) of statistics measuring centers. It says that if our
sample size is large enough then the distribution of averages is
approximately normal, even though the distribution we are sampling
from is not normal.
What occurs if the population we are sampling from is normal? Then
the sampling distribution of means is exactly normal regardless of
sample size.
Furthermore, we now know that if the sampling method is unbiased,
then the centers of the distribution we are sampling from and the
sampling distribution of the mean are equal:

µ = µ_x̄

Lastly, the relationship between the standard deviations is σ_x̄ = σ_X / √n.
This completes our idealized distribution, the realization of what
sampling distributions of means have to look like. Armed with this, we
are no longer as blind to what our data may be trying to tell us.
This concept also extends to sample proportions and population
proportions as we shall see in the next lesson.
View this as many times as you need to get a good handle on this
topic. Knowing this concept well will help you with the problems you
will encounter.
The End