Commentary on Distribution of Sample Means Class Notes
First & foremost: read your text (Chapter 7)! It will make all of your notes &
commentary make much better sense.
Just as a refresher, the first couple of rows in your Class Notes contain material already covered,
but helpful as we move ahead & tackle the subject of sample means.
Before we start performing hypothesis testing (inferential statistics, making statements about
populations based upon sample data) we are kicking it up a notch & moving closer to obtaining
the most accurate data necessary for making statements about populations based upon sample
data.
So far, the original data we have been working with since the beginning has been individual
values or scores. First, we worked with basic probabilities to learn not just how probabilities that
are known are obtained, but to discover how others use information & probabilities to predict the
unknown. In our last assignment, we learned to convert raw scores to standard scores for the
following reasons: 1) Converting raw scores to standard scores, or z-scores, gives every
distribution the same mean, or average (0), & the same standard deviation (the standard distance
of the scores from the mean) (1.00). 2) Converting raw scores to standard scores allows us to
compare dissimilar data sets or values, & allows us to identify exactly where on the distribution
each score lies.
Then we brought in the Unit Normal Table to combine probability & standard scores to again,
make predictions of where, on a distribution, certain scores may fall. Do you see the pattern
(hint: the bolded, italicized, underlined words)?
The next step now is to work with larger data sets so that we can make even more informed
predictions about a population based on sample data. This is where sample means come into
play. As we are making statements about population data using sample means, it is assumed that
the values are equal to population data (review the Central Limit Theorem). The two most
important descriptive values we use in statistics are the mean & standard deviation, so this is
where we begin.
I will be using a realistic example, but to keep the display simple, I am reducing the
number of scores just so you can “see” the concept of obtaining sample means.
I want to evaluate the MMPI personality scores of schizophrenics who are in in-patient
facilities throughout the state. Instead of gathering all of the personality scores at once, I gather
the scores from each facility in the state. (Hypothetically) there are 5 in-patient facilities, each
currently with 5 schizophrenic patients to be tested. Their results are as follows:
N = 5 facilities (n = 5 patients in each facility)

Facility      MMPI scores for each of the 5 patients    Mean (M = ∑X/n)
Facility 1    45, 50, 66, 79, 100                       340/5 = 68      (M1 = 68)
Facility 2    33, 38, 44, 59, 65                        239/5 = 47.8    (M2 = 47.8)
Facility 3    55, 63, 72, 90, 112                       392/5 = 78.4    (M3 = 78.4)
Facility 4    58, 62, 65, 88, 90                        363/5 = 72.6    (M4 = 72.6)
Facility 5    40, 60, 87, 100, 120                      407/5 = 81.4    (M5 = 81.4)

Mean of the Sample Means: µM = ∑M/N = (68 + 47.8 + 78.4 + 72.6 + 81.4)/5 = 348.2/5 = 69.64
Notice that the formula used for computing a regular sample mean is exactly the same as when
we compute the mean of the sample means (i.e., the average of all the sample means). What is
different are the notations that are used.
Since we are now working with a set of means instead of individual scores, then this means that
we need new symbols & formulas representing sample means instead of regular single x-values
if we are to calculate the probability of obtaining certain population values. Each sample mean is
represented as “M.” “N” represents the number of sample means that you are working with.
Remember that “N” represents population numbers where ‘n’ represents sample numbers. Since
our means are pulled from the full population that we are working with, then we use the capital
‘N’ as the value representing the number of sample means. Above, we obtained 5 sample means,
so our ‘N’ value is 5 (Also, each sample set had 5 values in it, so each sample set from each
facility would be n = 5)
Now we have what is called the “mean of the sample means” since we took each sample mean,
& then utilized the mean formula (with changes in our symbols) (∑M/N) to obtain the “mean” of
the sample means. This is also called the expected value of M.
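If it helps to see the arithmetic spelled out, here is a minimal Python sketch of the example above. The scores are the ones from the facility table; the variable names (facilities, sample_means, mean_of_means) are just labels chosen for this illustration.

```python
# MMPI scores for the 5 hypothetical facilities in the example (n = 5 each)
facilities = {
    "Facility 1": [45, 50, 66, 79, 100],
    "Facility 2": [33, 38, 44, 59, 65],
    "Facility 3": [55, 63, 72, 90, 112],
    "Facility 4": [58, 62, 65, 88, 90],
    "Facility 5": [40, 60, 87, 100, 120],
}


def mean(values):
    """Sum of the values divided by how many there are (M = sum(X) / n)."""
    return sum(values) / len(values)


# Step 1: one sample mean (M) per facility
sample_means = {name: mean(scores) for name, scores in facilities.items()}

# Step 2: the same formula again, applied to the set of sample means
N = len(sample_means)  # number of sample means we are working with
mean_of_means = sum(sample_means.values()) / N

for name, m in sample_means.items():
    print(f"{name}: M = {m:.1f}")
print(f"Mean of the sample means = {mean_of_means:.2f}")
```

The same mean function is used at both steps; only the thing being averaged (single scores, then sample means) changes.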
To obtain the standard deviation of our new set of sample means, we also experience
modifications in formulas & symbols. When we calculate the standard deviation of sample
means, it is called the standard error of M. First, we must compute the population standard
deviation formula (which you already know) since our data represents the full population. But
we add an extra step. After we calculate the population standard deviation, we then divide its
results by the square root of N to obtain our Standard Error of M, or the standard distance of all
sample means from the center (the average; the mean). Review your class notes as well as the
table at the end of your class notes to review the new symbols & formulas associated with
sample means.
Using the example above, the standard error of M would look like this:
M        M²
68       4624
47.8     2284.84
78.4     6146.56
72.6     5270.76
81.4     6625.96

∑M = 348.2      ∑M² = 24952.12      (∑M)² = 121243.24

σ = √[ (∑M² - (∑M)²/N) / N ]
  = √[ (24952.12 - 121243.24/5) / 5 ]
  = √[ (24952.12 - 24248.648) / 5 ]
  = √(703.472/5) = √140.6944 = 11.861467
We have our population standard deviation value = 11.861467. Since we are working with
sample means & not regular x-values, we need to add an extra step. We need to divide our SD
value by the square root of N.
11.861467/√5 = 11.861467/2.2360679 = 5.304609…
Our standard error of M, or σM = 5.304609…
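As a cross-check on the hand calculation, the sketch below follows the same two steps described in these notes: the population standard deviation formula applied to the five sample means, then the extra division by the square root of N. The variable names are assumptions made for this illustration only.

```python
import math

# The five sample means from the facility example
sample_means = [68, 47.8, 78.4, 72.6, 81.4]
N = len(sample_means)

# Pieces of the computational formula
sum_m = sum(sample_means)                     # sum of M
sum_m_sq = sum(m ** 2 for m in sample_means)  # sum of M squared
ss = sum_m_sq - sum_m ** 2 / N                # sum of squared deviations

# Population standard deviation of the set of sample means
sd = math.sqrt(ss / N)

# Extra step used in these notes: divide the SD by the square root of N
standard_error = sd / math.sqrt(N)

print(f"SS = {ss:.3f}")
print(f"SD = {sd:.6f}")
print(f"Standard error of M = {standard_error:.6f}")
```

Carrying the full decimal values through each step, rather than rounding early, is what keeps the final value in line with the hand calculation.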
Since our sample sets have different means & different standard deviations, our next step is to
convert these values to standard scores. The steps for converting to z-scores are the same as we
used last week, but the symbols have changed since we are now working with sample means. We
now have the values that we need for converting our sample means to standard scores, or
z-scores. See the formula at the top of the second column.
M        ZM = (M - µM) / σM
68       (68 - 69.64)/5.304609 = -1.64/5.304609 = -.309165
47.8     (47.8 - 69.64)/5.304609 = -21.84/5.304609 = -4.117174
78.4     (78.4 - 69.64)/5.304609 = 8.76/5.304609 = 1.651394
72.6     (72.6 - 69.64)/5.304609 = 2.96/5.304609 = .558005
81.4     (81.4 - 69.64)/5.304609 = 11.76/5.304609 = 2.216940
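The z-score conversions can be checked with a few lines of Python. This is only a sketch: mu_m and sigma_m are the mean of the sample means and the standard error of M computed earlier in these notes, and the function name z_score is a label chosen for this illustration.

```python
# Values carried over from the example in these notes
sample_means = [68, 47.8, 78.4, 72.6, 81.4]
mu_m = 69.64        # mean of the sample means (expected value of M)
sigma_m = 5.304609  # standard error of M


def z_score(m, mu_m, sigma_m):
    """Subtract first, then divide: (M - mu_M) / sigma_M."""
    return (m - mu_m) / sigma_m


z_scores = [z_score(m, mu_m, sigma_m) for m in sample_means]

for m, z in zip(sample_means, z_scores):
    print(f"M = {m:>5}: z = {z:.6f}")
```

Note the order of operations: the subtraction in the numerator is done before the division, which is why the formula is written with parentheses.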
Before moving on with our next step, let’s look at our z-scores that we have listed above in
comparison to our standard curve (distributing each sample mean from a population will result in
a normal or near normal curve due to the rules re: distribution of sample means. This is also
represented in figure 7.7 on page 215 of the 9th edition):
More than +/- 2.00 standard deviations from the mean would be considered significant. All of
our z-scores above fall within the average range with the exception of two. Our M = 81.4
corresponds to a z-score of 2.216940, just slightly into the significant range on the positive
side. This mean comes from Facility 5. If you add up the % under the normal curve beyond
+2.216940, we have: 1.7 + .05 + .01 = 1.76%, meaning that there is a 1.76% chance of obtaining
a mean of 81.4 or greater, which is rare. But we also have an outlier. Facility 2 reports a mean
of 47.8, which has a standard score of -4.117174. As you can see on the distribution, the probability
of obtaining a mean score of 47.8 is 0.1%, far beyond what would be considered normal or average.
Do you see how all of this is coming together now? Keep in mind that the values we have used in
the example above are minimal in comparison to a more valid study where far larger amounts of
data are evaluated. The law of large numbers indicates that the greater the number of scores
in a sample, the less variability there will be between sample & population data (makes sense,
right?). As the sample size (n) reaches 30 or more, the distribution of sample means is considered
normal. This does not mean that outliers will not present themselves, but it does mean that the
probability of outliers will be reduced. The bulk of the values will cluster around the mean &
then thin out as they move away from the mean (as you can see in the normal curve).
We use sample means data to make predictions. So far, our steps involve:
1. Obtain sample values from the population to be studied.
2. Calculate the mean from each sample set.
3. Obtain the mean of the sample means (expected value of M).
4. Calculate the standard error of M (the standard distance of all sample means from the
mean of the sample means).
5. Convert sample means to standard scores (since each sample set has a different mean &
SD, we must convert scores so that they have the same mean & same SD to compare
these values).
6. These values are assumed to be equal to population data. Obtaining the requisite number
of values allows us to make predictions based upon the normal curve.
This is where the last part of each one of your assignment problems comes in. You already
worked on using standard scores or z-scores to obtain probabilities in the previous assignment.
So, using the values in your assignment, you will again be asked to obtain the probability of
obtaining a certain score.
Using our example above: “What is the probability of obtaining a sample mean of 78.4 or
greater?” We already calculated the z-score of 1.651394 for this particular sample mean. So we go
to our z-score table in the back of our text & locate the z-score of 1.65. The z-score is positive &
the question uses the term “or greater,” which means we look to the right of the score. The area to
the right of a positive value on the distribution is less than 50% of the distribution. So this
tells us that we are looking for the proportion in the tail, which in our table gives us .0495 or
4.95%. In other words, there is a 4.95% chance of obtaining a sample mean of 78.4 or greater.
Since the values are assumed to be equal to a population, we can still make a prediction based
upon a value that is related to our subject matter (in this case, Personality Scores) even though
the value may not be in our original sample mean list. For instance, what if we wanted to
determine the probability of obtaining a sample mean of 75 or less? 75 is not in our sample mean
set. All we would need to do in this case is to convert our sample mean M=75 to the standard
score by plugging it into the z-score formula:
(75 - 69.64)/5.304609 = 5.36/5.304609 = 1.010442
We find the z-score 1.01 in our Unit Normal Table. This value falls on the positive side of the
distribution. Since the question asks “or less,” we know we are looking to the left of the
value. Therefore, we are working with all of the negative side plus 1.01 SD on the positive side,
giving us more than 50% of the distribution covered. Thus, we want the proportion in the body,
which is .8438. So there is an 84.38% chance that we will obtain a sample mean value
of 75 or less.
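If you want to check your table lookups, the proportions in the Unit Normal Table can be approximated with Python's standard library. This is a sketch under a few assumptions: the helper names (normal_cdf, z_for_mean) are made up for this illustration, and the z-score is rounded to two decimals to mimic looking it up in a printed table.

```python
import math

mu_m = 69.64        # mean of the sample means from the example
sigma_m = 5.304609  # standard error of M from the example


def normal_cdf(z):
    """Proportion of the standard normal distribution to the LEFT of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def z_for_mean(m):
    """z-score for a sample mean, rounded to 2 decimals like a table lookup."""
    return round((m - mu_m) / sigma_m, 2)


# "78.4 or greater": positive z, look to the right (proportion in the tail)
p_tail = 1 - normal_cdf(z_for_mean(78.4))

# "75 or less": positive z, look to the left (proportion in the body)
p_body = normal_cdf(z_for_mean(75))

print(f"P(M >= 78.4) is about {p_tail:.4f}")
print(f"P(M <= 75)   is about {p_body:.4f}")
```

The body and tail for any z always add up to 1.00, so either proportion can be recovered from the other.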
It is this logic that scientists, educators, politicians, sports organizations, etc… use to make
predictions about the future.
Notice the different levels where we were working with values in-between whole numbers.
Our sample means involved values to the right of the decimal. Our expected value of M was
a decimal value. Our standard error of M is a decimal value. All of our z-scores are
decimal values. If you do not utilize the values to the right of the decimal anywhere along
the line, it throws off subsequent calculations to the point where your outcome will be
slightly off, or even wrong. So mind your decimals, please. And remember we are working
with values in-between whole numbers!!
The rules re: distribution of sample means are listed in your class notes. Look first under the
‘application’ section in the ‘distribution of sample means’ row.
1. The anchor, or cornerstone, of the concept of the distribution of sample means is The
Central Limit Theorem: Provides a precise description of the distribution that would be
obtained if you selected every possible sample, calculated every sample mean, &
constructed the distribution of the sample means. So, for any population with mean µ &
standard deviation σ, the distribution of sample means for sample size n will have a mean of µ &
a standard deviation of σ/√n, & will approach a normal distribution as n approaches
infinity.
The value of the central limit theorem comes from 2 different facts: 1. It describes the
distribution of sample means for any population, no matter what shape, mean or standard
deviation. 2. The distribution of sample means approaches a normal distribution very
rapidly. So, by the time n=30, the distribution is almost perfectly normal. The Central
Limit Theorem identifies with the 3 basic characteristics that describe any distribution:
shape, central tendency & variability.
2. The next important rule is The shape of the distribution of sample means: Will be
almost perfectly normal if either of the following two conditions is satisfied: 1. The
population from which the samples are selected is a normal distribution. 2. The number
of scores in each sample is relatively large, around 30 or more. However, increasing the
sample size beyond 30 does not produce much additional improvement in how well the
sample represents the population.
3. Third, there is The law of large numbers: States that the larger the sample size (n), the
more probable it is that the sample mean will be close to the population mean. There is an
inverse relationship between sample size & standard error. Bigger samples = smaller error;
smaller samples = bigger error. Because of this rule, if you have n = 1, then the standard
error & standard deviation are the same (σM = σ). So, the standard error formula
(σM = σ/√n) satisfies the following 2 requirements: 1) As sample size (n) increases (↑),
standard error decreases (↓). 2) When the sample consists of a single score (n=1), the
standard error is the same as the standard deviation (σM = σ).
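To “see” these rules at work, here is a small Python sketch that literally selects every possible sample from a tiny population, calculates every sample mean, & then checks the results against µ & σ/√n. The population (2, 4, 6, 8) & the sample size n = 2 are hypothetical values chosen only for this illustration, & sampling is assumed to be with replacement.

```python
import itertools
import math

population = [2, 4, 6, 8]  # hypothetical population, for illustration only
n = 2                      # hypothetical sample size


def mean(values):
    return sum(values) / len(values)


def population_sd(values):
    """Population standard deviation: sqrt(SS / N)."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


# Every possible sample of size n (sampling with replacement)
samples = list(itertools.product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)
sigma = population_sd(population)

expected_value_of_m = mean(sample_means)      # should match mu
standard_error = population_sd(sample_means)  # should match sigma / sqrt(n)

print(f"mu = {mu}, mean of all sample means = {expected_value_of_m}")
print(f"sigma / sqrt(n) = {sigma / math.sqrt(n):.4f}, "
      f"SD of all sample means = {standard_error:.4f}")

# Bigger samples = smaller error; with n = 1 the error equals sigma
for size in (1, 4, 25, 100):
    print(f"n = {size:>3}: standard error = {sigma / math.sqrt(size):.4f}")
```

With only 4 scores & n = 2 there are just 16 possible samples, which is small enough to list by hand if you want to verify the output.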
Changes from then to now:
1) Distribution of Sample Means. When engaging in calculations, so far your X-value has
represented a single score or a single value in a data set. For the distribution of sample
means, your X-value now becomes a mean from a sample pulled from the
population. So as to distinguish a single score (X) from a sample mean (M), we change
that symbol to “M” as indicated. However, due to the flexibility of the use of the symbol
of “X,” you may still see “X” as the symbol of a sample mean, so do not be surprised if
you still see “X” when referring to a sample mean. Just make sure that you review the
material surrounding this value so you know whether or not you are working with a
single score or a sample mean.
2) You will start to see “subscripts” here. So far, we have been working with
“superscripts,” which are the small values listed above & next to the value (such as
when we square a value, it is represented as 8². The small “2” at the top tells us that we
are multiplying the value by itself. If there were a “3” in that place, then that would tell us
to use the value as a factor 3 times (8 x 8 x 8), etc…). Subscripts are listed closer to the
bottom & next to the value instead of upper & next. Subscripts are most often used to
represent a particular group the value belongs to. They serve no mathematical purpose. For
example, if you see M1 = 131, then that tells you that 131 is the sample mean belonging
to sample group 1. M2 = 119 tells you that 119 is the sample mean belonging to sample
group 2, M3 = 138 is the sample mean associated with sample group 3, etc…
3) Your next step would then be to calculate the mean of this group of sample means (which
you would calculate exactly as you have with descriptive data). MM = ΣM / N. Notice the
new symbol representing the ‘mean of the sample means’ (also called the ‘expected value
of M’) (MM). As this value would be the same as if you were to calculate the mean from
each value from the full population, the ‘mean of the sample means’ is also represented
as a population mean: µM. Although we are using the new symbol µM to represent the
mean of the sample means, due to the central limit theorem, we can assume that the µM =
µ or the population mean.
4) Your next step is the standard error of M, which is the standard deviation of the
distribution of sample means. You still need to first calculate the population standard deviation
(population instead of sample since our group of sample means is representative of the
full population), but then you add an extra step. You divide σ (population SD) by √N.
Keep in mind that in this case, N represents the number of sample means you are working
with; not the total count of every number in your data set. See your class notes for an
example of calculating the mean of the sample means & then the standard error of M.
5) Look at the change in your Z-score formula in your class notes. Now, your numerator
represents the sample mean – the population mean (or the mean of the sample means)
divided by the standard error of M, or the standard deviation of the sample means. Since
the mean of the sample means would be the same as the population mean, you will often
see these symbols used interchangeably at this point (µM = µ):
ZM = (M - µ) / σM
6) Now that we have our z-score associated with our set of sample means & we are working
with a normal curve, we can then follow through with the same process for obtaining
probabilities as we did when we were working with a set of single scores. Then, we were
making a probability statement based upon a sample set. This time, we are making probability
statements regarding the full population. Review your previous notes & assignments re: obtaining
probabilities & using the Unit Normal Table.