Lab 3 for Math 17: Normality, Standardization, and Design/Sampling
1 Sampling with Toy Soldiers
Reference command in R/R Commander: sample(1:n, m) will draw m distinct random integers from 1
to n (sampling without replacement). You can change m and n to whatever you like, as long as m is
no larger than n. If you enter this command in the script window, you will need to highlight it
and click Submit before it will give you the values.
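As a minimal sketch of the reference command in use, the lines below pick 8 of 48 numbered figures. The values of n and m and the variable name picked are only illustrative choices, and set.seed() is optional; it is included so the same labels come up each time you run it.

```r
set.seed(1)               # optional: makes the draw reproducible
n <- 48                   # figures in the bag, numbered 1 to 48
m <- 8                    # sample size
picked <- sample(1:n, m)  # m distinct labels, drawn without replacement
picked                    # the labels of the figures to measure
```

To use this, number your figures first, then measure only the figures whose labels appear in picked.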
How tall is a toy soldier, on average? To answer this question, each group will receive a bag of either
police figures (blue) or firefighter figures (red). We want to look at ways of sampling using the toy
soldiers and compare the sample mean estimates we get from each sampling procedure. There are
supposed to be 48 figures per bag, and you may treat your bag as the population of interest. You
may take any sample size you like as long as it is not greater than 8 figures for each method
(strictly speaking, samples this large are too big relative to the population size, but the
populations are very small).
1. How would you obtain a simple random sample of figures? Take an SRS and obtain a sample
mean for the height of your figures.
2. How would you obtain a stratified random sample of figures? What would the strata be? Take
a stratified RS and obtain a sample mean for the height of your figures.
3. How would you obtain a cluster sample of figures? What clusters would you use? Is it appropriate to treat the strata above as the clusters? If possible, take a cluster sample and obtain a sample
mean for the height of your figures.
4. How would you obtain a systematic sample of figures? What listing of figures would you use?
What value of k would you choose? Take a systematic sample and obtain a sample mean for the
height of your figures.
5. Take a convenience sample and obtain a sample mean for the height of your figures.
6. What sampling method do you think does the best job in this context to get a random, representative sample from the population?
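The four random sampling methods above can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame figures, its made-up heights, and the choice of "pose" as the grouping variable are all hypothetical, and your group may well decide on different strata, clusters, or value of k.

```r
set.seed(17)  # fixed seed so the sketch is reproducible

# Hypothetical population: 48 figures in 6 poses (8 per pose), made-up heights in cm.
figures <- data.frame(
  id     = 1:48,
  pose   = rep(paste0("pose", 1:6), each = 8),
  height = round(rnorm(48, mean = 5, sd = 0.3), 2)
)

# Simple random sample of 8 figures.
srs <- figures[sample(1:48, 8), ]

# Stratified sample: 1 figure drawn at random from each pose (6 figures in all).
strat_ids <- sapply(split(figures$id, figures$pose),
                    function(ids) ids[sample(length(ids), 1)])
strat <- figures[strat_ids, ]

# Cluster sample: pick 1 pose at random and measure every figure in it.
chosen_pose <- sample(unique(figures$pose), 1)
cluster <- figures[figures$pose == chosen_pose, ]

# Systematic sample: random start in 1..k, then every k-th figure on the list.
k <- 6
start <- sample(1:k, 1)
syst <- figures[seq(start, 48, by = k), ]

# Sample mean height from each method.
c(srs = mean(srs$height), stratified = mean(strat$height),
  cluster = mean(cluster$height), systematic = mean(syst$height))
```

With real figures you would replace the made-up height column with your own measurements; the sampling lines stay the same.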
2 Capture-Recapture Sampling
1. This is a repeated sampling process, requiring at least 2 samples with usually a pre-determined
length of time between samples.
2. You take the first sample (which may or may not have fixed size), and you somehow mark each
subject in the sample, then release the subject (fish, bird, etc.). Marking should be done in such a
way that it does not affect survivability of the subjects in any way.
3. Later, you take a second sample, and count the number of marked subjects in the second sample
(i.e. the number in the second sample that were also in your first sample).
4. Claim: the proportion marked in your second sample should be roughly equal to the proportion
of the whole population that you marked in the first sample, assuming both samples are SRSs and
there wasn’t too much disturbance to the population between the first and second samples (e.g., it
is a complication if marked subjects die, but adjustments can be made in practice).
So, there are four quantities of interest: N = estimate of population size, M = number marked in
the first sample, C = number captured in the second sample, and R = number in the second sample
that are marked. The presumed relationship is M/N = R/C or, equivalently, C/N = R/M.
Solve for N in terms of the other three quantities:
Let’s try it. Say your first sample consists of 200 fish which you tag and release. A year later,
you take another sample and catch 120 fish, of which 12 are marked. What is your estimate of the
population size?
How about if your original sample was 60, and 2 weeks later you catch 100, of which 5 are marked?
What would your population estimate be?
Do you think the first or second sample is usually larger? Why? Are there any possible issues with
one sample being a lot bigger than the other?
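Once you have worked the algebra by hand, the estimate can also be computed in R. The sketch below rearranges M/N = R/C into N = M * C / R. The function name estimate_population and the numbers in the example call are hypothetical and are not taken from the questions above.

```r
# Capture-recapture estimate from M/N = R/C, i.e. N = M * C / R.
estimate_population <- function(M, C, R) {
  stopifnot(R > 0)  # with no marked recaptures the ratio is undefined
  M * C / R
}

# Hypothetical example: mark 50, later catch 40, of which 10 are marked.
estimate_population(M = 50, C = 40, R = 10)  # 200
```

The check on R is there because a second sample with no marked subjects gives no finite estimate.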
3 Trying Out Capture-Recapture Sampling
We are going to examine Capture-Recapture sampling for ourselves to see how it works and how
reliable it might be. Each group (8 groups) will receive the following materials: cheddar population
of goldfish, marked replacements (the colored goldfish), sampling implements (dixie cups and brown
bags), and napkins. Groups one, two, three, and four will only mark one cup of goldfish each, while
groups five, six, seven, and eight will mark two cups worth each. All other directions are below.
You may eat unused goldfish, but don’t accidentally eat a population member! Your group must
decide what to do about partial goldfish. The questions will guide you to record R, C, and M at
each relevant step.
To start: Empty your cheddar goldfish population into the brown bag. Also, write down your group
number so you “mark” the correct amount. Group number:
1. In your group, estimate the population size (you may not use serving info from the bag).
Our estimate of the goldfish population is:
2. Based on your group number, capture the specified “amount” of goldfish from the bag, and
count how many that actually was. Then, replace those goldfish with the marked goldfish. How
many goldfish did you mark? You may place captured goldfish that were replaced back in the
original bag to snack on.
3. Shuffle up the population in the sampling bag (you can determine how best to do this).
4. Sample one cup worth of goldfish. How many did you catch? How many are marked? What is
your estimate of the population size from this sampling? Return those goldfish to the population.
5. Sample two cups worth of goldfish. How many did you catch? How many are marked? What is
your estimate of the population size from this sampling? Return those goldfish to the population.
6. Sample three cups worth of goldfish. How many did you catch? How many are marked?
What is your estimate of the population size from this sampling? Return those goldfish to the
population.
7. Determine the actual size of the population of goldfish you had. How accurate were your
results using capture-recapture sampling?
Class discussion notes:
Ideally, we would have repeated some of the sampling so you all could see the variability in the
estimates, but due to time constraints we could not do that. However, capture-recapture sampling
is a technique that has been extensively studied, with various modifications available depending on
what you are studying, and has been used to estimate population sizes (and related quantities) for
everything from zebras and tigers to human drug addicts.
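The variability mentioned above can be explored by simulation. The following is a rough sketch, assuming a hypothetical population of 300 goldfish with 50 marked and a second sample of 60 drawn as an SRS; none of these numbers come from the lab. Under those assumptions the number of marked fish in the second sample follows a hypergeometric distribution, which rhyper() draws from.

```r
set.seed(3)  # fixed seed so the sketch is reproducible

N_true <- 300  # hypothetical true population size
M <- 50        # hypothetical number marked in the first sample
C <- 60        # hypothetical size of the second sample

# Number of marked fish in each of 1000 simulated second samples
# (drawn without replacement from M marked and N_true - M unmarked fish).
R <- rhyper(1000, m = M, n = N_true - M, k = C)

# Capture-recapture estimate N = M * C / R for each simulated sample,
# skipping any sample that happened to contain no marked fish.
estimates <- M * C / R[R > 0]

summary(estimates)  # spread of the estimates around the true size
hist(estimates, main = "Simulated capture-recapture estimates")
```

Changing M or C and rerunning shows how the spread of the estimates depends on the two sample sizes.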
4 Normality and Standardization
Attempt all problems on your own, then you may discuss with those around you.
1. Normal distributions are characterized by these two values (names and notation):
2. Saying a distribution is standard normal means what?
3. Which of the following are not valid normal distributions? Why?
a. N(0,6)
b. N(50,.0001)
c. N(-30,-10)
d. N(-200,60)
4. True/False. The empirical rule is valid for all distributions for a quantitative variable.
5. Suppose a test results in normally distributed scores with a mean of 75 and standard deviation of 12. Compute a z-score for Betty’s score of 84.
6. What is the probability a student scored 75 or higher?
7. What is the probability a student scored 84 or higher? (if you are interested in learning
how Rcmdr can compute this so you can check your answer from the table, just ask)
8. The top 25 percent of students scored above what value?
9. Give an approximate interval for the values scored by the middle 68% of students for this
test.
10. Another test taken by 16 students results in a sample mean of 40 and sample standard deviation
of 5. Give appropriate notation for each statistic related to the test. Then, interpret this standard
deviation.
11. Assume the new test also has normally distributed scores and treat the mean and standard
deviation provided as µ and σ. What score on the second test is equivalent to Betty’s 84 on the
first test?
12. Name one of the required principles of experimental design.
13. Describe the difference between stratified and cluster sampling.
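If you want to check table-based answers in R, the functions pnorm() and qnorm() do the work. The sketch below uses a hypothetical test with mean 100 and standard deviation 15 and a hypothetical score of 130, not the numbers from the problems above, and the variable names are only illustrative.

```r
# Hypothetical test: scores are normal with mean 100 and sd 15; one student scored 130.
mu <- 100
sigma <- 15
score <- 130

# z-score: how many standard deviations the score lies above the mean.
z <- (score - mu) / sigma  # 2

# Probability of scoring 130 or higher (upper-tail area).
p_above <- pnorm(score, mean = mu, sd = sigma, lower.tail = FALSE)

# Score that the top 10 percent of students exceed (the 90th percentile).
cutoff <- qnorm(0.90, mean = mu, sd = sigma)

# Equivalent score on a second hypothetical test with mean 60 and sd 8:
# keep the same z-score and convert back to the new scale.
equivalent <- 60 + z * 8  # 76

c(z = z, p_above = p_above, cutoff = cutoff, equivalent = equivalent)
```

Setting lower.tail = FALSE gives the area to the right of the score; leaving it out gives the area to the left.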
5 QQ Plots
We took a look at some QQ plots in class, but now you will get to sample from distributions
yourselves and see the variety you can get in plots from both normal distributions and other distributions.
1. To begin, we are going to sample from the normal distribution. In Rcmdr, go to the Distributions menu, select Continuous distributions and then select Normal distribution. From those
4 options, select the last one - sample from a normal distribution. A new window will open up.
Leave the mean and standard deviation alone for now, but change the number of samples to 50 (50
observations), and observations to 1 (this is the number of variables). Be sure to uncheck the option
to add the means to the data set. Then click OK. Now, under Graphs, select Quantile-comparison
plot. Only one variable is available, and we do want to compare it to the normal distribution, so
just click OK. The plot opens in a second window. How does this QQ plot look to you? Would
you say that this sample came from a population which was normally distributed? Explain.
2. Sample again from the normal distribution, but change the mean and standard deviation to
something else. Generate a QQ plot for your new data. Based on the plot, would you say that this
sample comes from a population which is normally distributed? Why?
3. Generate 50 rows of data from a t distribution (still under continuous) with 10 degrees of
freedom. Generate a QQ plot for your new data. Based on the plot, would you say that this
sample comes from a population which is normally distributed? Why?
4. Try again with a t with 1 degree of freedom and a t with 100 degrees of freedom. Which t
distribution looked the least normal? Which looked the most normal? How could you tell?
5. Finally, repeat the process for the F distribution with 6 numerator degrees of freedom and
23 denominator degrees of freedom. Generate a QQ plot for this data. Based on the plot, would
you say that this sample comes from a population which is normally distributed? Why?
6. Real data: Open the trees.dat data set from last week’s lab. Generate a QQ plot and histogram for the data set. Describe what you see in both plots. Would you say that this sample of
trees comes from a population which is normally distributed? Do you see any “edge” effects? If so,
why might those be occurring?
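The simulated samples in this section can also be generated from the script window instead of the menus. This is a minimal sketch: base R's qqnorm() and qqline() are used here as a stand-in for the Rcmdr menu plot, so the picture will not look identical, and the mean of 0 and standard deviation of 1 for the normal sample are assumed values.

```r
set.seed(5)  # fixed seed so the sketch is reproducible

# 50 observations from each distribution named in the steps above.
samples <- list(
  normal = rnorm(50, mean = 0, sd = 1),    # normal (assumed mean 0, sd 1)
  t_10   = rt(50, df = 10),                # t with 10 degrees of freedom
  t_1    = rt(50, df = 1),                 # t with 1 degree of freedom
  f_6_23 = rf(50, df1 = 6, df2 = 23)       # F with 6 and 23 degrees of freedom
)

# Normal QQ plot with a reference line for each sample, in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (name in names(samples)) {
  qqnorm(samples[[name]], main = name)
  qqline(samples[[name]])
}
par(old_par)  # restore the previous plot layout
```

Rerunning without set.seed() gives fresh samples each time, which is a quick way to see how much QQ plots vary even when the population really is normal.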
6 To Turn In
Write brief (2-3 sentences per question is fine!) answers on a separate sheet to the two questions
to turn in.
1. You are thinking about hosting a Halloween party in a few weeks for your friends. You have two
possible venues. At the first, party costs follow a normal distribution with mean 250 and standard
deviation 16. At the second venue, party costs follow a normal distribution with mean 235 and
standard deviation 25. If you plan to spend 240 dollars on the party, is that a more ‘unusual’ party
for the first or second venue? Explain in one sentence. If you have a maximum of 260 dollars to
spend without going over budget, which venue would you choose and why?
2. You are assisting in an experiment on tomato plants, but first, you are helping select the
sample of plants to use in the experiment. There are 64 pallets of basically identical plants available (i.e., all roughly the same age, species, and size). Each pallet has 16 plants arranged in 4 rows of 4
plants each. Suppose you want a sample of 32 plants for the experiment. Describe a sampling
strategy to get your 32 plants that will result in a representative sample from the population (all
64 pallets).