Analysis of Environmental Data Problem Set
Conceptual Foundations:
Nonparametric Inference: Ordinary Least Squares & More
Answer all three questions:
1. Consider a hypothetical study of wood frog larval
abundance in vernal pools. Let’s say that you sample
10 vernal pools by taking 10 dip net sweeps of the
water column in random locations throughout each
pool and record the presence/absence of wood frog
tadpoles in each sweep. The raw data are given here as
the number of successful sweeps for each of the 10
pools, and displayed in the accompanying histogram:
4 2 3 3 6 4 3 5 2 6
Let’s say that you are willing to assume that the
probability of tadpole capture (prob) is the same for
every sweep. Answer the following questions:
a. Without having to explicitly assume a probability
distribution for these data, what is the ordinary least squares estimate of the average per trial
probability of success (prob)? Alternatively, can you show that the sample mean proportion of
successes is in fact the estimate that results in the minimum sum of squared errors?
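The computation in (a) can be sketched as follows, using the sweep counts given above. The grid scan is just an illustration that the sum of squared errors is minimized at the sample mean; it is not part of the problem itself.

```python
# Sketch: verify that the sample mean proportion minimizes the SSE.
successes = [4, 2, 3, 3, 6, 4, 3, 5, 2, 6]   # successful sweeps per pool
props = [s / 10 for s in successes]           # proportion of successes per pool

mean_prop = sum(props) / len(props)           # OLS estimate of prob

def sse(p):
    """Sum of squared deviations of the observed proportions from p."""
    return sum((x - p) ** 2 for x in props)

# Scan a fine grid of candidate values of p; the minimum SSE occurs
# at the grid point equal to the sample mean.
grid = [i / 1000 for i in range(1001)]
best_p = min(grid, key=sse)

print(round(mean_prop, 4))   # 0.38
print(best_p)                # 0.38
```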
b. Construct a 95% parametric confidence interval for the mean above, based on the following
equation:

   p-bar ± t(0.025, n-1) * sp / sqrt(n)

Note, sp is the standard deviation in p (proportion of successes), n is the sample size, and the
ratio of the standard deviation in p to the square root of n is the standard error of the mean. The
confidence interval is derived by taking the sample mean and adding to it and subtracting from it
the standard error of the mean times the t-statistic for the desired confidence level and degrees of
freedom. In this case, since we are interested in a 95% confidence interval we must find the
value of the t-distribution corresponding to a p-value of 0.025, since we have to account for both
the lower and upper tails, and the degrees of freedom are equal to n-1.
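A minimal sketch of this calculation, using the proportions from question 1 and the standard t-table value 2.262 for 9 degrees of freedom:

```python
import math
import statistics

# Sketch: 95% parametric CI for the mean proportion of successes.
props = [0.4, 0.2, 0.3, 0.3, 0.6, 0.4, 0.3, 0.5, 0.2, 0.6]
n = len(props)
p_bar = statistics.mean(props)    # sample mean proportion
s_p = statistics.stdev(props)     # sample standard deviation of p
se = s_p / math.sqrt(n)           # standard error of the mean
t_crit = 2.262                    # t-table value for 0.025 in each tail, df = 9

lower = p_bar - t_crit * se
upper = p_bar + t_crit * se
print(round(lower, 3), round(upper, 3))   # 0.274 0.486
```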
c. The 95% confidence interval constructed above is symmetrical about the sample mean, owing to
the use of the t-distribution which is symmetrical like the normal. Let’s say that you suspect that
the sampling distribution for the test statistic, mean proportion of successes, is not normally
distributed and thus not symmetrical. Briefly, how would you go about constructing a
nonparametric confidence interval that allows for asymmetry?
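One standard answer to (c) is a bootstrap percentile interval, sketched below. Assumptions not in the problem: 10,000 resamples and an arbitrary seed.

```python
import random
import statistics

# Sketch: nonparametric bootstrap percentile interval, which allows an
# asymmetric CI because the endpoints come straight from the resampled
# distribution rather than from a symmetric t multiplier.
random.seed(1)
props = [0.4, 0.2, 0.3, 0.3, 0.6, 0.4, 0.3, 0.5, 0.2, 0.6]

boot_means = []
for _ in range(10_000):
    resample = random.choices(props, k=len(props))  # sample with replacement
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lower = boot_means[int(0.025 * len(boot_means))]   # 2.5th percentile
upper = boot_means[int(0.975 * len(boot_means))]   # 97.5th percentile
print(lower, upper)   # endpoints need not be symmetric about the mean
```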
2. Consider a hypothetical study on density dependence in populations of purple people eaters. Let’s
say that you sample 10 breeding populations and measure breeding population density (#breeding
individuals/area) and fecundity (#young produced/breeding female). You hypothesize that
fecundity will rise at first with density due to social facilitation of breeding (e.g., the likelihood of
finding a mate) but then decline due to increasing competition as the population density increases
further – it is known that purple people eaters become cannibalistic at high densities. The raw data
are given here and displayed in the accompanying scatterplot.
population  density  fecundity
1           0.66     11.45
2           7.95      9.05
3           4.94     12.79
4           5.54     15.57
5           2.44     12.70
6           1.32     17.65
7           7.06      6.29
8           6.44      9.51
9           9.48      5.29
10          0.17      3.03
Let’s say that you want to fit a model in
which fecundity is expected to vary with
density according to a Ricker function, but
that you are unsure or unwilling to assume
a particular error model. Answer the
following questions:
a. Using least squares as your goodness-of-fit criterion, which of the sets of parameters for the
Ricker model shown in the scatterplot provide the best fit? The Ricker function, f(x) = a*x*exp(-b*x),
is given in the figure. Note, you will probably want to use a spreadsheet to do the computations, or you can try
your hand at doing it in R if you are in lab.
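A sketch of the least-squares comparison, assuming the standard Ricker form f(x) = a*x*exp(-b*x). The candidate (a, b) sets below are purely hypothetical stand-ins for the ones shown in the problem's figure:

```python
import math

# Sketch: least-squares goodness of fit for the Ricker model.
density   = [0.66, 7.95, 4.94, 5.54, 2.44, 1.32, 7.06, 6.44, 9.48, 0.17]
fecundity = [11.45, 9.05, 12.79, 15.57, 12.70, 17.65, 6.29, 9.51, 5.29, 3.03]

def ricker(x, a, b):
    return a * x * math.exp(-b * x)

def sse(a, b):
    """Sum of squared errors of the Ricker prediction for one (a, b) pair."""
    return sum((y - ricker(x, a, b)) ** 2 for x, y in zip(density, fecundity))

candidates = [(20, 0.3), (25, 0.4), (30, 0.5)]   # hypothetical parameter sets
best = min(candidates, key=lambda ab: sse(*ab))
print(best, round(sse(*best), 2))
```

The candidate with the smallest SSE is the best fit under the least-squares criterion; substitute the parameter sets from the figure in place of the hypothetical ones.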
b. Let’s say that you want to test whether the above model is significant; i.e., the probability of
observing this data and these parameter estimates solely by chance if in fact there was no
relationship between density and fecundity. You decide to conduct a nonparametric
randomization test of the F-ratio statistic, which is the ratio of the explained to unexplained
variance in fecundity. See the lecture notes on how to calculate the F-ratio, but note that here
the deterministic model is a Ricker function of density and not a linear function of density.
Otherwise, the calculations are the same. Note, as the F-ratio increases, more of the variance is
explained by the model. But how much variance do we need to explain before we can call it a
‘significant’ amount? To answer this question, we can simulate datasets by resampling our
original dataset (to preserve the data domain) but removing any real relationship between density
and fecundity by shuffling one of the variables. For each random permutation of the original
dataset, we use ordinary least squares to estimate the values of our model parameters, a and b –
this represents our best fit of the Ricker model to the randomized data. Then we compute the
F-ratio to determine the ratio of explained to unexplained variance. Note, here it is possible that
the null model (a simple intercept only model) explains more of the variance than the best
Ricker model. If this happens, the computed F-ratio will be negative – make sure you
understand why. Since we only care about how often the Ricker model might do better than the
null model by chance, we will set any negative F-ratio to zero. If we conduct 1,000 permutations
of the dataset and compute the F-ratio for each, we can plot the expected distribution of the
F-ratio under the null hypothesis. This is exactly what we need in order to compute a p-value for
our original observed F-ratio, or to reach a decision as to whether to "reject" or "fail to reject"
our null hypothesis under a Neyman-Pearson decision framework, right? The accompanying figure
shows the empirical cumulative probability distribution for the F-ratio under the null hypothesis
as derived from the randomization procedure. Based on the information provided, do you 'reject' or
'fail to reject' the null hypothesis at an alpha=0.05? Note, to answer this question, you will
need to compute the F-ratio for the original dataset using the best set of parameter values from
(a) above.
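The randomization procedure described above can be sketched as follows. Assumptions not in the problem: a crude grid search stands in for the least-squares Ricker fit, the F-ratio uses 1 numerator and n-2 denominator degrees of freedom, and the seed is arbitrary.

```python
import math
import random

# Sketch: randomization test of the F-ratio for the Ricker model.
random.seed(1)
density   = [0.66, 7.95, 4.94, 5.54, 2.44, 1.32, 7.06, 6.44, 9.48, 0.17]
fecundity = [11.45, 9.05, 12.79, 15.57, 12.70, 17.65, 6.29, 9.51, 5.29, 3.03]
n = len(density)

def best_sse(x, y):
    """Smallest SSE of the Ricker model a*x*exp(-b*x) over a coarse grid."""
    best = float("inf")
    for a in [2 * i for i in range(1, 21)]:          # a in 2, 4, ..., 40
        for b in [0.05 * j for j in range(21)]:      # b in 0, 0.05, ..., 1
            s = sum((yi - a * xi * math.exp(-b * xi)) ** 2
                    for xi, yi in zip(x, y))
            best = min(best, s)
    return best

def f_ratio(x, y):
    y_bar = sum(y) / len(y)
    sse_null = sum((yi - y_bar) ** 2 for yi in y)    # intercept-only model
    sse_model = best_sse(x, y)
    f = (sse_null - sse_model) / (sse_model / (n - 2))
    return max(f, 0.0)        # negative F-ratios are set to zero

f_obs = f_ratio(density, fecundity)

count = 0
n_perm = 1000
for _ in range(n_perm):
    shuffled = random.sample(fecundity, n)   # break any real relationship
    if f_ratio(density, shuffled) >= f_obs:
        count += 1

p_value = count / n_perm   # fraction of null F-ratios at least as extreme
print(round(f_obs, 2), p_value)
```

Rejecting at alpha=0.05 corresponds to p_value < 0.05, i.e., the observed F-ratio falling beyond the 95th percentile of the permutation distribution.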
3. Consider a hypothetical study on the effects of mowing and burning on the control of an invasive
non-native plant that is out-competing a rare native plant. Let’s say that the native plant you are
interested in is a rare annual forb that grows in early successional openings in the forest, but its
success is being jeopardized by an aggressive invasive non-native plant that invades forest openings
and out-competes the native forb for resources. Let’s say that previous surveys have identified 20
forest openings that at least historically contained the rare native plant, albeit at very low numbers.
Let’s say that you are interested in evaluating two alternative management treatments, mowing and
burning, to see if they have a beneficial effect on the native plant’s ability to compete with the
invasive species and whether mowing and burning have a comparable effect. Let’s say that you set
up a manipulative field experiment. First, you conduct a systematic survey of each of the openings
during the peak growing season for the native plant and count the number of individuals on each
plot. You find that the openings contain anywhere from 0 to 5 individuals of the native plant. Next,
you randomly assign the 20 openings (experimental units) to one of the two treatments (mow versus
burn) and implement the treatments as governed by the study design. Next, you resurvey the sites
the following year after treatment and count the number of individuals of the native plant. To
control for the effect that site may have on the response, you compute the difference between the
post-treatment and pre-treatment counts on each site. A positive difference indicates that the count
increased following the treatment; a negative difference indicates that the count decreased following
the treatment. The raw data are given here and displayed in the accompanying barplot:
site  treat  pre  post  diff
1     mow    3    6     3
2     mow    4    8     4
3     mow    1    1     0
4     mow    0    3     3
5     mow    5    7     2
6     mow    3    8     5
7     mow    3    9     6
8     mow    2    3     1
9     mow    5    8     3
10    mow    3    8     5
11    burn   3    2     -1
12    burn   2    1     -1
13    burn   0    0     0
14    burn   2    3     1
15    burn   3    3     0
16    burn   5    9     4
17    burn   5    7     2
18    burn   5    4     -1
19    burn   1    0     -1
20    burn   2    4     2
One of the questions you are interested in is whether the change in count (up or down) differs
between the two treatments. To answer this question, you define the following test statistic D. First,
you compute the median pre-post treatment difference for each of the treatments. Then you
compute D as the difference between these two medians. If D = 0, the treatments have a similar
effect on the native plant. If D deviates from 0, it indicates that the treatments differ in their effects
on the native plant. Moreover, if D>0 it indicates that mowing has a greater positive effect on the
native plant than burning, and vice versa. D=3 for the dataset above. Thus, it would appear that not
only do the treatments differ in their effects, but that mowing has a greater beneficial effect than
burning. The question you need to answer is how large does D need to be before you can say that
the treatment effects significantly differ?
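The stated value D=3 can be checked directly from the diff column of the table above:

```python
import statistics

# Check of the test statistic D: difference in median pre-post change.
mow_diff  = [3, 4, 0, 3, 2, 5, 6, 1, 3, 5]
burn_diff = [-1, -1, 0, 1, 0, 4, 2, -1, -1, 2]

D = statistics.median(mow_diff) - statistics.median(burn_diff)
print(D)   # 3.0
```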
a. How would you use a nonparametric bootstrap to answer this question? Describe the approach
for constructing the bootstrap distribution and how you would use it to establish statistical
significance.
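One possible bootstrap scheme for (a) is sketched below. Assumptions not in the problem: resampling within each treatment group with replacement, 10,000 replicates, and an arbitrary seed; judging significance by whether the 95% percentile interval for D excludes 0 is one common convention.

```python
import random
import statistics

# Sketch: nonparametric bootstrap distribution of D.
random.seed(1)
mow_diff  = [3, 4, 0, 3, 2, 5, 6, 1, 3, 5]
burn_diff = [-1, -1, 0, 1, 0, 4, 2, -1, -1, 2]

boot_D = []
for _ in range(10_000):
    mow_star  = random.choices(mow_diff, k=len(mow_diff))    # resample mow group
    burn_star = random.choices(burn_diff, k=len(burn_diff))  # resample burn group
    boot_D.append(statistics.median(mow_star) - statistics.median(burn_star))

boot_D.sort()
lower = boot_D[int(0.025 * len(boot_D))]   # 2.5th percentile of D
upper = boot_D[int(0.975 * len(boot_D))]   # 97.5th percentile of D
# If this 95% interval excludes 0, call the treatment difference
# significant at alpha = 0.05.
print(lower, upper)
```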
b. How would you use a randomization test to answer this question? Describe the approach for
constructing the random permutation distribution and how you would use it to establish
statistical significance.
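A randomization-test sketch for (b): under the null hypothesis the treatment labels are exchangeable, so we shuffle them and recompute D. Assumptions not in the problem: 10,000 permutations, a two-sided p-value, and an arbitrary seed.

```python
import random
import statistics

# Sketch: randomization (permutation) test of D.
random.seed(1)
mow_diff  = [3, 4, 0, 3, 2, 5, 6, 1, 3, 5]
burn_diff = [-1, -1, 0, 1, 0, 4, 2, -1, -1, 2]
pooled = mow_diff + burn_diff
D_obs = statistics.median(mow_diff) - statistics.median(burn_diff)   # 3.0

count = 0
n_perm = 10_000
for _ in range(n_perm):
    shuffled = random.sample(pooled, len(pooled))   # permute treatment labels
    D_star = (statistics.median(shuffled[:10])      # first 10 play "mow"
              - statistics.median(shuffled[10:]))   # last 10 play "burn"
    if abs(D_star) >= abs(D_obs):
        count += 1

p_value = count / n_perm   # fraction of permuted D at least as extreme
print(p_value)
```

Reject the null hypothesis of no treatment difference if p_value falls below the chosen alpha.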