Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Analysis of Environmental Data Problem Set Conceptual Foundations: No n p aram e tric In fe re n c e : O rd in ary Le as t Sq u are s & Mo re Answer all three questions: 1. Consider a hypothetical study of wood frog larval abundance in vernal pools. Let’s say that you sample 10 vernal pools by taking 10 dip net sweeps of the water column in random locations throughout each pool and record the presence/absence of wood frog tadpoles in each sweep. The raw data are given here as the number of successful sweeps for each of the 10 pools, and displayed in the accompanying histogram: 4233643526 Let’s say that you are willing to assume that the probability of tadpole capture (prob) is the same for every sweep. Answer the following questions: a. Without having to explicitly assume a probability distribution for this data, what is the ordinary least squares estimate of the average per trial probability of success (prob)? Alternatively, can you show that the sample mean proportion of successes is in fact the estimate that results in the minimum sums of squares error? b. Construct a 95% parametric confidence interval for the mean above, based on the following equation: Note, sp is the standard deviation in p (proportion of successes), n is the sample size, and the ratio of the standard deviation in p to the square root of n is the standard error of the mean. The confidence interval is derived by taking the sample mean and adding to it and subtracting from it the standard error of the mean times the t-statistic for the desired confidence level and degrees of freedom. In this case, since we are interested in a 95% confidence interval we must find the value of the t-distribution corresponding a p-value of 0.025, since we have to account for both the lower and upper interval, and the degrees of freedom are equal to n-1. c. The 95% confidence interval constructed above is symmetrical about the sample mean, owing to the use of the t-distribution which is symmetrical like the normal. Let’s say that you suspect that the sampling distribution for the test statistic, mean proportion of successes, is not normally Nonparametric Inference: Problem Set 2 distributed and thus not symmetrical. Briefly, how would you go about constructing a nonparametric confidence interval that allows for asymmetry? 2. Consider a hypothetical study on density dependence in populations of purple people eaters. Let’s say that you sample 10 breeding populations and measure breeding population density (#breeding individuals/area) and fecundity (#young produced/breeding female). You hypothesize that fecundity will rise at first with density due to social facilitation of breeding (e.g., the likelihood of finding a mate) but then decline due to increasing competition as the population density increases further – it is known that purple people eaters become canabalistic at high densities. The raw data are given here and displayed in the accompanying scatterplot. population density fecundity 1 0.66 11.45 2 7.95 9.05 3 4.94 12.79 4 5.54 15.57 5 2.44 12.70 6 1.32 17.65 7 7.06 6.29 8 6.44 9.51 9 9.48 5.29 10 0.17 3.03 Let’s say that you want to fit a model in which fecundity is expected to vary with density according to a Ricker function, but that you are unsure or unwilling to assume a particular error model. Answer the following questions: a. Using least squares as your goodness-of-fit criterion, which of the sets of parameters for the Ricker model shown in the scatterplot provide the best fit? The Ricker function is given in the figure. Note, you will probably want to use a spreadsheet to do the computations, or you can try your hand at doing it in R if you are in lab. b. Let’s say that you want to test whether the above model is significant; i.e., the probability of observing this data and these parameter estimates solely by chance if in fact there was no relationship between density and fecundity. You decide to conduct a nonparametric randomization test of the F-ratio statistic, which is the ratio of the explained to unexplained variance in fecundity. See the lecture notes on how to calculate the F-ratio, but note that here the deterministic model is a Ricker function of density and not a linear function of density. Otherwise, the calculations are the same. Note, as the F-ratio increases, more of the variance is explained by the model. But how much variance do we need to explain before we can call it a ‘significant’ amount? To answer this question, we can simulate datasets by resampling our original dataset (to preserve the data domain) but removing any real relationship between density and fecundity by shuffling one of the variables. For each random permutation of the original Nonparametric Inference: Problem Set 3 dataset, we use ordinary least squares to estimate the values of our model parameters, a and b – this represents our best fit of the Ricker model to the randomized data. Then we compute the Fratio to determine the ratio of explained to unexplained variance. Note, here it is possible that the null model (a simple intercept only model) explains more of the variance than the best Ricker model. If this happens, the computed F-ratio will be negative – make sure you understand why. Since we only care about how often the Ricker model might do better than the null model by chance, we will set any negative F-ratio to zero. If we conduct 1,000 permutations of the dataset and compute the F-ratio for each, we can plot the expected distribution of the Fratio under the null hypothesis. This is exactly what we need in order to compute a p-value for our original observed F-ratio, or to reach a decision as to whether to “reject” or “fail to reject” our null hypothesis under a Neyman-Pearson decision framework, right ? The accompanying figure shows the empirical cumulative probability distribution for the F-ratio under the null hypotheses as derived from the randomization procedure. Based on the information provided, do you ‘reject’ or ‘fail to reject’ the null hypothesis at an alpha=0.05? Note, to answer this question, you will need to compute the F-ratio for the original dataset using the best set of parameter values from (a) above. 3. Consider a hypothetical study on the effects on mowing and burning on the control of an invasive non-native plant that is out-competing a rare native plant. Let’s say that the native plant you are interested in is a rare annual forb that grows in early successional openings in the forest, but its success is being jeopardized by an aggressive invasive non-native plant that invades forest openings and out-competes the native forb for resources. Let’s say that previous surveys have identified 20 forest openings that at least historically contained the rare native plant, albeit at very low numbers. Let’s say that you are interested in evaluating two alternative management treatments, mowing and burning, to see if they have a beneficial effect on the native plant’s ability to compete with the invasive species and whether mowing and burning have a comparable effect. Let’s say that you set up a manipulative field experiment. First, you conduct a systematic survey of each of the openings during the peak growing season for the native plant and count the number of individuals on each plot. You find that the openings contain anywhere from 0 to 5 individuals of the native plant. Next, you randomly assign the 20 openings (experimental units) to one of the two treatments (mow versus burn) and implement the treatments as governed by the study design. Next, you resurvey the sites the following year after treatment and count the number of individuals of the native plant. To control for the effect that site may have on the response, you compute the difference between the post-treatment and pre-treatment counts on each site. A positive difference indicates that the count Nonparametric Inference: Problem Set 4 increased following the treatment; a negative difference indicates that the count decreased following the treatment. The raw data are given here and displayed in the accompanying barplot:: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 treat pre post diff mow 3 6 3 mow 4 8 4 mow 1 1 0 mow 0 3 3 mow 5 7 2 mow 3 8 5 mow 3 9 6 mow 2 3 1 mow 5 8 3 mow 3 8 5 burn 3 2 -1 burn 2 1 -1 burn 0 0 0 burn 2 3 1 burn 3 3 0 burn 5 9 4 burn 5 7 2 burn 5 4 -1 burn 1 0 -1 burn 2 4 2 One of the questions you are interested in is whether the change in count (up or down) differs between the two treatments. To answer this question, you define the following test statistic D. First, you compute the median pre-post treatment difference for each of the treatments. Then you compute D as the difference between these two medians. If D60, the treatments have a similar effect on the native plant. If D deviates from 0, it indicates that the treatments differ in their effects on the native plant. Moreover, if D>0 it indicates that mowing has a greater positive effect on the native plant than burning, and vice versa. D=3 for the dataset above. Thus, it would appear that not only do the treatments differ in their effects, but that mowing has a greater beneficial effect than burning. The question you need to answer is how large does D need to be before you can say that the treatment effects significantly differ? a. How would you use a nonparametric bootstrap to answer this question? Describe the approach for constructing the bootstrap distribution and how you would use it to establish statistical significance. b. How would you use a randomization test to answer this question? Describe the approach for constructing the random permutation distribution and how you would use it to establish statistical significance.