Biology 300
Lab Exercise # 9
11. RELATIONSHIPS BETWEEN TWO VARIABLES
Prior to this exercise, you have seen only one statistical analysis that works on two
variables: the contingency test. This test has the advantage that it will work with very low
level, categorical data. Unfortunately, because it requires little information and has few
assumptions, it can only provide very basic answers. The contingency test will merely show
whether or not two variables are independent.
This week we will look at two more powerful analyses: simple linear regression and simple
linear correlation. Both of these are parametric tests and have some fairly rigorous
assumptions that must be met. If we can meet these assumptions, correlation and regression
can provide us with detailed information about the relationship between two variables. Of the
two, regression is used to describe a much more powerful relationship, one in which changes
in an independent variable (X) cause changes in a dependent variable (Y). Regression allows
us to predict Y if we know X. On the other hand, correlation is a way to describe two
variables that are associated with each other, that vary together. It has no predictive ability
and does not imply causation.
Warning: carrying out a regression analysis and finding a significant relationship does not
necessarily mean that you have found a cause and effect relationship. Demonstrating a cause
and effect relationship also requires proper experimental design, with appropriate controls to
rule out other possibilities.
Simple Linear Regression
Linear regression is a useful tool for describing a causal or predictive relationship between
two variables using a straight line. The line takes the form Y = α + βX, where X is the
independently measured variable used to predict the dependent variable Y, α is the Y
intercept (the mean value of Y when X is zero), and β is the slope of the line (the amount that
Y changes per unit change in X). Note that the equation of the line you learned in first year
physics is different from this, so forget about that equation for now.
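To make these estimates concrete, here is a minimal Python sketch (not part of the lab program; the data values are hypothetical) showing how the least-squares estimates a and b are computed from a sample:

```python
import numpy as np

# Hypothetical data: X is the independent variable, Y the dependent one.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar
print(f"Y = {a:.3f} + {b:.3f} X")
```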
In lecture, you have learned, in order, how to estimate these parameters, how to test
hypotheses about them, and finally, how to evaluate whether data satisfy the assumptions of
regression and how to remedy problems. These steps are done in reverse order when you use
a computer to analyse data, since calculations are carried out so quickly. The best way to
proceed is to:
(1) Evaluate the assumptions and take remedial action when necessary.
(2) Test hypotheses about parameters only after the assumptions are satisfied.
(3) Concern yourself with the equation of the line and its biological significance only if there
is a statistically significant relationship.
In this lab, we emphasise graphical tools that help you evaluate the assumptions that underlie
regression analysis. These methods rely on visual and statistical inspection of data, and
conclusions that you draw while using them will be subjective. Your goal is to try to make
the data fit the assumptions as closely as possible, and then decide whether the fit is close
enough to proceed with the analysis. Be prepared to try several remedies and to choose the
best among them.
Assumptions of Regression
Linear regression rests on four assumptions. (1) The true relationship between the two
variables is actually a straight line. (2) At any value of X, there is a distribution of possible Y
values whose mean falls on the line. The variance of these Y values is assumed to be the
same at all values of X. (3) The distribution of possible Y values at each X is normally
distributed. (4) Observations are independent of one another. The assumption of independence
can be met only by proper experimental design. The information below suggests a number of
transformations that may work in particular situations. In general, though, a log or ln
transformation is the most likely to correct a problem with one of the assumptions of regression.
1. Linearity - Scatterplots of Y against X are useful for determining whether the relationship
between two variables is linear. By adding the regression line to a scatterplot, you provide a
straight reference line against which deviations from linearity can be compared. You might
also see how well an alternative curve that is not constrained to be linear fits the same data.
Spline fitting is a technique that plots the trend in Y against X without assuming that the
relationship is a straight line. If the spline fit closely mimics the linear regression, then the
assumption of linearity is a safe bet. If not, then something must be done to the data before
regression can be used. When the computer fits a spline to a set of data, it generates a series
of polynomial regressions (curved lines) using subsets of the data, and then smoothes these
curves together to demonstrate the general trends in the data set. If the spline fit tracks
roughly alongside a straight line, and doesn't show a systematic and regular deviation from
the straight line, then it is likely that the relationship between X and Y is linear. When you
carry out a spline fit, you must select a value of lambda, which specifies how straight the
spline will be. A lambda between 1 and 100 will usually allow the spline fit to wander
enough to show whether the relationship appears to be linear. Be prepared to experiment with
different values of lambda.
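For those who want to experiment outside the lab program, here is a rough Python sketch of the same idea using scipy; its smoothing parameter s is only loosely analogous to the lambda described above (larger values force a straighter spline), and the data are hypothetical:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 2 + 0.8 * x + 0.05 * x**2 + rng.normal(0.0, 0.5, 30)  # gently curved data

b, a = np.polyfit(x, y, 1)  # straight-line fit for reference

# Try several smoothing values and see how far the spline wanders
# from the straight line.
for s in (50.0, 5.0, 0.5):
    spline = UnivariateSpline(x, y, s=s)
    max_dev = np.max(np.abs(spline(x) - (a + b * x)))
    print(f"s = {s}: max deviation from the line = {max_dev:.2f}")
```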
Suppose that the relationship between Y and X is not linear. Transformations offer no hope if
the real curve contains either a distinct peak or a distinct valley (for instance a bimodal
distribution). However, they might work if the curve increases uniformly or decreases
uniformly from left to right. Log and square root transformations can be used to linearise
relationships. Try both transformations and use the one that works best. You also have a
choice of transforming either X or Y. Your choice of variable depends upon the shape of the
curve. Transform Y if the data might be linearised by pulling down extremely large values of
Y. Transform X to pull down large values of X (see figure below). In some cases, the
relationship between X and Y will be clearly nonlinear, but transformation pulls down the
large values of a variable so far that the curve is bent in the opposite direction from the
original curve. For example, the curve in the upper left panel of the figure below is bent into
the curve in the upper right panel. This result often occurs when the true relationship between
two variables is a power function (e.g. Y = 3X^1.4), and it can be remedied by transforming
both X and Y.
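To see why transforming both variables works here, take logarithms of both sides of the power function above:

```latex
Y = 3 X^{1.4}
\quad\Longrightarrow\quad
\log Y = \log 3 + 1.4 \log X
```

Plotting log Y against log X therefore gives a straight line with intercept log 3 and slope 1.4.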
2. Equality of variance - It is difficult to test this assumption statistically, because we
generally only have one or two values of Y for any X, preventing reasonable variance
calculations. Instead, we usually rely on graphical analysis by viewing the spread of data
points around the regression line. If the variance is constant, the spread should remain
roughly constant along the line. If the variance changes, the spread will change along the
line, becoming smaller where the variance decreases and larger where it increases. A
simple scatterplot of Y against X together with the fitted regression line is useful for this
purpose, but it has the disadvantage that the viewer must look at the spread around a sloping
line, which is not as easy as viewing the spread above and below a horizontal line. Because
our goal is to observe the spread in Y relative to the predicted values of the line, the effect of
the slope can be removed by first calculating the difference between each Y and its predicted
value Ŷ. This difference (Y - Ŷ) is known as the residual. A scatterplot of residuals against
X should yield a horizontal band of data points with constant height if variance is indeed
constant.
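As a small illustration (with hypothetical data and a hypothetical fitted line), residuals and their plot against X can be produced in Python like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.7, 6.4, 7.6, 10.3, 11.8])
b, a = np.polyfit(x, y, 1)       # fitted regression line

residuals = y - (a + b * x)      # observed Y minus predicted Y-hat

# A horizontal band of points of roughly constant height around zero
# suggests that the variance is constant.
plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual (Y - Y-hat)")
plt.show()
```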
Transformations to equalize variance must be applied to Y. Both the log and square root
transformations are available. Try both to see which does a better job at equalizing variance.
A third transformation is available when analyzing proportions. Proportions represent a
special case because variance is expected to rise as p increases from 0 to 0.5 and then fall as
p increases from 0.5 to 1. The arcsine-square root transformation is appropriate under these
circumstances.
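In Python, for example, the three transformations might look like this (the arrays are hypothetical; the +0.5 shift is only needed when Y can be zero):

```python
import numpy as np

y = np.array([0.5, 2.0, 8.0, 32.0])      # hypothetical measurements
p = np.array([0.10, 0.35, 0.60, 0.90])   # hypothetical proportions

log_y = np.log(y + 0.5)            # natural log (shifted in case Y = 0)
sqrt_y = np.sqrt(y)                # square root
arcsine_p = np.arcsin(np.sqrt(p))  # arcsine-square root, for proportions
```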
3. Normality - The assumption is that Y is normally distributed about the predicted value of
Y at any value of X. This assumption is the least important of the assumptions of regression
analysis, and we will not discuss it further. Note that most transformations of Y that linearise
data or control variance also tend to normalize data. Use large sample sizes, and don't worry
about this assumption if you've got the others down.
Testing the Significance of a Regression Relationship
Generally, regression analysis is only done with parametric data. There are some fairly
quantitative non-parametric approaches, but they are not widely used. Since we are trying to
demonstrate a very rigorous relationship (causal), we need rigorous experimental designs and
analyses. We need parametric data.
There are two main ways to test for the significance of a regression relationship. The simplest
(at least by hand) is the F test, which just tells us if the slope of the line is significantly
different from zero. A zero slope or flat line means that no matter how X changes, Y is
unaffected. Therefore, a significant slope indicates that changes in X are associated with changes in Y (and, given proper experimental design, that X causes those changes). To do an F
test we will carry out an ANOVA in which we act as if each value of X represents a different
sample or treatment for Y. If the mean Y values for each X are the same, then the line is flat
and there is no relationship. The calculations of the intermediate components of the
regression ANOVA are slightly different than for a typical ANOVA, simply because we
usually have only one or two values of Y for any X. The theory behind the analysis is
identical, however. As usual, the ANOVA is a two-tailed test, but we calculate a one-tailed F,
since we assume that the regression variance is bigger than the residual variance.
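A minimal Python sketch of this regression ANOVA (the lab program produces the equivalent table automatically; the data here are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.7, 6.4, 7.6, 10.3, 11.8])
n = len(x)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Partition the total sum of squares into regression and residual parts.
ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # df = 1
ss_res = np.sum((y - y_hat) ** 2)          # df = n - 2

F = (ss_reg / 1) / (ss_res / (n - 2))
p = stats.f.sf(F, 1, n - 2)                # one-tailed F probability
print(f"F = {F:.2f}, p = {p:.4f}")
```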
If we want to test for a positive slope, a negative slope, or a slope different from some specific
non-zero value, we must use a different approach. A t test of the significance of β allows us to
carry out one-tailed or two-tailed tests against any hypothesized slope.
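In symbols, the t test of the slope takes the usual form of an estimate minus its hypothesized value, divided by the standard error of the estimate, with n - 2 degrees of freedom:

```latex
t = \frac{b - \beta_0}{s_b},
\qquad
s_b = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_i (X_i - \bar{X})^2}}
```

Here β0 is the hypothesized slope (zero for the ordinary test of significance).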
Describing the Regression Relationship
If we find that there is a significant relationship, then we usually would like to describe the
relationship. The equation of the line is our best description of the relationship between X
and Y. We can use this equation to predict values of Y from X, and for fixed effects
regression models we have a limited ability to predict X from Y (inverse prediction). We
must be careful not to extrapolate beyond the range of our data, however. Prediction from
regression is only valid within the range of our data, since this is the only area where we have
tested our assumptions (e.g. linearity). If we have transformed one or both of our variables
we must include that transform when we report the equation of the line. If we had
transformed Y to ln (Y + 0.5) to solve a variance problem we would report the equation as
ln (Y + 0.5) = a + bX.
Another useful tool for describing the relationship is r², the coefficient of determination. This
value tells us the strength of the relationship, or what proportion of the variation in Y can be
predicted from variation in X. r² varies between 0 (no predictive power) and 1 (the equation of
the line explains all of the variation in Y).
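In terms of the sums of squares from the regression ANOVA, r² is simply:

```latex
r^2 = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}
    = 1 - \frac{\mathrm{SS}_{\mathrm{residual}}}{\mathrm{SS}_{\mathrm{total}}}
```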
Finally, we can produce confidence intervals for our regression line. These intervals give us
an indication of the predictive accuracy of our line. The intervals are tightest at mean X and
mean Y since our equations force the line to go through this point. As we move further away
from the mean pair of values, our uncertainty about the position of the line increases and our
intervals become larger. These intervals are produced by summing our uncertainty about the
intercept and our uncertainty about the slope.
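The standard formula for the confidence interval for the mean of Y at a chosen value X0 makes this behaviour explicit: the (X0 - X̄)² term grows as X0 moves away from the mean of X, so the bands flare outwards at the ends of the line:

```latex
\hat{Y}_0 \pm t_{\alpha(2),\,n-2}
\sqrt{\mathrm{MS}_{\mathrm{residual}}
\left( \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right)}
```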
Simple Linear Correlation
In correlation analysis, values of X are not preselected by the investigator, but occur at
random. Here we are not trying to demonstrate a causal or predictive relationship. We are
merely trying to see if two variables are associated or vary together. Thus, correlation does
not designate one variable as dependent and the other as independent.
An unfortunate consequence of this symmetry is that correlation has many assumptions:
1. For each value of Y there is a normally distributed population of X values.
2. For each value of X there is a normally distributed population of Y values.
3. For each value of Y there is equal variance in the population of X values.
4. For each value of X there is equal variance in the population of Y values.
5. Both Y and X values were selected randomly and independently.
6. The relationship between X and Y is linear.
It is hard to meet these assumptions, and it can be even more difficult to demonstrate that we
have met them. We are usually satisfied if the trend in our data set is a symmetric,
oval cloud of data points whose slope angles upwards or downwards.
In a correlation analysis, the strength of the relationship between two variables is indicated
by r, the correlation coefficient. This value is mathematically equivalent to the square root of
r² in regression analysis. It should not be treated as interchangeable, however, because of the
theoretical differences underlying correlation and regression. The correlation coefficient may
take on values from -1 (negative linear correlation) to +1 (positive linear correlation). A
value of 0 indicates no linear correlation. We can also produce confidence intervals on r. If r
is near -1 or 1, however, the intervals will not be symmetric, and a normal distribution is not
adequate to describe the sampling distribution of r.
The significance of a correlation is tested via a t test of r. As usual, the t statistic is a
difference divided by the standard error of that difference. In this case, t equals the
difference between our calculated r and any hypothetical value of rho (the parameter for the
relationship between X and Y), divided by the standard error in r.
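For the usual case of testing against ρ = 0, this works out to the following, with n - 2 degrees of freedom:

```latex
t = \frac{r - 0}{s_r},
\qquad
s_r = \sqrt{\frac{1 - r^2}{n - 2}}
```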
Non-parametric correlation is fairly common since the assumptions of correlation are
sweeping and difficult to test. Generally, we can examine the scatterplots to find evidence of
a non-linear relationship or violations of the assumption of equal variances among the
samples. Spearman's and Kendall's rank correlations are the most common non-parametric
equivalents to correlation. They have similar power and assumptions, and one of the two
should be used if the assumptions of parametric correlation do not seem to be met and
sample sizes are small.
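Outside the lab program, both rank correlations (and Pearson's r for comparison) are available in scipy; the data below are hypothetical:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.2, 8.9])
y = np.array([0.9, 2.1, 2.8, 5.2, 4.7, 6.9, 7.5, 8.1])

r, p_r = stats.pearsonr(x, y)          # parametric correlation
rho, p_rho = stats.spearmanr(x, y)     # Spearman's rank correlation
tau, p_tau = stats.kendalltau(x, y)    # Kendall's rank correlation
print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
print(f"Kendall tau = {tau:.3f} (p = {p_tau:.4f})")
```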
Using the Program
Beginning a regression analysis - To carry out either regression or correlation analysis we
need two continuous variables. To carry out a regression, choose fit Y by X and designate
the causative or predictive variable as X. Your Y variable should be the variable that is the
effect, which is being predicted. The computer will display a scatterplot of the data. It may be
possible to see trends in the data just by examining this plot. If you click on the fitting
button, you will be presented with a number of options. First, choose fit line. This will carry
out a simple linear regression. We won't worry about the summary tables for this line just
yet. For now, we want the regression line for its use in determining if this association is
linear. Once we have a fitted line displayed, clicking on the reveal button (the arrow-marked button) for the particular line will allow us to plot confidence intervals for that line.
Checking linearity - To test if the data are linear, carry out a spline fit from the same fitting
dialog box, with a lambda of about 1 or 10. Go higher or lower if these values show too
little or too much wobble and remember that we are just trying to see if the general trend of
the line is straight. This curved line should track more or less alongside the linear fit. We will
worry if there is a strong, systematic trend away from the straight line. The diagrams above
show us how we can correct some of the more common problems in linearity. The summary
table produced when we carry out a spline fit shows the r² value and sum of squares error for
the spline curve. These can be compared to the linear values to see whether there is much
improvement when we have a non-linear relationship. If a straight line is a good description
of the relationship between X and Y, the spline values will be similar to the linear values.
Use the spline r² only for this purpose, though. Since r² is an indicator of how tightly values
cluster around our line, we can increase r² to near 1 if we let the line wobble enough.
Checking variance - To test if the variances in Y are equal, we can examine a residual plot. If
the residuals form a fairly symmetric band across the graph, then our variances are roughly
equal. If the scatter of points around the residual line looks like the spray from the end of a
hose (variance in Y increases as X increases), we need to transform Y. In fact, any shape
other than a band of even width suggests we should transform Y. To call up a residual plot,
click on the reveal button for our linear fit.
Transformations - Generally, log (common or natural) transforms are the safest bet for data
sets which violate the assumptions of regression. There are two ways to transform data for a
regression analysis. The simplest way is to select fit transformed from the analysis dialog
box. This calls up another menu, where we can choose an appropriate transform for X or Y.
Unfortunately, this method of transforming will not adjust the axis scales on the scatterplot
and our assorted lines of fit. The axis scale on the residual plot will
also stay the same. The result is that it is difficult to see if our transformation improved our
ability to meet assumptions.
The second and better method of transforming your data is to create new columns of data by
applying a transform formula to our raw data. This is the same method we used in last
week's ANOVA lab. This method will produce new scales for the axes of our scatterplots and
residual plots, letting us judge whether the transform worked.
Regression Statistics - When we carry out a linear fit, the program produces summary tables
showing the ANOVA results, the equation of the line, the parameter estimates (a and b),
the standard errors for our parameters, t tests for both the intercept and the slope, and the
value for r².
In the table of parameter estimates, our estimate of the intercept is a, and our estimate for the
coefficient of X is b (listed as the estimate of X). The t statistic and p value for b constitute a t test
of the slope. The probability shown is a two-tailed probability, so you will have to adjust the p
value if you want to carry out a one-tailed test. The appropriate p value will either be one half
of the two-tailed value or 1 minus one half of the two-tailed value. You must work out which
is appropriate, based on your hypotheses. Clicking on the reveal button for the parameter
estimates table will provide a dialog box that lets you generate confidence intervals for the
estimates.
Correlation - To carry out a correlation, we can fit Y by X, designating either variable as X
or Y (they are interchangeable here since we are not designating a dependent and an
independent variable). Click on the fitting button from the resulting scatterplot and choose
density ellipses. The probability we choose for the density ellipse specifies a region within
which we expect that proportion of paired values from a bivariate normal population (one that
is normally distributed in both X and Y) to fall. The probability that we choose will
not affect the statistics for the correlation. The summary statistics appear in the bivariate
table. The values displayed here include r, and the probability of obtaining r (the p value
from a test of the significance of r).
A second way to carry out a correlation is to choose analyze, then correlation of Y's. Add
the two variables you want to check for a correlation. This will produce a report including r,
the correlation coefficient. Clicking on the check menu in the lower left of the box lets us do
pairwise correlations, whose summary table includes r and a significance test of r. These
values are identical to those produced by the density ellipse approach. Clicking on the check
menu again, we can call up correlations - nonpar, which will conduct a non-parametric rank
correlation analysis. The resulting summary table displays Kendall's and Spearman's
test statistics and p values. Clicking on the reveal button for this table lets us see the values
from a third non-parametric correlation test, the Hoeffding test, which is less frequently used
than the first two tests.
Problems
1. The original description of regression effects was from a study of parent heights and child
heights conducted by Sir Francis Galton in 1877. The data from this study are stored in the file
Galton.jmp in the shared directory. In this study parent height (X) is the predictor for child
height (Y).
a) Is a straight line adequate to describe the relationship between parent height and child
height?
b) Does variance in child height remain constant across all values of parent height?
c) Test the hypothesis that parent height can be used to predict child height.
d) Why do the confidence bands for the regression line curve outwards at their ends?
2. The nitrogen content of leaves (mmol/m²) is thought to be a predictor of photosynthetic
rate (mmol/m²/s) and productivity of agricultural crops. Both of these variables were
measured and stored in a file named NITRO.
a) Examine the data and decide if the assumptions of linearity and equal variance are met
for this data set.
b) Suppose there is a problem with the assumption of equal variance. Transform the
appropriate variable (by the method of adding a new column) and compare the residual
plots to see if there is any improvement. Can we see this improvement if we transform
the variable through the fit transform method?
c) Is it possible to predict photosynthetic rate based on the nitrogen content of leaves? If
so, describe the relationship algebraically.
d) How well does the relationship predict the photosynthetic rate?
3. Crows often steal food from birds of other species. To test whether potential victims are
sensitive to the presence of potential robbers, an experiment was conducted in which a model
crow was placed near the nests of pigeon guillemots. Observers recorded the amount of time
that arriving guillemots sat and waited before delivering food to their hungry nestlings. A
negative relationship between time waited and distance of the model from the nest would
suggest that would-be victims respond to the risk of robbery. Data from the experiment are
stored in the file ROBBER. Variable #1 is distance of the model crow from guillemot
nests (m), and variable #2 is amount of time that guillemots paused (min) before returning to
their nests.
a) Is a transformation required to linearize the data?
b) Is the assumption of equality of variances met in this example?
c) Test the hypothesis that pigeon guillemots are reluctant to approach their nests when a
potential robber is close at hand.
4. Brain weight (g) and testes weights (mg) were measured in 8 male chipmunks (Tamias
spp.). The data are stored in a file named CHIPMUNK.
a) Examine and comment on a scatter plot of the data.
b) Is there a relationship between brain weight and testes weight in these chipmunks?
Show all steps taken in testing the null hypothesis.
c) Try testing the significance of the correlation using whichever method you didn't use
in part b. Are the results the same?
d) Suppose a significant relationship was found between the variables measured in this
study. Is it possible to predict brain weight from measurements of testes weight in these
chipmunks?
e) What is the value of the correlation coefficient in this analysis? Do you feel that the
correlation coefficient indicates a strong relationship between the variables measured in
this study?
f) Suppose the bivariate normality assumption of correlation analysis is not met in this
study with its small sample size. What test would you use to determine the strength of the
relationship between the variables measured? Try it (show all steps taken in testing the
null hypothesis). Do you reach the same conclusion as before? If not, which conclusion
would you agree with?