Biology 300
Lab Exercise # 9
11. RELATIONSHIPS BETWEEN TWO VARIABLES
Prior to this exercise, you have seen only one statistical analysis that works on two
variables: the contingency test. This test has the advantage that it will work with very low
level, categorical data. Unfortunately, because it requires little information and has few
assumptions, it can only provide very basic answers. The contingency test will merely show
whether or not two variables are independent.
This week we will look at two more powerful analyses: simple linear regression and simple
linear correlation. Both of these are parametric tests and have some fairly rigorous
assumptions that must be met. If we can meet these assumptions, correlation and regression
can provide us with detailed information about the relationship between two variables. Of the
two, regression is used to describe a much more powerful relationship, one in which changes
in an independent variable (X) cause changes in a dependent variable (Y). Regression allows
us to predict Y if we know X. On the other hand, correlation is a way to describe two
variables that are associated with each other, that vary together. It has no predictive ability
and does not imply causation.
Warning: carrying out a regression analysis and finding a significant relationship does not
necessarily mean that you have found a cause and effect relationship. Demonstrating a cause
and effect relationship also requires proper experimental design, with appropriate controls to
rule out other possibilities.
Simple Linear Regression
Linear regression is a useful tool for describing a causal or predictive relationship between
two variables using a straight line. The line takes the form Y = α + βX, where X is the
independently measured variable used to predict the dependent variable Y, α is the Y
intercept (the mean value of Y when X is zero), and β is the slope of the line (the amount that
Y changes per unit change in X). Note that the equation of the line you learned in first year
physics is different from this, so forget about that equation for now.
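To make these estimates concrete, here is a minimal Python sketch (not part of the lab program; the data values are hypothetical) showing how the least-squares estimates a and b are computed from a sample:

```python
import numpy as np

# Hypothetical data: X is the independent variable, Y the dependent one.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar
print(f"Y = {a:.3f} + {b:.3f} X")
```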
In lecture, you have learned, in order, how to estimate these parameters, how to test
hypotheses about them, and finally, how to evaluate whether data satisfy the assumptions of
regression and how to remedy problems. These steps are done in reverse order when you use
a computer to analyse data, since calculations are carried out so quickly. The best way to
proceed is to:
(1) Evaluate the assumptions and take remedial action when necessary.
(2) Test hypotheses about parameters only after the assumptions are satisfied.
(3) Concern yourself with the equation of the line and its biological significance only if there
is a statistically significant relationship.
In this lab, we emphasise graphical tools that help you evaluate the assumptions that underlie
regression analysis. These methods rely on visual and statistical inspection of data, and
conclusions that you draw while using them will be subjective. Your goal is to try to make
the data fit the assumptions as closely as possible, and then decide whether the fit is close
enough to proceed with the analysis. Be prepared to try several remedies and to choose the
best among them.
Assumptions of Regression
Linear regression rests on four assumptions. (1) The true relationship between the two
variables is actually a straight line. (2) At any value of X, there is a distribution of possible Y
values whose mean falls on the line. The variance of these Y values is assumed to be the
same at all values of X. (3) The distribution of possible Y values at each X is normally
distributed. (4) Observations are independent of one another. The assumption of independence
can be met only by proper experimental design. The information below suggests a number of
transformations that may work in particular situations. In general, though, a log or ln
transformation is the most likely to correct a problem with one of the assumptions of regression.
1. Linearity - Scatterplots of Y against X are useful for determining whether the relationship
between two variables is linear. By adding the regression line to a scatterplot, you provide a
straight reference line against which deviations from linearity can be compared. You might
also see how well an alternative curve that is not constrained to be linear fits the same data.
Spline fitting is a technique that plots the trend in Y against X without assuming that the
relationship is a straight line. If the spline fit closely mimics the linear regression, then the
assumption of linearity is a safe bet. If not, then something must be done to the data before
regression can be used. When the computer fits a spline to a set of data, it generates a series
of polynomial regressions (curved lines) using subsets of the data, and then smoothes these
curves together to demonstrate the general trends in the data set. If the spline fit tracks
roughly alongside a straight line, and doesn't show a systematic and regular deviation from
the straight line, then it is likely that the relationship between X and Y is linear. When you
carry out a spline fit, you must select a value of lambda, which specifies how straight the
spline will be. A lambda between 1 and 100 will usually allow the spline fit to wander
enough to show whether the relationship appears to be linear. Be prepared to experiment with
different values of lambda.
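For those who want to experiment outside the lab program, here is a rough Python sketch of the same idea using scipy; its smoothing parameter s is only loosely analogous to the lambda described above (larger values force a straighter spline), and the data are hypothetical:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 2 + 0.8 * x + 0.05 * x**2 + rng.normal(0.0, 0.5, 30)  # gently curved data

b, a = np.polyfit(x, y, 1)  # straight-line fit for reference

# Try several smoothing values and see how far the spline wanders
# from the straight line.
for s in (50.0, 5.0, 0.5):
    spline = UnivariateSpline(x, y, s=s)
    max_dev = np.max(np.abs(spline(x) - (a + b * x)))
    print(f"s = {s}: max deviation from the line = {max_dev:.2f}")
```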
Suppose that the relationship between Y and X is not linear. Transformations offer no hope if
the real curve contains either a distinct peak or a distinct valley (for instance a bimodal
distribution). However, they might work if the curve increases uniformly or decreases
uniformly from left to right. Log and square root transformations can be used to linearise
relationships. Try both transformations and use the one that works best. You also have a
choice of transforming either X or Y. Your choice of variable depends upon the shape of the
curve. Transform Y if the data might be linearised by pulling down extremely large values of
Y. Transform X to pull down large values of X (see figure below). In some cases, the
relationship between X and Y will be clearly nonlinear, but transformation pulls down the
large values of a variable so far that the curve is bent in the opposite direction from the
original curve. For example, the curve in the upper left panel of the figure below is bent into
the curve in the upper right panel. This result often occurs when the true relationship between
two variables is a power function (e.g. Y = 3X^1.4), and it can be remedied by transforming
both X and Y.
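To see why transforming both variables works here, take logarithms of both sides of the power function above:

```latex
Y = 3 X^{1.4}
\quad\Longrightarrow\quad
\log Y = \log 3 + 1.4 \log X
```

Plotting log Y against log X therefore gives a straight line with intercept log 3 and slope 1.4.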
2. Equality of variance - It is difficult to test this assumption statistically, because we
generally only have one or two values of Y for any X, preventing reasonable variance
calculations. Instead, we usually rely on graphical analysis by viewing the spread of data
points around the regression line. If the variance is constant, the spread should remain
roughly constant along the line. If the variance changes, the spread will change along the
line, becoming smaller where the variance decreases and larger where it increases. A
simple scatterplot of Y against X together with the fitted regression line is useful for this
purpose, but it has the disadvantage that the viewer must look at the spread around a sloping
line, which is not as easy as viewing the spread above and below a horizontal line. Because
our goal is to observe the spread in Y relative to the predicted values of the line, the effect of
the slope can be removed by first calculating the difference between each Y and its predicted
value Ŷ. This difference (Y - Ŷ) is known as the residual. A scatterplot of residuals against
X should yield a horizontal band of data points with constant height if variance is indeed
constant.
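As a small illustration (with hypothetical data and a hypothetical fitted line), residuals and their plot against X can be produced in Python like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.7, 6.4, 7.6, 10.3, 11.8])
b, a = np.polyfit(x, y, 1)       # fitted regression line

residuals = y - (a + b * x)      # observed Y minus predicted Y-hat

# A horizontal band of points of roughly constant height around zero
# suggests that the variance is constant.
plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual (Y - Y-hat)")
plt.show()
```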
Transformations to equalize variance must be applied to Y. Both the log and square root
transformations are available. Try both to see which does a better job at equalizing variance.
A third transformation is available when analyzing proportions. Proportions represent a
special case because variance is expected to rise as p increases from 0 to 0.5 and then fall as
p increases from 0.5 to 1. The arcsine-square root transformation is appropriate under these
circumstances.
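In Python, for example, the three transformations might look like this (the arrays are hypothetical; the +0.5 shift is only needed when Y can be zero):

```python
import numpy as np

y = np.array([0.5, 2.0, 8.0, 32.0])      # hypothetical measurements
p = np.array([0.10, 0.35, 0.60, 0.90])   # hypothetical proportions

log_y = np.log(y + 0.5)            # natural log (shifted in case Y = 0)
sqrt_y = np.sqrt(y)                # square root
arcsine_p = np.arcsin(np.sqrt(p))  # arcsine-square root, for proportions
```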
3. Normality - The assumption is that Y is normally distributed about the predicted value of
Y at any value of X. This assumption is the least important of the assumptions of regression
analysis, and we will not discuss it further. Note that most transformations of Y that linearise
data or control variance also tend to normalize data. Use large sample sizes, and don't worry
about this assumption if you've got the others down.
Testing the Significance of a Regression Relationship
Generally, regression analysis is only done with parametric data. There are some fairly
quantitative non-parametric approaches, but they are not widely used. Since we are trying to
demonstrate a very rigorous relationship (causal), we need rigorous experimental designs and
analyses. We need parametric data.
There are two main ways to test for the significance of a regression relationship. The simplest
(at least by hand) is the F test, which just tells us if the slope of the line is significantly
different from zero. A zero slope or flat line means that no matter how X changes, Y is
unaffected. Therefore, a significant slope indicates that changes in X are associated with changes in Y (and, given proper experimental design, that X causes those changes). To do an F
test we will carry out an ANOVA in which we act as if each value of X represents a different
sample or treatment for Y. If the mean Y values for each X are the same, then the line is flat
and there is no relationship. The calculations of the intermediate components of the
regression ANOVA are slightly different than for a typical ANOVA, simply because we
usually have only one or two values of Y for any X. The theory behind the analysis is
identical, however. As usual, the ANOVA is a two-tailed test, but we calculate a one-tailed F,
since we assume that the regression variance is bigger than the residual variance.
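A minimal Python sketch of this regression ANOVA (the lab program produces the equivalent table automatically; the data here are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.7, 6.4, 7.6, 10.3, 11.8])
n = len(x)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Partition the total sum of squares into regression and residual parts.
ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # df = 1
ss_res = np.sum((y - y_hat) ** 2)          # df = n - 2

F = (ss_reg / 1) / (ss_res / (n - 2))
p = stats.f.sf(F, 1, n - 2)                # one-tailed F probability
print(f"F = {F:.2f}, p = {p:.4f}")
```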
If we want to test for a positive slope, a negative slope, or a slope different from some specific
non-zero value, we must use a different approach. A t test of the significance of β allows us to
carry out one-tailed or two-tailed tests against any hypothesized slope.
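In symbols, the t test of the slope takes the usual form of an estimate minus its hypothesized value, divided by the standard error of the estimate, with n - 2 degrees of freedom:

```latex
t = \frac{b - \beta_0}{s_b},
\qquad
s_b = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_i (X_i - \bar{X})^2}}
```

Here β0 is the hypothesized slope (zero for the ordinary test of significance).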
Describing the Regression Relationship
If we find that there is a significant relationship, then we usually would like to describe the
relationship. The equation of the line is our best description of the relationship between X
and Y. We can use this equation to predict values of Y from X, and for fixed effects
regression models we have a limited ability to predict X from Y (inverse prediction). We
must be careful not to extrapolate beyond the range of our data, however. Prediction from
regression is only valid within the range of our data, since this is the only area where we have
tested our assumptions (e.g. linearity). If we have transformed one or both of our variables
we must include that transform when we report the equation of the line. If we had
transformed Y to ln (Y + 0.5) to solve a variance problem we would report the equation as
ln (Y + 0.5) = a + bX.
Another useful tool for describing the relationship is r², the coefficient of determination. This
value tells us the strength of the relationship, or what proportion of the variation in Y can be
predicted from variation in X. r² varies between 0 (no predictive power) and 1 (the equation of
the line explains all of the variation in Y).
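In terms of the sums of squares from the regression ANOVA, r² is simply:

```latex
r^2 = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}
    = 1 - \frac{\mathrm{SS}_{\mathrm{residual}}}{\mathrm{SS}_{\mathrm{total}}}
```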
Finally, we can produce confidence intervals for our regression line. These intervals give us
an indication of the predictive accuracy of our line. The intervals are tightest at mean X and
mean Y since our equations force the line to go through this point. As we move further away
from the mean pair of values, our uncertainty about the position of the line increases and our
intervals become larger. These intervals are produced by summing our uncertainty about the
intercept and our uncertainty about the slope.
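The standard formula for the confidence interval for the mean of Y at a chosen value X0 makes this behaviour explicit: the (X0 - X̄)² term grows as X0 moves away from the mean of X, so the bands flare outwards at the ends of the line:

```latex
\hat{Y}_0 \pm t_{\alpha(2),\,n-2}
\sqrt{\mathrm{MS}_{\mathrm{residual}}
\left( \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right)}
```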
Simple Linear Correlation
In correlation analysis, values of X are not preselected by the investigator, but occur at
random. Here we are not trying to demonstrate a causal or predictive relationship. We are
merely trying to see if two variables are associated or vary together. Thus, correlation does
not designate one variable as dependent and the other as independent.
An unfortunate consequence of this symmetry is that correlation has many assumptions:
1. For each value of Y there is a normally distributed population of X values.
2. For each value of X there is a normally distributed population of Y values.
3. For each value of Y there is equal variance in the population of X values.
4. For each value of X there is equal variance in the population of Y values.
5. Both Y and X values were selected randomly and independently.
6. The relationship between X and Y is linear.
It is hard to meet these assumptions, and it can be even more difficult to demonstrate that we
have met them. We are usually satisfied if the trend in our data set is a symmetric,
oval cloud of data points whose slope angles upwards or downwards.
In a correlation analysis, the strength of the relationship between two variables is indicated
by r, the correlation coefficient. This value is mathematically equivalent to the square root of
r² in regression analysis. It should not be treated as interchangeable, however, because of the
theoretical differences underlying correlation and regression. The correlation coefficient may
take on values from -1 (negative linear correlation) to +1 (positive linear correlation). A
value of 0 indicates no linear correlation. We can also produce confidence intervals on r. If r
is near -1 or 1, however, the intervals will not be symmetric, and a normal distribution is not
adequate to describe the sampling distribution of r.
The significance of a correlation is tested via a t test of r. As usual, the t statistic is a
difference divided by the standard error of that difference. In this case, t equals the
difference between our calculated r and any hypothetical value of rho (the parameter for the
relationship between X and Y), divided by the standard error in r.
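For the usual case of testing against ρ = 0, this works out to the following, with n - 2 degrees of freedom:

```latex
t = \frac{r - 0}{s_r},
\qquad
s_r = \sqrt{\frac{1 - r^2}{n - 2}}
```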
Non-parametric correlation is fairly common since the assumptions of correlation are
sweeping and difficult to test. Generally, we can examine the scatterplots to find evidence of
a non-linear relationship or violations of the assumption of equal variances among the
samples. Spearman's and Kendall's rank correlations are the most common non-parametric
equivalents to correlation. They have similar power and assumptions, and one of the two
should be used if the assumptions of parametric correlation do not seem to be met and
sample sizes are small.
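Outside the lab program, both rank correlations (and Pearson's r for comparison) are available in scipy; the data below are hypothetical:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.2, 8.9])
y = np.array([0.9, 2.1, 2.8, 5.2, 4.7, 6.9, 7.5, 8.1])

r, p_r = stats.pearsonr(x, y)          # parametric correlation
rho, p_rho = stats.spearmanr(x, y)     # Spearman's rank correlation
tau, p_tau = stats.kendalltau(x, y)    # Kendall's rank correlation
print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
print(f"Kendall tau = {tau:.3f} (p = {p_tau:.4f})")
```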
Using the Program
Beginning a regression analysis - To carry out either regression or correlation analysis we
need two continuous variables. To carry out a regression, choose fit Y by X and designate
the causative or predictive variable as X. Your Y variable should be the variable that is the
effect, which is being predicted. The computer will display a scatterplot of the data. It may be
possible to see trends in the data just by examining this plot. If you click on the fitting
button, you will be presented with a number of options. First, choose fit line. This will carry
out a simple linear regression. We won't worry about the summary tables for this line just
yet. For now, we want the regression line for its use in determining if this association is
linear. Once we have a fitted line displayed, clicking on the reveal button (the arrow-marked button) for the particular line will allow us to plot confidence intervals for that line.
Checking linearity - To test if the data are linear, carry out a spline fit from the same fitting
dialog box, with a lambda of about 1 or 10. Go higher or lower if these values show too
little or too much wobble and remember that we are just trying to see if the general trend of
the line is straight. This curved line should track more or less alongside the linear fit. We will
worry if there is a strong, systematic trend away from the straight line. The diagrams above
show us how we can correct some of the more common problems in linearity. The summary
table produced when we carry out a spline fit shows the r² value and sum of squares error for
the spline curve. These can be compared to the linear values to see whether there is much
improvement when we have a non-linear relationship. If a straight line is a good description
of the relationship between X and Y, the spline values will be similar to the linear values.
Use the spline r² only for this purpose, though. Since r² is an indicator of how tightly values
cluster around our line, we can increase r² to near 1 if we let the line wobble enough.
Checking variance - To test if the variances in Y are equal, we can examine a residual plot. If
the residuals form a fairly symmetric band across the graph, then our variances are roughly
equal. If the scatter of points around the residual line looks like the spray from the end of a
hose (variance in Y increases as X increases), we need to transform Y. In fact, any shape
other than a band of even width suggests we should transform Y. To call up a residual plot,
click on the reveal button for our linear fit.
Transformations - Generally, log (common or natural) transforms are the safest bet for data
sets which violate the assumptions of regression. There are two ways to transform data for a
regression analysis. The simplest way is to select fit transformed from the analysis dialog
box. This calls up another menu, where we can choose an appropriate transform for X or Y.
Unfortunately, this method of transforming will not adjust the axis scales on the scatterplot
and our assorted lines of fit. The axis scale on the residual plot will
also stay the same. The result is that it is difficult to see if our transformation improved our
ability to meet assumptions.
The second and better method of transforming your data is to create new columns of data by
applying a transform formula to our raw data. This is the same method we used in last
week's ANOVA lab. This method will produce new scales for the axes of our scatterplots and
residual plots, letting us judge whether the transform worked.
Regression Statistics - When we carry out a linear fit, the program produces summary tables
showing the ANOVA results, the equation of the line, the parameter estimates (a and b),
the standard errors for our parameters, t tests for both the intercept and the slope, and the
value for r².
In the table of parameter estimates, our estimate of the intercept is a, and our estimate for the
coefficient of X is b (listed as the estimate of X). The t statistic and p value for b constitute a t test
of the slope. The probability shown is a two-tailed probability, so you will have to adjust the p
value if you want to carry out a one-tailed test. The appropriate p value will either be one half
of the two-tailed value or 1 minus one half of the two-tailed value. You must work out which
is appropriate, based on your hypotheses. Clicking on the reveal button for the parameter
estimates table will provide a dialog box that lets you generate confidence intervals for the
estimates.
Correlation - To carry out a correlation, we can fit Y by X, designating either variable as X
or Y (they are interchangeable here since we are not designating a dependent and an
independent variable). Click on the fitting button from the resulting scatterplot and choose
density ellipses. The probability we choose for the density ellipse specifies a region within
which we expect that proportion of paired values from a bivariate normal population (one that
is normally distributed in both X and Y) to fall. The probability that we choose will
not affect the statistics for the correlation. The summary statistics appear in the bivariate
table. The values displayed here include r, and the probability of obtaining r (the p value
from a test of the significance of r).
A second way to carry out a correlation is to choose analyze, then correlation of Y's. Add
the two variables you want to check for a correlation. This will produce a report including r,
the correlation coefficient. Clicking on the check menu in the lower left of the box lets us do
pairwise correlations, whose summary table includes r and a significance test of r. These
values are identical to those produced by the density ellipse approach. Clicking on the check
menu again, we can call up correlations - nonpar, which will conduct a non-parametric rank
correlation analysis. The resulting summary table displays Kendall's and Spearman's
test statistics and p values. Clicking on the reveal button for this table lets us see the values
from a third non-parametric correlation test, the Hoeffding test, which is less frequently used
than the first two tests.
Problems
1. The original description of regression effects was from a study of parent heights and child
heights conducted by Sir Francis Galton in 1877. The data from this study are stored in the file
Galton.jmp in the shared directory. In this study parent height (X) is the predictor for child
height (Y).
a) Is a straight line adequate to describe the relationship between parent height and child
height?
b) Does variance in child height remain constant across all values of parent height?
c) Test the hypothesis that parent height can be used to predict child height.
d) Why do the confidence bands for the regression line curve outwards at their ends?
2. The nitrogen content of leaves (mmol/m²) is thought to be a predictor of photosynthetic
rate (mmol/m²/s) and productivity of agricultural crops. Both of these variables were
measured and stored in a file named NITRO.
a) Examine the data and decide if the assumptions of linearity and equal variance are met
for this data set.
b) Suppose there is a problem with the assumption of equal variance. Transform the
appropriate variable (by the method of adding a new column) and compare the residual
plots to see if there is any improvement. Can we see this improvement if we transform
the variable through the fit transform method?
c) Is it possible to predict photosynthetic rate based on the nitrogen content of leaves? If
so, describe the relationship algebraically.
d) How well does the relationship predict the photosynthetic rate?
3. Crows often steal food from birds of other species. To test whether potential victims are
sensitive to the presence of potential robbers, an experiment was conducted in which a model
crow was placed near the nests of pigeon guillemots. Observers recorded the amount of time
that arriving guillemots sat and waited before delivering food to their hungry nestlings. A
negative relationship between time waited and distance of the model from the nest would
suggest that would-be victims respond to the risk of robbery. Data from the experiment are
stored in the file ROBBER. Variable #1 is distance of the model crow from guillemot
nests (m), and variable #2 is amount of time that guillemots paused (min) before returning to
their nests.
a) Is a transformation required to linearize the data?
b) Is the assumption of equality of variances met in this example?
c) Test the hypothesis that pigeon guillemots are reluctant to approach their nests when a
potential robber is close at hand.
4. Brain weight (g) and testes weights (mg) were measured in 8 male chipmunks (Tamias
spp.). The data are stored in a file named CHIPMUNK.
a) Examine and comment on a scatter plot of the data.
b) Is there a relationship between brain weight and testes weight in these chipmunks?
Show all steps taken in testing the null hypothesis.
c) Try testing the significance of the correlation using whichever method you didn't use
in part b. Are the results the same?
d) Suppose a significant relationship was found between the variables measured in this
study. Is it possible to predict brain weight from measurements of testes weight in these
chipmunks?
e) What is the value of the correlation coefficient in this analysis? Do you feel that the
correlation coefficient indicates a strong relationship between the variables measured in
this study?
f) Suppose the bivariate normality assumption of correlation analysis is not met in this
study with its small sample size. What test would you use to determine the strength of the
relationship between the variables measured? Try it (show all steps taken in testing the
null hypothesis). Do you reach the same conclusion as before? If not, which conclusion
would you agree with?