BCS 398, Spring 2009
Data analysis

Regression and ANOVA

Summaries & Analyses

The kind of summary tables and graphs that you generate and the statistical analyses you do will depend on the specific questions you are trying to answer. For example, you might want to know if there is a statistically significant relationship between the height of a plant and the reproductive output of that plant (the number of seeds or fruits that it produced). To do this, you might first produce a graph of number of flowers (dependent variable) versus plant height (independent variable) to visually examine this relationship, and then perform a regression to see if there is a statistically significant relationship between the two variables. You can do both of these operations in Excel.

Another question you might ask is whether average plant height or reproductive output was different among four plot treatments (control, low nitrogen addition, high nitrogen addition, shrub removal). You might use a table or a bar graph to summarize these data, and an Analysis of Variance (ANOVA) to see if there were statistically significant differences among the treatments. Again, you can perform these operations in Excel.

The following pages provide some background information about two very useful statistical procedures, linear regression and Analysis of Variance (ANOVA). They also provide some hints about doing these analyses in Excel.

Linear Regression

Linear regression is a statistical procedure that tests for a linear (straight-line) relationship between two variables. To perform a linear regression in Excel, place your independent and dependent variables in two columns, with the paired values in the same rows. For example, put data for plant height in one column and data for number of flowers in a second column, with data for each individual plant in one row. There should be no empty cells in the ranges that you give for the independent and dependent variables.
Choose ‘Tools’, ‘Data Analysis’, and ‘Regression’ from the Excel menus, and enter the addresses for the independent (X) and dependent (Y) variables. You can also enter a cell address for the ‘output range’; this is where Excel will put the results of the analysis (if you don’t enter an output range, Excel will put the output on a new page). Choose an empty area in your spreadsheet that is at least 7 columns wide and 18 rows high.

Figure 1 shows an example of the resulting output (X values = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; Y values = 2, 4, 5, 8, 6, 9, 11, 13, 12, 15). If you haven’t done a regression in Excel before, you might want to enter these numbers (X values in one column, Y values in another) and see if you can match the results in Figure 1.

A class in statistics is not a prerequisite for this course, so a few of the terms in this regression output are defined here.

Multiple R: This is a measure of how tight the relationship is between these two variables. It ranges in value from 0 to 1.0; values close to 1 indicate a very tight relationship, "a good fit."

R Square: This is a measure of the proportion of the variation in the dependent variable that you can explain with the independent variable. Values range from 0 to 1.0; if your data points fall exactly on a straight line this value will equal 1, which means that your data can be completely explained by the equation for a line.

Observations: This is your sample size, or the number of data points that were used in the analysis. The more observations you have, the easier it is to identify a statistically significant relationship.

Intercept: The intercept (or Y intercept) is the value on the Y axis where the regression line crosses the Y axis. In the equation

    Y = mX + b    (1)

the intercept is ‘b’. The value of the intercept is given near the bottom of the output table in the column labeled ‘Coefficients’.

Slope: The slope is the rate at which the dependent (Y) value changes as the independent (X) value changes. In equation (1) above, the slope is represented by ‘m’. The value of the slope is given near the bottom of the output table immediately below the intercept (X Variable 1, Coefficients).

P-value: This is the probability that you would get a relationship at least as tight as the one in your data set by chance alone, if there were no true relationship between the variables. P-values will be between 0.0 and 1.0. A very small value (close to zero) indicates that there is very little chance that you would observe a relationship like this by chance alone. In a regression analysis, the P-values associated with the intercept and slope indicate the probability of obtaining estimates at least this far from zero if the true values were zero. Scientists typically conclude that a P-value that is less than or equal to 0.05 is statistically significant. Note, however, that a P-value of 0.05 suggests that you would observe a result at least this strong 1 out of 20 times, even if there were no biological relationship between the two variables. With a very large sample size it is possible to have a statistically significant relationship (p < 0.05) that explains a very small proportion of the variation in your dependent variable (R Square is very small).

In the example given in Figure 1 there is a highly statistically significant relationship between the two variables. Using the values in the ‘Coefficients’ column (slope = 1.364, intercept = 1.0), this relationship can be summarized with the following equation:

    Y = 1.364X + 1.0    (2)

Knowing the value of the independent variable, you can expect to explain nearly 94% of the variation in the dependent variable (Adjusted R Square = 0.937). You can be very confident that the slope is different from zero because the P-value for the X Variable 1 coefficient is very small (< 0.0001). You cannot be very confident that the intercept is different from zero because its P-value is 0.207, suggesting that you would see a value at least this different from zero more than 20% of the time even if the true intercept were zero.
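As a cross-check on the Excel output, the sample data from Figure 1 can also be fitted by hand. The following is a minimal Python sketch, not part of the Excel procedure described in this handout. The function name `simple_regression` and the dictionary keys are illustrative choices, and the code uses the standard least-squares formulas for a single independent variable. It does not compute P-values, which require the t distribution.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of Y = mX + b for paired observations (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)  # total SS
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_regression = slope * sxy
    ss_residual = syy - ss_regression

    r_square = ss_regression / syy
    adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
    standard_error = math.sqrt(ss_residual / (n - 2))
    slope_t_stat = slope / (standard_error / math.sqrt(sxx))

    return {
        "observations": n,
        "slope": slope,
        "intercept": intercept,
        "r_square": r_square,
        "adjusted_r_square": adjusted_r_square,
        "standard_error": standard_error,
        "slope_t_stat": slope_t_stat,
    }


# Sample data given for Figure 1
x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_values = [2, 4, 5, 8, 6, 9, 11, 13, 12, 15]

fit = simple_regression(x_values, y_values)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

The printed slope, intercept, R Square, Adjusted R Square, Standard Error, and t Stat can be compared line by line with the corresponding entries in Figure 1.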
When you report the results of a regression you don’t typically need to reproduce the entire output from Excel. For the purposes of this class, report the regression equation, the sample size, the P-value for the slope, and the R Square value. For example, you might report the results from the analysis in Figure 1 as follows: There was a statistically significant positive relationship between Y and X (Y = 1.364X + 1.0, n = 10, R Square = 0.944, p < 0.001).

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.972
R Square             0.944
Adjusted R Square    0.937
Standard Error       1.066
Observations         10

ANOVA
              df    SS         MS         F      Significance F
Regression    1     153.409    153.409    135    2.74051E-06
Residual      8     9.091      1.136
Total         9     162.5

                Coefficients   Standard Error   t Stat    P-value     Lower 95%   Upper 95%
Intercept       1              0.728            1.373     0.207       -0.679      2.679
X Variable 1    1.364          0.117            11.619    2.74E-06    1.093       1.634

Figure 1. Sample output of regression in Excel.

Analysis of Variance (ANOVA)

ANOVA is a statistical procedure that compares two or more groups (populations) to determine if there are statistically significant differences among those groups. For example, you might want to know if there were differences in the density of flowering (or non-flowering) plants on plots that received different treatments. ANOVA considers all the variation that exists within the entire sample (all plots), and determines what proportion of that variation can be explained by the experimental treatments (plot treatments). If a large proportion of the total variation can be explained by the experimental treatments, there will be a statistically significant treatment effect, and you will be able to state with some confidence that the treatments caused a difference in the variable you measured. If only a small proportion of the variation can be explained by experimental treatments, then it is likely that the treatments did not cause a difference in the variable you measured.
To perform an ANOVA in Excel, arrange your data so that values for each group are in a single column (Figure 2). From the ‘Tools’ menu, choose ‘Data Analysis’ and ‘ANOVA: Single Factor’. For ‘Input Range’, enter the block of cells that contains your data. Indicate that your data are grouped by columns, and give an address for the Output Range. The first portion of the output, labeled ‘SUMMARY’, gives some summary statistics for each of the groups in your data set. The second portion, labeled ‘ANOVA’, gives the results of the statistical test.

SS: stands for ‘Sum of Squares’. This is a measure of the variation in your data set.

SS Between Groups: This is a measure of the variation that exists between your groups. If this is large relative to the total SS, then there is a high probability that there really is a difference among the groups.

SS Within Groups: This is a measure of the variation that exists within your groups. If the variation within groups is as great as the variation among groups, then chances are your groups are not really different.

df: degrees of freedom. This is determined by the number of groups that you have (Between Groups df = number of groups minus 1) and the sample sizes of the groups (Within Groups df = total number of observations minus the number of groups). As with a regression, the larger your sample size, the more easily you can detect a statistically significant difference with an ANOVA.

P-value: (see discussion under Regression). For an ANOVA, the P-value indicates the probability that you would see differences among your groups at least as large as those observed by chance alone. A very small P-value (< 0.05) indicates that such differences would be unlikely if the groups were truly the same, so you can conclude with some confidence that there really are differences among the groups. A larger P-value suggests that you can’t identify differences among these groups with your data.
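For reference, the quantities defined above are related by the standard formulas for a single-factor ANOVA. The notation is introduced here for illustration and does not appear in the Excel output: k is the number of groups, n_i the number of observations in group i, N the total number of observations, \bar{y}_i the mean of group i, and \bar{y} the mean of all observations.

```latex
\begin{align*}
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \,(\bar{y}_i - \bar{y})^2,
  & df_{\text{between}} &= k - 1, \\
SS_{\text{within}} &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,
  & df_{\text{within}} &= N - k, \\
SS_{\text{total}} &= SS_{\text{between}} + SS_{\text{within}},
  & df_{\text{total}} &= N - 1, \\
MS &= \frac{SS}{df},
  & F &= \frac{MS_{\text{between}}}{MS_{\text{within}}}.
\end{align*}
```

The P-value reported by Excel is the probability of an F value at least this large, with these degrees of freedom, if the groups did not really differ.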
Group 1   Group 2   Group 3
1         7         14
3         8         15
4         5         19
2         9         11
5         6         26
6

Anova: Single Factor

SUMMARY
Groups      Count   Sum   Average   Variance
Column 1    6       21    3.5       3.5
Column 2    5       35    7         2.5
Column 3    5       85    17        33.5

ANOVA
Source of Variation   SS        df   MS        F        P-value       F crit
Between Groups        520.938   2    260.469   20.967   8.54501E-05   3.806
Within Groups         161.5     13   12.423
Total                 682.438   15

Figure 2. Sample output of ANOVA. Data used in the analyses are shown at the top; there were 3 groups with 5 or 6 samples in each group. In this example a large proportion of the total variation can be explained by the groups, and there is a statistically significant difference among groups (p < 0.001).
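The numbers in Figure 2 can be recomputed outside Excel as a check. The following is a minimal Python sketch, not part of the Excel procedure; the function names `one_way_anova` and `p_value_for_three_groups` are hypothetical helpers written for this example. The P-value shortcut is an assumption-laden special case: the closed form used here is valid only when the Between Groups df equals 2 (that is, exactly three groups, as in Figure 2). For other numbers of groups a statistics package would be needed.

```python
def one_way_anova(groups):
    """Single-factor ANOVA: sums of squares, degrees of freedom, and F (illustrative helper)."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


def p_value_for_three_groups(f_stat, df_within):
    """Upper-tail probability of F; this closed form holds ONLY when df_between == 2."""
    return (df_within / (df_within + 2 * f_stat)) ** (df_within / 2)


# Data from Figure 2 (Group 1 has six values, Groups 2 and 3 have five)
group_1 = [1, 3, 4, 2, 5, 6]
group_2 = [7, 8, 5, 9, 6]
group_3 = [14, 15, 19, 11, 26]

ss_between, ss_within, df_between, df_within, f_stat = one_way_anova(
    [group_1, group_2, group_3]
)
p_value = p_value_for_three_groups(f_stat, df_within)

print(f"SS between = {ss_between:.3f} (df = {df_between})")
print(f"SS within  = {ss_within:.3f} (df = {df_within})")
print(f"F = {f_stat:.3f}, P-value = {p_value:.3e}")
```

The printed values can be compared with the ‘ANOVA’ portion of Figure 2.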