BCS 398, Spring 2009
Data analysis
Regression and ANOVA
Summaries & Analyses
The kind of summary tables and
graphs that you generate and the
statistical analyses you do will depend
on the specific questions you are trying
to answer. For example, you might want
to know if there is a statistically
significant relationship between the
height of a plant and the reproductive
output of that plant (the number of seeds
or fruits that it produced). To do this,
you might first produce a graph of # of
flowers (dependent variable) versus
plant height (independent variable) to
visually examine this relationship, and
then perform a regression to see if there
is a statistically significant relationship
between the two variables. You can do
both of these operations in Excel.
Another question you might ask
is whether average plant height or
reproductive output was different among
four plot treatments (control, low
nitrogen addition, high nitrogen addition,
shrub removal). You might use a table or
a bar graph to summarize these data, and
an Analysis of Variance (ANOVA) to
see if there were statistically significant
differences among the treatments. Again,
you can perform these operations in
Excel.
The following pages provide
some background information about two
very useful statistical procedures, linear
regression and Analysis of Variance
(ANOVA). They also provide some hints
about doing these analyses in Excel.
Linear Regression
Linear regression is a statistical
procedure that tests for a linear (straight-line) relationship between two variables.
To perform a linear regression in
Excel, place your independent and
dependent variables in two columns,
with the paired values in the same rows.
For example, put data for plant height in
one column and data for # of flowers in a
second column, with data for each
individual plant in one row. There
should be no empty cells in the ranges
that you give for the independent and
dependent variables.
Choose ‘Tools’, ‘Data Analysis’,
and ‘Regression’ from the Excel menus,
and enter the addresses for the
independent (X) and dependent (Y)
variables. You also can enter a cell
address for the ‘output range’; this is
where Excel will put the results of the
analysis (if you don’t enter an output
range, Excel will put the output on a new
worksheet). Choose an empty area in your
spreadsheet that is at least 7 columns
wide and 18 rows high. Figure 1 shows
an example of the resulting output (X
values = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; Y
values = 2, 4, 5, 8, 6, 9, 11, 13, 12, 15).
If you haven’t done a regression in Excel
before, you might want to enter these
numbers (X values in one column, Y
values in another) and see if you can
match the results in Figure 1.
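If you would like to check the Excel output independently, the fit can also be computed by hand. The following is a minimal sketch in Python (not part of the Excel workflow described here), using only the example X and Y values given above; the variable names are illustrative.

```python
# Ordinary least-squares fit of the example data, standard library only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 8, 6, 9, 11, 13, 12, 15]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                      # "X Variable 1" coefficient
intercept = mean_y - slope * mean_x    # "Intercept" coefficient
r_square = sxy ** 2 / (sxx * syy)      # proportion of variation explained
multiple_r = r_square ** 0.5

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R Square = {r_square:.3f}, Multiple R = {multiple_r:.3f}")
```

The printed values should agree, to three decimal places, with the Coefficients, R Square, and Multiple R entries in Figure 1.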
A class in statistics is not a
prerequisite for this course, so a few of
the terms in this regression output are
defined here.
Multiple R: This is a measure of how
tight the relationship is between these
two variables. This ranges in value from
0 - 1.0; values close to 1 indicate a very
tight relationship - "a good fit."
R Square: This is a measure of the
proportion of the variation in the
dependent variable that you can explain
with the independent variable. Values
range from zero to 1.0; if your data
points fall exactly on a straight line this
value will equal 1. What that means is
that your data can be completely
explained by the equation for a line!
Observations: This is your sample size,
or the number of data points that were
used in the analysis. The more
observations you have, the easier it is to
identify a statistically significant
relationship.
Intercept: The intercept (or Y intercept)
is the value on the Y axis where the
regression line intercepts the Y axis. In
the equation
Y = mX + b    (1)
the intercept is ‘b’. The value of the
intercept is given near the bottom of the
output table in the column labeled
‘Coefficients’.
Slope: The slope is the rate at which the
dependent (Y) value changes as the
independent (X) value changes. In
equation (1) above, the slope is
represented by ‘m’. The value of the
slope is given near the bottom of the
output table immediately below the
intercept (X Variable 1, Coefficient).
P-value: This is the probability that you
would get a relationship at least as tight
as the one in your data set by chance
alone, that is, if there were no real
relationship between the variables.
P-values will be between 0.0 and 1.0. A
very small value (close to zero) indicates
that there is very little chance that you
would observe a relationship like this by
chance alone. In a regression analysis,
the P-values associated with the intercept
and slope give the probability of seeing
estimates at least this far from zero if the
true intercept or slope were zero.
Scientists typically conclude that a
P-value that is less than or equal to 0.05 is
statistically significant. Note, however,
that with a cutoff of 0.05 you would
declare a result significant about 1 out of
20 times even if there were no biological
relationship between the two variables.
With a very large sample size it is
possible to have a statistically significant
relationship (p < 0.05) that explains a
very small proportion of the variation in
your dependent variable (R Square is
very small).
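For readers who want to see what is behind these terms, the standard least-squares definitions can be written out using the symbols of equation (1). This is a supplementary sketch, not something Excel asks you to enter; the bars denote sample means and the sums run over all observations.

```latex
\begin{align*}
m &= \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \\
b &= \bar{Y} - m\,\bar{X}, \\
R^2 &= 1 - \frac{\sum_i \bigl(Y_i - (mX_i + b)\bigr)^2}{\sum_i (Y_i - \bar{Y})^2}.
\end{align*}
```

R Square is therefore 1 when every point lies exactly on the line (the numerator of the fraction is zero) and close to 0 when the line explains almost none of the variation in Y.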
In the example given in Figure 1 there is
a highly statistically significant
relationship between the two variables.
This relationship can be summarized
with the following equation:
Y = 1.364X + 1.0    (2)
Knowing the value of the independent
variable, you can expect to explain
nearly 94% of the variation in the
dependent variable (Adjusted R-Square
= 0.937). You can be very confident that
the slope is different from zero because
the P-value for the X Variable 1
Coefficient is very small (<0.0001). You
cannot be very confident that the
intercept is different from zero because
its P-value is 0.207, suggesting that you
would see a value equally different from
zero more than 20% of the time even if
the true intercept were zero.
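The standard errors, t statistics, and 95% confidence limits in Figure 1 can be reproduced from the same example data. The sketch below is an illustration in Python, not part of the Excel procedure. It assumes the usual simple-regression formulas and takes the critical t value for 8 degrees of freedom (2.306) from a standard t table rather than computing it.

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 8, 6, 9, 11, 13, 12, 15]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual (unexplained) variation around the fitted line.
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
df_resid = n - 2
s = math.sqrt(ss_resid / df_resid)  # "Standard Error" under Regression Statistics

# Standard errors of the two coefficients.
se_slope = s / math.sqrt(sxx)
se_intercept = s * math.sqrt(1 / n + mean_x ** 2 / sxx)

# t statistics: how many standard errors each estimate lies from zero.
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

# Assumption: two-sided 95% critical value of t with 8 df, from a t table.
T_CRIT = 2.306
ci_slope = (slope - T_CRIT * se_slope, slope + T_CRIT * se_slope)
ci_intercept = (intercept - T_CRIT * se_intercept,
                intercept + T_CRIT * se_intercept)

print(f"slope: t = {t_slope:.3f}, 95% CI = {ci_slope[0]:.3f} to {ci_slope[1]:.3f}")
print(f"intercept: t = {t_intercept:.3f}, "
      f"95% CI = {ci_intercept[0]:.3f} to {ci_intercept[1]:.3f}")
```

The confidence interval for the intercept includes zero while the interval for the slope does not, which is another way of seeing why only the slope is judged different from zero.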
When you report the results of a
regression you don’t typically need to
reproduce the entire output from Excel.
For the purposes of this class, report the
regression equation, the sample size, the
P-value for the slope, and the R Square
value. For example, you might report the
results from the analysis in Figure 1 as
follows:
There was a statistically significant
positive relationship between Y and
X (Y = 1.364X + 1.0, n = 10,
R Square = 0.944, p < 0.001).
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.972
R Square            0.944
Adjusted R Square   0.937
Standard Error      1.066
Observations        10

ANOVA
             df   SS        MS        F     Significance F
Regression   1    153.409   153.409   135   2.74051E-06
Residual     8    9.091     1.136
Total        9    162.5

               Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept      1              0.728            1.373    0.207      -0.679      2.679
X Variable 1   1.364          0.117            11.619   2.74E-06   1.093       1.634

Figure 1. Sample output of regression in Excel.
Analysis of Variance (ANOVA)
ANOVA is a statistical procedure that
compares two or more groups
(populations) to determine if there are
statistically significant differences
among those groups. For example, you
might want to know if there were
differences in the density of flowering
(or non-flowering) plants on plots that
received different treatments.
ANOVA considers all the variation that
exists within the entire sample (all
plots), and determines what proportion
of that variation can be explained by the
experimental treatments (plot
treatments). If a large proportion of the
total variation can be explained by the
experimental treatments, there will be a
statistically significant treatment effect,
and you will be able to state with some
confidence that the treatments caused a
difference in the variable you measured.
If only a small proportion of the
variation can be explained by
experimental treatments, then it is likely
that the treatments did not cause a
difference in the variable you measured.
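The partitioning of variation described above can be written compactly. This is a supplementary sketch using the standard one-way ANOVA identities; the subscripts are illustrative labels that correspond to the 'Between Groups', 'Within Groups', and 'Total' rows of Excel's output.

```latex
\begin{align*}
SS_{\text{total}} &= SS_{\text{between}} + SS_{\text{within}}, \\
\text{proportion explained} &= \frac{SS_{\text{between}}}{SS_{\text{total}}}, \\
F &= \frac{SS_{\text{between}} / df_{\text{between}}}{SS_{\text{within}} / df_{\text{within}}}.
\end{align*}
```

In the example shown in Figure 2, the between-groups sum of squares is 520.938 out of a total of 682.438, so the groups account for roughly 76% of the total variation.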
To perform an ANOVA in Excel arrange
your data so that values for each group
are in a single column (Figure 2). From
the ‘Tools’ menu, choose ‘Data
Analysis’ and ‘ANOVA: Single Factor’.
For ‘Input Range’, enter the block of
cells that contains your data. Indicate
that your data are grouped by columns,
and give an address for the Output
Range.
The first portion of the output, labeled
‘SUMMARY’ gives some summary
statistics for each of the groups in your
data set.
The second portion, labeled ‘ANOVA’,
gives the results of the statistical test.
SS: stands for ‘Sum of Squares’. This is
a measure of the variation in your data
set.
SS Between Groups: This is a measure
of the variation that exists between your
groups. If this is large relative to the total
SS, then there is a high probability that
there really is a difference among the
groups.
SS Within Groups: This is a measure
of the variation that exists within your
groups. If the variation within groups is
as great as the variation among groups,
then chances are your groups are not
really different.
df: degrees of freedom. This is
determined by the number of groups that
you have (Between Groups df) and the
sample sizes of the groups (Within
Groups df). Similar to a regression, the
larger your sample size the more easily
you can detect a statistically significant
relationship with an ANOVA.
P-value: (see discussion under
Regression). For an ANOVA, the P-value indicates the probability that you
would see similar differences among
your groups by chance alone. A very
small P-value (< 0.05) indicates that
there is a good chance that there really
are differences among the groups. A
larger P-value suggests that you can’t
identify differences among these groups
with your data.
Group 1   Group 2   Group 3
1         7         14
3         8         15
4         5         19
2         9         11
5         6         26
6

Anova: Single Factor

SUMMARY
Groups     Count   Sum   Average   Variance
Column 1   6       21    3.5       3.5
Column 2   5       35    7         2.5
Column 3   5       85    17        33.5

ANOVA
Source of Variation   SS        df   MS        F        P-value       F crit
Between Groups        520.938   2    260.469   20.967   8.54501E-05   3.806
Within Groups         161.5     13   12.423
Total                 682.438   15
Figure 2. Sample output of ANOVA. Data used in the analyses are in the upper left corner; there were 3 groups with
5 or 6 samples in each group. In this example a large proportion of the total variation can be explained by the groups,
and there is a statistically significant difference among groups (p < 0.001).
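The sums of squares, degrees of freedom, and F value in Figure 2 can be reproduced by hand. The following is a minimal sketch in Python, offered as a cross-check rather than as part of the Excel procedure. The assignment of values to groups is reconstructed from the figure and agrees with the counts and sums in its SUMMARY table; the variable names are illustrative.

```python
# One-way ANOVA computed from first principles, standard library only.
groups = {
    "Group 1": [1, 3, 4, 2, 5, 6],
    "Group 2": [7, 8, 5, 9, 6],
    "Group 3": [14, 15, 19, 11, 26],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups SS: squared distance of each group mean from the grand
# mean, weighted by the group's sample size.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-groups SS: squared distance of each value from its own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1                # number of groups minus 1
df_within = len(all_values) - len(groups)   # observations minus groups

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.3f} (df = {df_between})")
print(f"SS within  = {ss_within:.3f} (df = {df_within})")
print(f"F = {f_stat:.3f}")
```

Because the computed F is much larger than the 'F crit' value reported by Excel (3.806), the differences among groups are statistically significant at the 0.05 level.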