ENVS 133: Winter 2006.
* Agricultural Statistics & Data Representation *
Part I:
Regarding Hypotheses and Hypothesis Design [1]
Developing a hypothesis is a prerequisite to statistical
design and analysis. There are two kinds of hypotheses: a
null hypothesis (H0) and an alternative hypothesis (H1). H0
can never be proven correct, although it can be rejected
with a known risk of being wrong (see significance levels
below).
For example, if H0 says that late-maturing broccoli
cultivars yield no more than early-maturing cultivars, but
you observed significantly higher yields in the late
cultivars in your experiment, then you would reject H0.
Hypothesis Testing and Significance Levels
A significance level can be thought of as the probability
of rejecting H0 when H0 is actually true. Because it is
next to impossible to be 100% sure that something is true
100% of the time, as scientists we allow some room for this
kind of error. Usually we like to make sure that we are at
least 95% sure we are right, but the higher this number the
better. We denote this by stating that the probability (P)
is ≤ 0.05, meaning that if H0 were true, a result like ours
would turn up by chance no more than 5% of the time.
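A small simulation may make the idea of a significance
level more concrete. The sketch below is not part of the
course materials (the course uses Excel; Python is used here
only for illustration). It repeatedly draws two samples from
the same population, so H0 is true by construction, and
counts how often a test still rejects H0. The population
mean (50), standard deviation (5), sample size, and the use
of a simple z-test with a known standard deviation are all
assumptions made to keep the example short.

```python
import random
from statistics import NormalDist, mean

def false_rejection_rate(alpha, trials=2000, n=10, seed=1):
    """Fraction of simulated experiments that reject a true H0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    sigma = 5.0  # assumed known population standard deviation
    rejections = 0
    for _ in range(trials):
        # Both samples come from the same population, so H0 is true.
        a = [rng.gauss(50, sigma) for _ in range(n)]
        b = [rng.gauss(50, sigma) for _ in range(n)]
        se_diff = sigma * (2 / n) ** 0.5
        z = (mean(a) - mean(b)) / se_diff
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided P value
        if p <= alpha:
            rejections += 1
    return rejections / trials

rate_05 = false_rejection_rate(0.05)
rate_10 = false_rejection_rate(0.10)
print(f"P <= 0.05 rejects a true H0 in {rate_05:.1%} of runs")
print(f"P <= 0.10 rejects a true H0 in {rate_10:.1%} of runs")
```

The two rates should land near 5% and 10%: the significance
level is the share of experiments in which we would wrongly
reject a true H0.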
When we loosen this to P ≤ 0.1, we increase the chance of
a Type I error: rejecting H0 when it is actually true. A
Type II error is the opposite: failing to reject H0 when
it is in fact false. There is always a tradeoff, and the
only way to reduce one error without increasing the other
is to improve the experimental design (through increased
replication and sample size).
[1] Note that much of this information was based on materials borrowed from the Agronomy 206 webpage (UC Davis). This page has a
lot of useful and downloadable material on agricultural statistics. http://www.agronomy.ucdavis.edu/agr205/
Data Summarization
• Mean: The mean is just the sum of all the observations
divided by the number of observations.
• Median: The number that separates the higher half of a
population or a probability distribution from the lower
half (i.e. the middle value in a distribution, above and
below which lie an equal number of values). If you have a
sequence of data <1,2,3,4,5,6,7,8,9>, 5 is the median. To
find the median of a finite list of numbers, arrange all
the observations from lowest value to highest value and
pick the middle one. If the list has an even number of
observations, take the average of the two middle values.
• Mode: The mode is the value that has the largest number
of observations, namely the most frequent value within a
particular set of values. If you have a data sequence
<1, 3, 6, 6, 6, 7, 7, 12, 12, 17>, the mode is 6.
• Variance: A statistical measure of the dispersion of a
set of values about its mean. The larger the variance, the
more widely the values are spread around the mean of the
data set. The variance is the square of the standard
deviation.
• Standard Deviation: Introduced to statistics by Karl
Pearson (1894), the standard deviation is the most commonly
used measure of statistical dispersion. Simply put, it
measures how spread out the values in a data set are. The
standard deviation is defined as the square root of the
variance. In a normal distribution, about 68% of all
measurements fall within one standard deviation of the
average, and about 95% of all measurements fall within two
standard deviations of the average.
• Standard Error: The standard error of the mean is the
standard deviation of the sample divided by the square root
of the sample size (N). Literally, it is a measure of an
estimate's variability. Standard errors are important
because they reflect how much sampling fluctuation a
statistic will show. The standard error of a statistic
depends on the sample size. In general, the larger the
sample size the smaller the standard error.
• Degrees of Freedom (Df): For a single sample, the degrees
of freedom is the sample size minus one. For example, where
N=77, the Df=76. This is important for performing numerous
statistical tests.
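The summary measures above can be worked through in a few
lines. This is only a sketch (Python rather than the Excel
used in lab), applied to the data sequence from the mode
example. It assumes the sample versions of the variance and
standard deviation (dividing by N - 1, the degrees of
freedom), which is what Python's statistics module computes
by default.

```python
import math
import statistics

data = [1, 3, 6, 6, 6, 7, 7, 12, 12, 17]  # sequence from the mode example

n = len(data)
mean = statistics.mean(data)          # sum of observations / number of them
median = statistics.median(data)      # even N: average of the two middle values
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance (divides by N - 1)
sd = statistics.stdev(data)           # square root of the variance
se = sd / math.sqrt(n)                # standard error of the mean
df = n - 1                            # degrees of freedom

print(f"N={n} mean={mean} median={median} mode={mode}")
print(f"variance={variance:.2f} sd={sd:.2f} se={se:.2f} df={df}")
```

Note that the standard error is smaller than the standard
deviation, and it shrinks further as N grows.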
Hypothesis Testing
Have a read through the material below. Concentrate on
getting the concepts. We will have a chance in lab to play
with these tests and do graphic representation of data. I
have left out the equations that would normally accompany a
handout like this (I am not a big fan of equations), but
they are important and you will eventually need to learn
them. For now, because we will be using Excel for
manipulating data, we can ignore them.
A. Parametric Tests
Parametric Tests: These tests rely on the assumption that
the data are normally distributed. Note that there are a
variety of tests to deal with non-normally distributed data,
although we will not encounter these more complex topics
during this course.
Normal distributions are a family of distributions that have
the same general shape. They are symmetric with data more
concentrated in the middle than in the tails (see graphic).
Normal distributions are sometimes described as bell shaped.
Notice that distributions can differ in spread.
[Figure: normal distributions (image not available)]
• Student's t-test
Definition: "Student" was the pen name of W. S. Gosset
[1876-1937], who developed statistical methods to solve
production problems at the Guinness brewery where he
worked. He was a mathematical genius, and the t-test is
his crowning achievement. The t-test is a test of the
null hypothesis that the means of two normally
distributed populations are equal. All such tests are
usually referred to as Student's t-tests. We use this
test for comparing the means of two treatments, even if
they have different numbers of replicates. In simple
terms, the t-test compares the actual difference between
two means in relation to the variation in the data
(expressed as the standard error of the difference
between the means).
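As a rough sketch of what the t-test does, the code below
divides the difference between two treatment means by the
standard error of that difference. The yield numbers are
made up for illustration, and the pooled-variance form of
the test (which assumes both treatments have similar
variances) is an assumption; Excel and other packages also
offer an unequal-variance version.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((na - 1) * statistics.variance(sample_a)
                  + (nb - 1) * statistics.variance(sample_b)) / df
    se_diff = math.sqrt(pooled_var * (1 / na + 1 / nb))
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return mean_diff / se_diff, df

# Hypothetical yields; the treatments have different numbers of replicates.
late = [5.1, 5.6, 5.4, 5.9, 5.5]
early = [4.6, 4.9, 4.4, 5.0, 4.7, 4.8]

t, df = pooled_t(late, early)
critical_t = 2.262  # two-tailed critical value from a t table, P = 0.05, Df = 9
print(f"t = {t:.2f} with Df = {df}")
print("reject H0" if abs(t) > critical_t else "do not reject H0")
```

If the absolute value of t exceeds the table value for your
degrees of freedom, the difference between the means is
significant at that level.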

• Chi-square
Definition: A statistic used to compare frequencies of
two or more groups. Chi-square is a test to determine
the probability that an observed deviation from the
expected event or outcome occurs solely by chance. In
ecology the most common application for chi-square is
in comparing observed counts of particular cases
(species, for example) to the expected counts.
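A minimal sketch of the chi-square calculation, comparing
observed counts to expected counts. The species counts are
hypothetical, and the expectation of equal counts in each
category is an assumption chosen for the example.

```python
# Hypothetical counts of three species in a survey of 60 individuals.
observed = [30, 14, 16]
# Assumed expectation: the three species are equally common.
expected = [sum(observed) / len(observed)] * len(observed)

# Sum of (observed - expected)^2 / expected over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical_value = 5.991  # chi-square table value for P = 0.05, Df = 2

print(f"chi-square = {chi_square:.2f} with Df = {df}")
print("reject H0" if chi_square > critical_value else "do not reject H0")
```

The larger the gap between observed and expected counts,
the larger the statistic, and the less likely the deviation
arose by chance alone.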

• Analysis of Variance (ANOVA)
Definition: In statistics, analysis of variance (ANOVA)
is a collection of statistical procedures that compare
means by splitting the overall observed variance into
different parts. Analysis of variance was pioneered by
the statistician and agricultural geneticist Ronald
Fisher in the 1920s and 1930s, and is sometimes known as
Fisher's ANOVA or Fisher's analysis of variance. ANOVA
is used to test for significant differences among the
means of two or more treatments. It is especially useful
where you suspect that you might have an environmental
gradient across your treatments; for example, with a
blocked design it can separate out the effects of
variability in soil types in a field experiment.
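To show what "splitting the variance into parts" means, here
is a sketch of a one-way ANOVA computed by hand. The three
treatments and their values are invented for illustration,
and this simple version has no blocking term, so it does not
by itself handle an environmental gradient.

```python
import statistics

# Hypothetical yields for three treatments, three replicates each.
groups = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [9, 10, 11],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = statistics.mean(all_values)

# Variation between treatment means (explained by the treatments).
ss_between = sum(len(vals) * (statistics.mean(vals) - grand_mean) ** 2
                 for vals in groups.values())
# Variation within each treatment (unexplained, or error, variation).
ss_within = sum((v - statistics.mean(vals)) ** 2
                for vals in groups.values() for v in vals)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_ratio:.2f} with Df = ({df_between}, {df_within})")
```

A large F means the differences among treatment means are
big relative to the scatter within treatments. You would
compare F to a table value for these degrees of freedom
(about 5.14 at P = 0.05 for Df = 2 and 6).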
Parametric tests focused on relationships between
independent and dependent variables (e.g. biomass and weed
density):
• Pearson’s Correlation (generally just called a
correlation)
Definition: A numerical measure of association between two
variables that reflects the degree to which the variables
are related. Correlation is a statistical technique which
can show whether and how strongly pairs of variables are
related. For example, height and weight are related: taller
people tend to be heavier than shorter people. The
relationship isn't perfect, but it is somewhat correlated.
The correlation coefficient (r) ranges from -1 (a perfect
negative relationship) through 0 (no linear relationship)
to +1 (a perfect positive relationship).
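A short sketch of how Pearson's correlation coefficient is
calculated, using the biomass and weed density example
mentioned above. The numbers are hypothetical.

```python
import math
import statistics

# Hypothetical paired observations from five plots.
weed_density = [1, 2, 3, 4, 5]          # weeds per square metre
biomass = [9.8, 8.1, 6.2, 3.9, 2.0]     # crop biomass per plot

mean_x = statistics.mean(weed_density)
mean_y = statistics.mean(biomass)

# Sums of products and squares of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weed_density, biomass))
s_xx = sum((x - mean_x) ** 2 for x in weed_density)
s_yy = sum((y - mean_y) ** 2 for y in biomass)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.3f}")
```

Here r comes out close to -1: as weed density goes up,
biomass goes down almost in a straight line.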
• Linear Regressions
Definition: Linear regression analyzes the relationship
between two variables, X and Y. A classic statistical
problem is to try to determine the actual relationship
between two random variables X and Y. For example, we might
consider height and weight of a sample of adults. Linear
regression attempts to explain this relationship with a
straight line fit to the data. For each subject (or
experimental unit), you know both X and Y and you want to
find the best straight line through the data. In some
situations, the slope and/or intercept have a scientific
meaning. In other cases, you use the linear regression line
as a standard curve to find new values of X from Y, or Y
from X.
A linear regression line has an equation of the form Y = A +
bX, where X is the explanatory variable and Y is the
dependent variable. The slope of the line is b, and A is the
intercept (the value of Y when X = 0).
The value R² is a fraction between 0.0 and 1.0, and has no
units. An R² value of 0.0 means that knowing X does not
help you predict Y. When R² equals 1.0, all points lie
exactly on a straight line with no scatter, and knowing X
lets you predict Y perfectly.
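The line Y = A + bX and its R² can be sketched with an
ordinary least-squares fit, which is an assumption about
what "best straight line" means here (it is the usual
choice). The data are hypothetical.

```python
import statistics

# Hypothetical data: X = weed density, Y = crop biomass.
xs = [1, 2, 3, 4, 5]
ys = [9.8, 8.1, 6.2, 3.9, 2.0]

mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)

b = s_xy / s_xx            # slope
A = mean_y - b * mean_x    # intercept: the value of Y when X = 0

# R^2 = 1 - (scatter around the line) / (scatter around the mean of Y).
ss_residual = sum((y - (A + b * x)) ** 2 for x, y in zip(xs, ys))
ss_total = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_residual / ss_total

print(f"Y = {A:.2f} + ({b:.2f})X, R^2 = {r_squared:.3f}")
print(f"predicted Y at X = 2.5: {A + b * 2.5:.2f}")
```

The last line uses the fitted line as a standard curve to
predict Y for a new value of X.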
[Figure not available]
Question: What is wrong with how this graph is presented to
you?
Data Representation in Graphs [2]
• General Points on Chart-making
[Figure not available]
[2] Many of these graphs are sourced from: http://lilt.ilstu.edu/gmklass/pos138/datadisplay/
[Figure not available]
[Figure not available]
Note: I wouldn’t actually say this chart is complete.
What might be missing?
[Figure not available]
Note that it is difficult to read the information here
because of unnecessary 3D effects, poor labeling and the
grey background.
[Figure not available]
Note: Although these don’t have them, wherever possible it
is best to include error bars on charts and on graphs.
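As a sketch of where error bars come from, the code below
computes the two numbers you would plot for each treatment:
the mean (bar height) and the standard error (error bar
length). The treatment names and yields are hypothetical,
and using the standard error rather than the standard
deviation for the bars is an assumption; either is seen in
practice, so say which one you used in the caption.

```python
import math
import statistics

# Hypothetical yields from four replicate plots per treatment.
plots = {
    "control": [4.1, 4.5, 3.9, 4.3],
    "compost": [5.0, 5.4, 4.8, 5.6],
}

error_bars = {}
for treatment, values in plots.items():
    bar_height = statistics.mean(values)
    bar_error = statistics.stdev(values) / math.sqrt(len(values))
    error_bars[treatment] = (bar_height, bar_error)

for treatment, (bar_height, bar_error) in error_bars.items():
    print(f"{treatment}: {bar_height:.2f} +/- {bar_error:.2f}")
```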