Outline - Ohio University

1/14/13
Welcome to
PBIO 3150-5150
Statistical Methods in Plant Biology
(aka Biostatistics)
Spring 2013
OUTLINE
Review syllabus & introduction
Use & misuse of statistics
Statistics and biological data explained
Samples, populations, estimation
Intro to sampling design
Accuracy vs. precision
Types of variables
Frequency & probability distributions
Getting started: an example
Standard measures of central tendency
Mean, median, mode
Other means
Weighted, geometric, harmonic
Course Goals & Objectives
1. To provide you with an overview of the statistical
tools and procedures required to:
a. Evaluate the biological literature
b. Conduct original research
i. Design experiments
ii. Collect & interpret data
2. To familiarize you with the state-of-the-art
software (R) used to analyse and report
scientific data:
a. Data manipulation
b. Graphics & presentation
c. Statistics
Statistical Pedagogy
What literature exists on the subject suggests that applied statistics is
learned best when:
a)  The material is subject specific (so we will use only biological
examples).
b)  Students have the opportunity to work through the material in a
step-by-step fashion (so you will do coded examples with R, as
well as exams).
c)  There is a practical element to learning (i.e., practically you will
only ever practice statistics on computers, so we will use
computers for everything including exams).
This course has been designed entirely around these core principles!
Personal Goal
1. To provide you with the best possible overview of
descriptive and experimental statistics within the
confines of a 10-wk academic quarter.
2. To heighten your awareness and critical thinking
with respect to statistical designs and biological
questions & ultimately do better research.
Hmm…is this
good statistics?
Use of Statistics
To what extent has the importance of statistics
in the biological sciences changed over the
last 100 years?
Survey conducted examining 11 decennial
volumes of The American Naturalist.
This journal has wide coverage and is
presumably a good indicator.
96%
(Sokal and Rohlf 1995).
Why Such an Increase in the Use
of Statistics in Biology?
Realization that most biological systems are not
deterministic but rather probabilistic.
Statistical thinking parallels ordinary scientific
thinking. We wish to quantify observations.
We express phenomena as a statement of
probability rather than as a vague general
statement.
The Future…
The use of quantitative data and major mathematical models will
only continue to increase (in all sub-disciplines of biology). The
R programming language is being increasingly used towards
this end (and therefore widely incorporated into this course).
We have many of the major biological patterns described, but
because of the variability inherent in the natural world we do not
yet understand many of the underlying processes. This will
require increasingly specialized quantitative skills.
Biology is not as straightforward as
physics or math—the rules are
different!
Misuse of Statistics
Statistics have frequently
been used to hide or
obfuscate important
information (usually where
economic or political gain
was at stake).
This led to the well-known
quote attributed to British Prime
Minister Disraeli: "there are
three forms of falsehood in
the universe: lies, damned
lies, and statistics."
Misuse of Statistics
- Example -
U.S. Economy
Post-Depression
Two graphs,
same data, two
diametrically
opposed
conclusions!
Source: Huff (1954) How to Lie With Statistics
Don't Underestimate
Incompetence
Important Take-Home Point:
A statistical test is only as good as
the data it is supposed to test!
Virtually any experiment can yield
data, sophisticated statistics can
be employed, fanciful computer
software applied, and erudite
conclusions can be drawn…but, of
what biological relevance???
What is Statistics?
•  Statistics is a technology that describes and
measures aspects of nature from samples.
•  Statistics allows us to quantify the uncertainty of
these measurements (i.e., what is their departure
from the truth?).
•  Statistics is about estimation, the process of
inferring an unknown quantity of a target
population using sample data.
Statistics
•  A population is all the individuals
of interest (what we are trying to describe).
•  A sample is the subset of observations that we
select from the population to describe it.
•  Parameters are quantities describing the
population (unknown most of the time).
•  Estimates (or statistics) are the measures used
to approximate the parameters.
Observations, Samples, & Populations
[Figure: observations drawn from the population (µ, σ) form the sample (X̄, s); statistics computed from the sample are used for inference (estimation) about the population.]
Good Samples
•  Obviously then, the sample completely controls
our view of the population.
•  Chance alone influences sampling error
(difference between estimates and parameters).
•  We need to collect a sample that is both accurate
and precise.
•  Bias is another form of error. It is a systematic
discrepancy between estimates and parameters.
Random Sampling
•  In order for a sample to be random, two
criteria must be met:
– Each member of the population has an equal
chance of being part of the sample, and
– Each observation is independent of every
other observation.
•  Random sampling does two things:
– Minimizes bias
– Permits measurement of sampling error
How to Take a Random Sample
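One way to meet both criteria in practice is to number every member of the population and let software choose the numbers. The sketch below assumes a hypothetical nursery of 500 numbered seedlings (the population size and the seed are illustrative, not from the lecture) and uses R's sample() to draw 12 labels without replacement, so each seedling has the same chance of being selected.

```r
# Hypothetical population: 500 seedlings labelled 1..500 (assumed size)
population_ids <- 1:500

set.seed(42)  # arbitrary seed so the same draw can be repeated

# Draw N = 12 labels without replacement; every label is equally likely
sample_ids <- sample(population_ids, size = 12, replace = FALSE)
sort(sample_ids)
```

The selected labels identify which seedlings to measure; a different seed gives a different, equally valid random sample.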
Beware!
•  Be vigilant as to how the sample is collected.
•  Samples must be random and representative.
•  Avoid the sample of convenience (individuals
that are easily available to the researcher), which
is invariably biased.
Variables & Characters
Variable
The characteristics that differ among individuals. The
actual property measured on the individuals selected
for the sample. Most general term commonly used in
biological statistics.
Character
Synonym for variable. Used most by evolutionary
biologists & systematists.
Data Structure
Univariate
One variable is measured per observation
Bivariate
Two variables are measured per observation
Multivariate
Three or more variables are measured per
observation
Types of Variables
1. Measurement (Numerical) Variables
a. Continuous variables
b. Discontinuous variables
2. Ranked Variables
3. Categorical (Attribute) Variables
Measurement Variables
Those whose differing states can be expressed in
a numerically ordered fashion
Continuous variables are those that
have a theoretical infinite number of
finer gradations between any two
points (e.g., length, mass, size).
Discontinuous (a.k.a. meristic) variables are those
with certain fixed discrete values with no
intermediates possible (e.g., number of leaves or
teeth).
Ranked & Attribute Variables
Ranked variables are those that cannot
be measured, but can be ordered
(e.g., rank order of pupa emergence,
or, seed germination: 1,2,3, etc.).
Attribute variables (also known as categorical or
nominal variables) are those that cannot be
measured, but can be scored for
certain criteria (e.g., individual
dead/alive, or, flower color
red/white/pink).
Frequency Distributions
•  Different individuals (observations) in a sample
will have different measurements.
•  This variability is most easily seen with a
frequency distribution (histogram).
Probability Distributions
•  The frequency distribution describes the
number of times each value occurs in a sample.
•  The distribution of a variable in the whole
population is called the probability distribution.
•  For a continuous measurement variable, the
distribution is usually approximated by the
theoretical distribution known as the normal
distribution.
Normal Distribution
The Probability Density Function
The probability distribution can also be looked at as a
density function. We can use the calculus to describe
the areas under each part of the curve.
What proportion of
measurements exists between
time intervals 4 & 6?
[Figure: probability density f(y) plotted against y (time), with the time axis running from 1 to 16.]
Probability Density Function
•  Our knowledge of the PDF for a standard normal
curve permits much of the inference we desire by
allowing us to assign levels of uncertainty
(probability) to our sample estimates.
•  We will describe the detailed properties of the
SNC and hypothesis testing in a subsequent
lecture.
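The question posed on the density slide (what proportion lies between 4 and 6?) is an area under the curve. As a minimal sketch, assume the variable is normally distributed with mean 8 and standard deviation 2; both values are made up for illustration, since the lecture does not give them. pnorm() returns the cumulative area to the left of a value, so the difference between two calls is the area between them.

```r
mu    <- 8   # assumed mean (illustrative only)
sigma <- 2   # assumed standard deviation (illustrative only)

# Area under the normal curve between y = 4 and y = 6
p_between <- pnorm(6, mean = mu, sd = sigma) - pnorm(4, mean = mu, sd = sigma)
p_between

# The same area obtained by numerically integrating the density function
p_integrated <- integrate(dnorm, lower = 4, upper = 6, mean = mu, sd = sigma)$value
p_integrated
```

Both routes should agree to several decimal places, which is the sense in which calculus "describes the areas under each part of the curve".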
Getting Started
Circumscribe the population
Collect a sample of observations
Measure one or more variables
Now what?
Key to Success:
Frequency Histogram
- Example -
From a population of oak seedlings in a
tree nursery
Collect a sample of N = 12 observations
The variable of interest is the height of seedlings
(cm)
Record the frequency of occurrences of heights
by category & construct a histogram…
Population
Sample
Frequency Histogram
- Example -
Sample: N = 12 seedlings
Height category (cm, rounded to nearest cm):  2  3  4  5  6  7  8
Frequency count by group:                     1  1  2  4  2  1  1
Frequency Histogram
- Example -
y (height, cm)   f (frequency)
2                1
3                1
4                2
5                4
6                2
7                1
8                1
[Figure: histogram of frequency (0 to 5) against height bin (2 to 8).]
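The seedling example can be reproduced in R as a rough sketch. The twelve heights below are the ones implied by the lecture's frequency counts; the variable names and plot labels are illustrative choices.

```r
# Seedling heights (cm) implied by the lecture's frequency counts
height <- c(2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8)

# Frequency count for each height category
freq <- table(height)
freq

# Histogram with one bin per whole centimetre (breaks at 1.5, 2.5, ..., 8.5)
hist(height, breaks = seq(1.5, 8.5, by = 1),
     xlab = "Height (cm)", ylab = "Frequency", main = "Oak seedlings")
```

Setting the breaks at half-centimetres keeps each rounded height in its own bar, which matches the category-by-category count.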
Frequency Histogram
- Distributions -
Sample distributions may take a
variety of forms, some of which
include:
NORMAL (seedling example)
SKEWED
BIMODAL
[Figure: three example histograms, one for each form.]
We'll discuss these at length later
Plot The Data!
IMPORTANT:
Failure to plot the data
represents the single most
routine error that biometricians
encounter amongst students &
professionals analysing data!
There is no better way to
understand the structure &
distribution of your data.
Describing the Distribution
After visualizing your data (in this case producing
a histogram), you can begin the process of
estimation.
Now, assuming your data are normally distributed
(later we will discuss the ramifications of this and
what to do when this is not the case) you need a
method to describe the center of the distribution:
Central tendency (location of peak)
Measures of Central Tendency
The standard measures of central tendency are:
MODE: the most frequent observation
(from French, la mode, "most fashionable")
MEDIAN: the middle observation when the data are
rank ordered; 50% lie above & 50% below
MEAN: the sum of all members of a sample divided
by the sample size, N.
Averages
CAUTION: The word "average" is NOT a
term rooted in the statistical sciences and
should arguably not be part of your scientific
vocabulary.
The mean, median, and mode are ALL
averages! The word "average" is a
synonym for "central tendency".
The most commonly employed average
in the biological sciences is the arithmetic
mean.
The Symbology of Statistics
$Y$
Each observation is referred to as a variate Y
(or X depending upon source)
$\sum Y$
The Greek letter sigma is used as shorthand to
denote "the sum of"
$\sum_{i=1}^{N} Y_i$
i is an iterator; for N = 10, the variates are Y1, Y2 … Y10
This syntax is read as "the sum of the Yi's from
i = 1 to N"
Mean
We can now define the arithmetic mean using our
new statistical lexicon:
$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$
Which is a lot easier than
saying the mean (y-bar) is
equal to the sum of the
variates, Y, from i = 1 to N
divided by N.
Calculating the Mean
- Turning Symbols into Numbers -
Returning to the data from our frequency histogram:
$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{2+3+4+4+5+5+5+5+6+6+7+8}{12} = \frac{60}{12} = 5$
Example
- Measures of Central Tendency -
Again, using the data from
the histogram example:
Mean = 5
Median = 5
Mode = 5
[Figure: frequency histogram of seedling height by bin (2 to 8).]
This equality of averages
is one characteristic of a
bell-shaped or normal
distribution.
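As a quick check of the seedling values, the mean can be computed directly from its definition (the sum of the variates divided by N) and the median by ranking the observations. This is a sketch; the variable names are illustrative.

```r
height <- c(2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8)
N <- length(height)

# Mean written out as the formula: sum of the variates divided by N
mean_by_formula <- sum(height) / N

# Median by hand: N is even, so average the two middle ranked values
ranked <- sort(height)
median_by_hand <- (ranked[N / 2] + ranked[N / 2 + 1]) / 2

# Compare with the built-in functions
c(mean_by_formula, mean(height))
c(median_by_hand, median(height))
```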
Example 3.1 Glide Snakes (Using R)
> Hertz<-c(0.9,1.4,1.2,1.2,1.3,2.0,1.4,1.6)
> Hertz
[1] 0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6
> hist(Hertz, col="red")
> mean(Hertz)
[1] 1.375
> median(Hertz)
[1] 1.35
> mode(Hertz)  # returns the storage mode of the object, not the statistical mode
[1] "numeric"
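Base R's mode() reports how an object is stored, which is why it prints "numeric" here. To get the most frequent observation, one option is to tabulate the values and pick those with the highest count. The helper name stat_mode below is hypothetical (it is not a base R function), and the sketch returns every tied value.

```r
Hertz <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)

# stat_mode is a hypothetical helper name, not part of base R
stat_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]
  as.numeric(top)                       # value(s) with the highest count
}

stat_mode(Hertz)
```

For these data, 1.2 and 1.4 each occur twice, so the sample has two modes rather than one.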
Types of Means
Arithmetic Mean
(what we have done so far)
Weighted Mean
Geometric Mean
Harmonic Mean
Weighted Mean
$\bar{Y}_w = \frac{\sum_{i=1}^{N} w_i Y_i}{\sum_{i=1}^{N} w_i}$
The larger N is, the more
reliable the mean
becomes as an estimator
of central tendency.
If you have two or more
samples of markedly
different N that you want
to combine and find a
grand mean, you need
to adjust for the Ns using
a weighting factor (w).
Weighted Mean
- Example -
Suppose you are interested in the mean
height of dogwood trees in Ohio's forests.
You go to 3 stands, set out a 500 m² plot in
each area, and record the heights of all
dogwoods present, thus:
Stand   Mean (m)   N
1       3.85       12
2       5.21       25
3       4.70       8
Your initial instinct
might be to take the
mean of the
arithmetic means.
No, don't do it!
Weighted Mean
- Example -
$\bar{Y}_w = \frac{\sum_{i=1}^{N} w_i Y_i}{\sum_{i=1}^{N} w_i} = \frac{(12)(3.85) + (25)(5.21) + (8)(4.70)}{12 + 25 + 8} = 4.76$
Thus, the mean height of Ohio dogwoods is 4.76 m.
Weighted Mean
- Example -
Notes:
1. The result of 4.76 is the same as had you
taken an arithmetic mean approach, but added
all of the original variates together, and divided
by N = 45 (as if all one sample).
2. Had you taken the arithmetic mean of the
three separate means, you would have
obtained an incorrect result of 4.59 (confirm this
for yourself).
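The dogwood calculation can be confirmed in R. The sketch below uses the three stand means and sample sizes from the example, with the sample sizes as weights; the variable names are illustrative. weighted.mean() is part of base R's stats package.

```r
stand_mean <- c(3.85, 5.21, 4.70)  # mean height (m) in stands 1, 2, 3
stand_n    <- c(12, 25, 8)         # number of dogwoods measured per stand

# Weighted mean written out: sum(w * Y) / sum(w), with w = sample size
by_formula <- sum(stand_n * stand_mean) / sum(stand_n)

# Same calculation with the built-in function
built_in <- weighted.mean(stand_mean, w = stand_n)

round(c(by_formula, built_in), 2)

# Unweighted mean of the three means, for comparison
unweighted <- mean(stand_mean)
round(unweighted, 2)
```

The weighted result rounds to 4.76 and the unweighted mean of means to 4.59, matching the two values quoted in the notes.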
Geometric Mean
Suppose you transformed your original variates from a
linear to a log10 scale prior to calculating the mean (we will
discuss why you might wish to do this in a subsequent
lecture).
If you calculated the mean of these transformed values and
then back-transformed the mean to a linear scale, this
value would be different from the arithmetic mean of the
original variates.
The back-transformed mean of a logarithmically
transformed variable is called the geometric mean.
Geometric Mean
$GM_Y = \sqrt[N]{\prod_{i=1}^{N} Y_i}$
The capital pi is read
"the product of", just
like the sigma is read
"the sum of".
The geometric mean is
equal to the Nth root of
the product of the Ys
from i = 1 to N.
Geometric Mean
- Example -
Suppose you had a data set: 2, 3, 3, 4, 15 (N = 5)
and wanted to know the central tendency.
The straight up arithmetic mean of these
observations would be 5.4 (and incorrect).
These values would be better log10 transformed
first, then "averaged". Thus, the data become:
0.301, 0.477, 0.477, 0.602, 1.176. The arithmetic
mean of the logs is 0.6067 (about 0.607), which when
back-transformed (10^0.6067) gives 4.043 (not 5.4!).
Geometric Mean
- Example -
Alternatively:
$\sqrt[5]{(2)(3)(3)(4)(15)} = 4.043$
NB: this is the same result as back-transforming the mean of the logs.
In this case, the geometric mean is the preferred
mean to report rather than the arithmetic mean.
Many people feel uncomfortable that this is not the
logical or best mean to report. I will try to allay
these fears in a subsequent lecture on data
transformations.
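Both routes to the geometric mean can be checked in R with the example data. This is a minimal sketch; the variable names are illustrative.

```r
Y <- c(2, 3, 3, 4, 15)
N <- length(Y)

# Route 1: mean of the log10 values, then back-transform
gm_logs <- 10^mean(log10(Y))

# Route 2: Nth root of the product of the variates
gm_root <- prod(Y)^(1 / N)

# Geometric mean (both routes) alongside the arithmetic mean
round(c(gm_logs, gm_root, mean(Y)), 3)
```

The log route is usually the safer one for long data sets, because the product of many values can overflow or underflow before the root is taken.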
Harmonic Mean
If the transformation of choice is not
log10 but rather the reciprocal (1/Y), the mean of
choice would be the harmonic mean:
$\frac{1}{H_Y} = \frac{\sum_{i=1}^{N} \frac{1}{Y_i}}{N}$
Harmonic Mean
- Example -
Using the same data set as for the GM, the
sum of the reciprocals
divided by N gives 1/HY = 0.297;
therefore,
HY = 1/0.297 = 3.37.
Thus the harmonic mean = 3.37
Recall arithmetic mean = 5.4
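The harmonic mean follows the same transform, average, back-transform pattern, using the reciprocal as the transformation. A short sketch in R with the same data set (variable names are illustrative):

```r
Y <- c(2, 3, 3, 4, 15)

# Mean of the reciprocals, which equals 1 / H
mean_recip <- mean(1 / Y)

# Back-transform (take the reciprocal again) to obtain the harmonic mean
hm <- 1 / mean_recip

round(c(mean_recip, hm), 3)
```

For these data the three means fall in the order harmonic < geometric < arithmetic, which holds for any set of positive values that are not all equal.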