Name: ____________________________ Date: ________________ Class: ________ Seat: ____________
Using Descriptive Statistics in Biology
Introduction to Descriptive Statistics
Scientists typically collect data on a sample of a population
and use these data to draw conclusions or make inferences
about the entire population. Descriptive statistics allows you
to describe and quantify differences among data sets.
Descriptive statistics, such as mean, median, mode, and
range can help to highlight trends or patterns in the data.
Each of these statistics is appropriate to certain types of
data or distributions, e.g. a mean is not appropriate for data
with a skewed distribution. Frequency graphs are useful for
indicating the distribution of data. Standard deviation and
standard error are statistics used to quantify the amount of
spread in the data and evaluate the reliability of estimates of
the true (population) mean.
Variation in Data
Whether they are obtained from observation or experiments,
most biological data show variability. In a set of data values, it
is useful to know the value about which most of the data are
grouped: the center value. This value can be the mean,
median or mode, depending on the type of variable involved.
The main purpose of these statistics is to summarize important trends in your data and to provide the basis for
statistical analyses.
Statistic: Mean
Definition and use: The average of all data entries. A measure of central tendency for normally distributed data.
Method of calculation: Add up all the data entries. Divide by the total number of data entries.

Statistic: Median
Definition and use: The middle value when data entries are placed in rank order. A good measure of central tendency for skewed distributions.
Method of calculation: Arrange the data in increasing rank order. Identify the middle value. For an even number of entries, find the midpoint of the two middle values.

Statistic: Mode
Definition and use: The most common data value. Suitable for bimodal distributions and qualitative data.
Method of calculation: Identify the category with the highest number of data entries using a tally chart or a bar graph.

Statistic: Range
Definition and use: The difference between the smallest and largest data values. Provides a crude indication of data spread.
Method of calculation: Identify the smallest and largest values and find the difference between them.
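The four calculations summarized above can be sketched in a few lines of Python using the standard library's statistics module. The data set here is hypothetical (it does not come from the worksheet) and was chosen only so that each result is easy to check by hand.

```python
import statistics

# Hypothetical data set (not from the worksheet), already in rank order.
data = [3, 5, 5, 6, 8, 9, 13]

mean = sum(data) / len(data)        # add up all entries, divide by the number of entries
median = statistics.median(data)    # middle value once the data are in rank order
mode = statistics.mode(data)        # most common value
data_range = max(data) - min(data)  # largest value minus smallest value

print(mean, median, mode, data_range)
```

For an even number of entries, statistics.median returns the midpoint of the two middle values, which matches the method described for the median.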
Distribution of Data
Variability in continuous data is often displayed as a frequency
distribution. A frequency plot will indicate whether the data have a
normal distribution (A), with a symmetrical spread of data about the
mean, or whether the distribution is skewed (B), or bimodal (C). The
shape of the distribution will determine which statistic (mean, median
or mode) best describes the central tendency of the sample data.
When to NOT calculate a mean:
a. Do NOT calculate a mean from values that are already means (averages) themselves.
b. Do NOT calculate a mean of ratios (e.g. percentages) for several groups of different sizes; go back to the raw values and recalculate.
c. Do NOT calculate a mean when the measurement scale is not linear (e.g. pH units are not measured on a linear scale).
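A short sketch may help show why rules b and c matter. All numbers below are hypothetical. For the pH case, averaging the hydrogen ion concentrations and converting back is one common alternative, shown here as an assumption rather than as the worksheet's prescribed method.

```python
import math

# Hypothetical germination counts: (seeds germinated, seeds sown) per group.
groups = [(9, 10), (50, 100)]

# Misleading: average the two percentages as if the groups were the same size.
mean_of_percentages = sum(100 * g / n for g, n in groups) / len(groups)

# Better: go back to the raw counts and recalculate one overall percentage.
total_germinated = sum(g for g, _ in groups)
total_sown = sum(n for _, n in groups)
pooled_percentage = 100 * total_germinated / total_sown

# pH is a logarithmic scale, so the plain mean of pH values is misleading.
ph_values = [3.0, 5.0]  # hypothetical readings
naive_mean_ph = sum(ph_values) / len(ph_values)
mean_concentration = sum(10 ** -ph for ph in ph_values) / len(ph_values)
ph_of_mean_concentration = -math.log10(mean_concentration)

print(f"mean of percentages: {mean_of_percentages:.1f}%")
print(f"pooled percentage:   {pooled_percentage:.1f}%")
print(f"naive mean pH: {naive_mean_ph:.2f}, pH of mean concentration: {ph_of_mean_concentration:.2f}")
```

The small group pulls the mean of percentages well above the pooled value, and the naive mean pH understates how acidic the combined samples are.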
Measuring Spread
The standard deviation is a frequently used measure of the variability (spread) in a set of data.
It is usually presented in the form 𝑥̅ ± 𝑠. If the mean is 10 and the standard deviation is calculated to be 2, then you
would show the data as 10 ± 2.
In a normally distributed set of data,
o about 68% of all data values will lie within one
standard deviation (s) of the mean (𝑥̅)
o about 95% of all data values will lie within two
standard deviations of the mean.
A large standard deviation indicates that the
data have a lot of variability.
A small sample standard deviation indicates
that the data are clustered close to the sample
mean and have less variability.
Adapted from
Strode and Brokaw. HHMI Using Biointeractive Resources to Teach mathematics and Statistics in Biology.
In the example above, the mean height of the bean plants was 103 ± 11.7 mm. What does this tell us? In a data
set with a large number of measurements that are normally distributed, 68.3% of the measurements are
expected to fall within 1 standard deviation of the mean and 95.4% of all data points lie within 2 standard
deviations of the mean on either side. Thus, in this example, if you assume that this sample of 17 observations is
drawn from a population of measurements that are normally distributed, 68.3% of the measurements in the population
should fall between 91.3 and 114.7 millimeters and 95.4% of the measurements should fall between 79.6 and 126.4
millimeters.
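The ranges in the bean plant example are simply the mean plus or minus a multiple of the standard deviation. The sketch below recomputes them. The helper name interval is an illustrative choice, and the 1.96 multiplier (the value that corresponds to exactly 95% of a normal distribution) is included for comparison.

```python
mean_height = 103.0  # mm, sample mean from the bean plant example
s = 11.7             # mm, sample standard deviation from the same example

def interval(mean, sd, k):
    """Return (low, high) for mean ± k standard deviations, rounded to 0.1 mm."""
    return (round(mean - k * sd, 1), round(mean + k * sd, 1))

one_sd = interval(mean_height, s, 1)      # about 68.3% of a normal population
two_sd = interval(mean_height, s, 2)      # about 95.4% of a normal population
exact_95 = interval(mean_height, s, 1.96) # exactly 95% of a normal population

print(one_sd, two_sd, exact_95)
```

Two standard deviations give 79.6 to 126.4 mm, while the 1.96 multiplier gives the slightly narrower 80.1 to 125.9 mm.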
We can graph the mean and standard deviation of this sample of bean plants using a bar graph with error bars.
Standard deviation bars summarize the variation in the data—the more spread out the individual measurements
are, the larger the standard deviation. As sample size increases, standard deviation will become a more accurate
estimate of the standard deviation of the population.
Understanding Degrees of Freedom
Calculations of sample estimates, such as the standard deviation and variance, use degrees of freedom instead of
sample size. The way you calculate degrees of freedom depends on the statistical method you are using, but for
calculating the standard deviation, it is defined as 1 less than the sample size (n-1).
Example: Biologists are interested in variation in leg sizes among grasshoppers. They catch five grasshoppers (n=5)
in a net and prepare to measure the left legs. As the scientists pull grasshoppers one at a time from the net, they have
no way of knowing the leg lengths until they measure them all. In other words, all five leg lengths are free to vary
within some general range for this particular species. The scientists measure all five leg lengths and then calculate the
mean to be 𝑥̅ = 10 mm. They then place the grasshoppers back in the net and decide to pull them out one at a time to
measure them again. This time, since the biologists already know the mean to be 10, only the first four measurements
are free to vary within a given range. If the first four measurements are 8, 9, 10 and 12 mm, then there is no freedom
for the fifth measurement to vary; it has to be 11. Thus, once they know the sample mean, the number of degrees of
freedom is 1 less than the sample size, df = 4.
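The grasshopper example can be checked with a few lines of arithmetic: once the mean and the first four values are fixed, the fifth value is forced. This is a minimal sketch using the numbers from the example; the variable names are illustrative.

```python
n = 5                        # number of grasshoppers
sample_mean = 10             # mm, already known from the first round of measuring
first_four = [8, 9, 10, 12]  # mm, the measurements that were free to vary

# The five values must add up to n * mean, so the last one has no freedom left.
fifth = n * sample_mean - sum(first_four)
degrees_of_freedom = n - 1

print(fifth, degrees_of_freedom)
```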
Two different sets of data can have the same mean and range, yet the distribution of data within the range can
be quite different.
In both the data sets pictured in the histograms below, 68% of the values lie within the range 𝑥̅ ± 1𝑠 and 95% of the
values lie within 𝑥̅ ± 2𝑠. However, in B, the data values are more tightly clustered around the mean.
Calculating Standard Deviation:
Set up a table like the one below to easily calculate standard deviation.
Calculating Standard Deviation Example:
Data: 2, 5, 9, 12, 15, 17
Calculate mean: 2 + 5 + 9 + 12 + 15 + 17 = 60; 60/6 = 10

𝑥       𝑥 − 𝑥̅       (𝑥 − 𝑥̅)²
2       2 − 10      (2 − 10)² = 64
5       5 − 10      (5 − 10)² = 25
9       9 − 10      (9 − 10)² = 1
12      12 − 10     (12 − 10)² = 4
15      15 − 10     (15 − 10)² = 25
17      17 − 10     (17 − 10)² = 49
Sum                 168

Use value from table to calculate s:
𝑠 = √(168 / (6 − 1)) = √33.6 = 5.8
𝑥̅ ± 𝑠 = 10 ± 5.8
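The table method for this example can be sketched in Python as follows. It follows the same steps (mean, squared deviations, sum, divide by n − 1, square root) and then compares the result with statistics.stdev from the standard library, which also divides by n − 1.

```python
import math
import statistics

data = [2, 5, 9, 12, 15, 17]
n = len(data)

mean = sum(data) / n                                  # 60 / 6
squared_deviations = [(x - mean) ** 2 for x in data]  # the (x - mean)^2 column
sum_of_squares = sum(squared_deviations)              # bottom row of the table
s = math.sqrt(sum_of_squares / (n - 1))               # divide by degrees of freedom

# Cross-check against the library's sample standard deviation.
matches_library = math.isclose(s, statistics.stdev(data))

print(f"{mean:.0f} ± {s:.1f}")  # prints "10 ± 5.8"
```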
Practice Calculating Descriptive Statistics
1. A survey of the number of spores found on the fronds of a fern plant was conducted. The data are listed below:
Raw data: Number of spores per frond
64 60 64 62 68
66 63 69 70 63
70 70 63 62 71
69 59 70 66 61
70 67 64 63 64
Calculate each of the following—show work for all!
a. Mean
b. Median
c. Mode
d. Range
e. Standard deviation
Reliability of the Mean or Measures of Confidence
You have already seen how to use the standard deviation (s) to quantify the spread or dispersion in your data. The
variance (𝑠 2 ) is another such measure of dispersion, but the standard deviation is usually the preferred of these two
measures because it is expressed in the original units. Usually you will also want to know how good an estimate your
sample mean (𝑥̅) is of the true population mean (µ). This can be indicated by the standard error of the mean
(or just standard error—SE). SE is often used as an error measurement simply because it is small, rather than for any
good statistical reason. However, it does allow you to calculate the 95% confidence interval (95% CI).
When we measure a particular attribute from a sample of a larger population and calculate a mean for that attribute,
we can estimate how close our sample mean (the statistic) is to the true population mean for that attribute (the
parameter). For example: if we calculated the mean number of carapace spots from a sample of six ladybird beetles,
how reliable is this statistic as an indicator of the mean number of carapace spots in the whole population? We can
find out by calculating the 95% confidence interval.
Reliability of the Sample Mean—Standard Error of the Mean
When we take measurements from samples of a large population, we are using those samples as indicators of the
trends in the whole population. Therefore, when we calculate a sample mean, it is useful to know how close that
value is to the true population mean. This is not merely an academic exercise; it will enable you to make inferences
about the aspect of the population in which you are interested. For this reason, statistics based on samples and used
to estimate population parameters are called inferential statistics.
Example: Assume that there is a population of a species of anole lizards living on an island of the Caribbean. If you
were able to measure the length of the hind limbs of every individual in this population and then calculate the mean,
you would know the value of the population mean. However, there are thousands of individuals, so you take a
sample of 10 anoles and calculate the mean hind limb length for that sample. Another researcher working on that
island might catch another sample of 10 anoles and calculate the mean hind limb length for this sample and so on.
The sample means of many different samples would be approximately normally distributed. The standard error of the mean
(SEM or 𝑆𝐸𝑥̅) represents the standard deviation of such a distribution and estimates how close the sample mean is to
the population mean. The greater each sample size, the more closely the sample mean will estimate the population
mean and therefore the standard error of the mean becomes smaller.
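The idea that many researchers each catch their own sample of 10 anoles can be imitated with a small simulation. Everything numeric here is an assumption for illustration: the population mean and standard deviation are hypothetical, and the random seed is fixed only so that the run is repeatable.

```python
import math
import random
import statistics

rng = random.Random(0)        # fixed seed so the run is repeatable
pop_mean, pop_sd = 50.0, 5.0  # hypothetical hind limb lengths (mm)
n = 10                        # anoles per sample
num_samples = 2000            # number of researchers, each taking one sample

# Each researcher measures n lizards and records the sample mean.
sample_means = [
    statistics.mean([rng.gauss(pop_mean, pop_sd) for _ in range(n)])
    for _ in range(num_samples)
]

sd_of_sample_means = statistics.stdev(sample_means)
expected = pop_sd / math.sqrt(n)  # population SD divided by the square root of n

print(f"SD of sample means: {sd_of_sample_means:.2f}, expected about {expected:.2f}")
```

Increasing n in this sketch shrinks the spread of the sample means, which is the behavior the paragraph above describes.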
Calculating Standard Error of the Mean
The standard error is simple to calculate and is usually a small value. SE is given by:
𝑺𝑬 = 𝒔 / √𝒏
where s = standard deviation and n = sample size.
The standard error of the mean tells you that about 68% of the sample means
would be within ±1 standard error of the population mean and 95% would be
within ±2 standard errors.
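As a minimal sketch, the standard error formula can be applied to the six values used earlier in the standard deviation example (2, 5, 9, 12, 15, 17). Reusing that data set here is an illustrative choice.

```python
import math
import statistics

data = [2, 5, 9, 12, 15, 17]

s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
n = len(data)               # sample size
se = s / math.sqrt(n)       # standard error of the mean

print(f"s = {s:.1f}, n = {n}, SE = {se:.2f}")  # prints "s = 5.8, n = 6, SE = 2.37"
```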
95% Confidence Interval
Another more precise measure of the uncertainty in the mean is the 95%
confidence interval (95%CI). This value is usually written as mean ± 95%CI. A
95% confidence limit tells you that, on average, 95 times out of 100, the limits
will contain the true population mean.
Once researchers have developed a hypothesis, designed an experiment,
collected data and applied a number of descriptive statistics that summarize the
data visually, they can use the standard error to describe how confident they are
that the sample means represent the true population means.
Note about error bars: Many bar graphs include error bars, which may represent standard deviation, SEM or 95%
CI. When the bars represent SEM, you know that if you took many samples only about 2/3 of the error bars would
include the population mean. This is very different from standard deviation bars which show how much variation
there is among individual observations in a sample. When the error bars represent 95% CI in a graph, you know that
in about 95% of the cases the error bars include the population mean. If a graph shows error bars that represent SEM,
you can estimate the 95% CI by making the bars twice as big—this is a fairly accurate approximation for large sample
sizes, but for small samples the 95% CI are actually more than twice as big as the SEMs.
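The worksheet does not give a formula for the 95% CI, so the sketch below assumes the usual small-sample approach: multiply the SEM by the two-tailed 95% critical value of Student's t for n − 1 degrees of freedom. The critical values are copied from a standard t table, only a few are listed, and the data set is the one from the standard deviation example.

```python
import math
import statistics

# Two-tailed 95% critical values of Student's t from a standard table,
# keyed by degrees of freedom (n - 1). Only a few entries are listed.
T_CRIT_95 = {5: 2.571, 9: 2.262, 14: 2.145}

def ci95_half_width(values):
    """Half-width of the 95% CI: t critical value times the SEM (assumed method)."""
    n = len(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return T_CRIT_95[n - 1] * sem

data = [2, 5, 9, 12, 15, 17]
sem = statistics.stdev(data) / math.sqrt(len(data))
half_width = ci95_half_width(data)  # t-based 95% CI half-width
rough = 2 * sem                     # "twice the SEM" approximation

print(f"mean ± 95% CI: {statistics.mean(data):.0f} ± {half_width:.1f}")
print(f"2 x SEM approximation: ± {rough:.1f}")
```

With only six values, the t-based interval (about ±6.1) is noticeably wider than twice the SEM (about ±4.7), which is the small-sample effect the note on error bars describes.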
Example:
Seeds of many weed species germinate best in recently disturbed soil that lacks a light-blocking canopy of vegetation.
Students in a biology class hypothesized that weed seeds germinate best when exposed to light. To test this
hypothesis, the students placed a seed from crofton weed (Ageratina adenophora, an invasive species on several
continents) in each of 20 petri dishes and covered the seeds with distilled water. They placed half the petri dishes in
the dark and half in the light. After one week, the students measured the combined lengths in millimeters of the
radicles and shoots extending from the seeds in each dish. The table below shows the data.
Given the information in the table above, calculate the following—SHOW YOUR WORK!
1. Standard Deviation
2. Standard Error of the Mean
3. 95% CI
4. Graph the means with the SEM
5. Graph the means with the 95% CI
6. Based on the results shown in the table, do we know the actual mean combined radicle and shoot
length of the entire population of crofton plants in the dark? _____ Justify your response.
7. Use the SEM values to explain what the data show for crofton plants.
8. Are the true population means of the light and dark treatments different from one another? ____
Justify your response.
9. Describe the difference between standard error and standard deviation—include in your discussion
the situations when you would use each.
Descriptive Statistics Practice
A student investigated the variation in the length of bivalve shells at two
locations on a rocky shore.
Show ALL WORK!!!
State the Explanatory Hypothesis:
Data Collected
Shell Length in mm
Group A    Group B
46         23
50         28
45         41
45         31
63         26
57         33
65         35
73         21
55         38
79         30
62         36
59         38
71         45
68         28
77         42
Complete the table below:
           Group A    Group B
Mean
Median
Mode
Range
SE
95% CI
Based on the statistics calculated above, what can you conclude?
What do these data and statistics tell us about the two sets of bivalves and their environment?