
Name: ____________________________ Date: ________________ Class: ________ Seat: ____________

Using Descriptive Statistics in Biology

Introduction to Descriptive Statistics

Scientists typically collect data on a sample of a population and use these data to draw conclusions or make inferences about the entire population. Descriptive statistics allows you to describe and quantify differences among data sets. Descriptive statistics, such as mean, median, mode, and range, can help to highlight trends or patterns in the data. Each of these statistics is appropriate to certain types of data or distributions, e.g. a mean is not appropriate for data with a skewed distribution. Frequency graphs are useful for indicating the distribution of data. Standard deviation and standard error are statistics used to quantify the amount of spread in the data and evaluate the reliability of estimates of the true (population) mean.

Variation in Data

Whether they are obtained from observation or experiments, most biological data show variability. In a set of data values, it is useful to know the value about which most of the data are grouped: the center value. This value can be the mean, median or mode depending on the type of variable involved. The main purpose of these statistics is to summarize important trends in your data and to provide the basis for statistical analyses.
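The center values and the range named above can be computed with a few lines of Python. This is only a minimal sketch: the `summarize` helper and the sample values are made up for illustration and do not come from this handout. The sample is deliberately skewed by one large value, so the mean is pulled well above the median.

```python
from statistics import mean, median, multimode


def summarize(values):
    """Return the mean, median, mode(s) and range of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns a list, so bimodal data yield two entries
        "mode": multimode(values),
        "range": max(values) - min(values),
    }


# Hypothetical right-skewed sample (illustrative numbers only)
skewed = [2, 3, 3, 4, 5, 6, 40]
summary = summarize(skewed)
print(summary)
```

Here the single value of 40 raises the mean to 9 while the median stays at 4, which is why the median is usually the better description of the center for skewed data.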
Statistic | Definition and Use | Method of Calculation
Mean | The average of all data entries; a measure of central tendency for normally distributed data | Add up all the data entries; divide by the total number of data entries
Median | The middle value when data entries are placed in rank order; a good measure of central tendency for skewed distributions | Arrange the data in increasing rank order; identify the middle value; for an even number of entries, find the midpoint of the two middle values
Mode | The most common data value; suitable for bimodal distributions and qualitative data | Identify the category with the highest number of data entries using a tally chart or a bar graph
Range | The difference between the smallest and largest data values; provides a crude indication of data spread | Identify the smallest and largest values and find the difference between them

Distribution of Data

Variability in continuous data is often displayed as a frequency distribution. A frequency plot will indicate whether the data have a normal distribution (A), with a symmetrical spread of data about the mean, or whether the distribution is skewed (B), or bimodal (C). The shape of the distribution will determine which statistic (mean, median or mode) best describes the central tendency of the sample data.

When to NOT calculate a mean:
a. Do NOT calculate a mean from values that are already means (averages) themselves.
b. Do NOT calculate a mean of ratios (e.g. percentages) for several groups of different sizes; go back to the raw values and recalculate.
c. Do NOT calculate a mean when the measurement scale is not linear (e.g. pH units are not measured on a linear scale).

Measuring Spread

The standard deviation is a frequently used measure of the variability (spread) in a set of data. It is usually presented in the form 𝑥̅ ± 𝑠. If the mean is 10 and the standard deviation is calculated to be 2, then you would show the data as 10 ± 2.
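As a rough illustration of reporting data in the form 𝑥̅ ± 𝑠, the sketch below uses Python's standard `statistics` module. The six data values are the ones used in this handout's worked standard deviation example; the variable names are arbitrary. Note that `statistics.stdev` computes the sample standard deviation, which divides by n − 1.

```python
from statistics import mean, stdev

# Data values from the handout's worked standard deviation example
data = [2, 5, 9, 12, 15, 17]

x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation (divides by n - 1)

# Report in the form "mean ± s", with s rounded to one decimal place
report = f"{x_bar:.0f} ± {s:.1f}"
print(report)
```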
In a normally distributed set of data,
o 68% of all data values will lie within one standard deviation (s) of the mean (𝑥̅)
o 95% of all data values will lie within two standard deviations of the mean.

A large standard deviation indicates that the data have a lot of variability. A small sample standard deviation indicates that the data are clustered close to the sample mean and have less variability.

Adapted from Strode and Brokaw. HHMI Using BioInteractive Resources to Teach Mathematics and Statistics in Biology.

In the example above, the mean height of the bean plants was 103 ± 11.7 mm. What does this tell us? In a data set with a large number of measurements that are normally distributed, 68.3% of the measurements are expected to fall within 1 standard deviation of the mean, and 95.4% of all data points lie within 2 standard deviations of the mean on either side. Thus, in this example, if you assume that this sample of 17 observations is drawn from a population of measurements that are normally distributed, 68.3% of the measurements in the population should fall between 91.3 and 114.7 millimeters (103 ± 11.7) and 95.4% of the measurements should fall between 79.6 and 126.4 millimeters (103 ± 23.4).

We can graph the mean and standard deviation of this sample of bean plants using a bar graph with error bars. Standard deviation bars summarize the variation in the data: the more spread out the individual measurements are, the larger the standard deviation. As sample size increases, the sample standard deviation becomes a more accurate estimate of the standard deviation of the population.

Understanding Degrees of Freedom

Calculations of sample estimates, such as the standard deviation and variance, use degrees of freedom instead of sample size.
The way you calculate degrees of freedom depends on the statistical method you are using, but for calculating the standard deviation, it is defined as 1 less than the sample size (n − 1).

Example: Biologists are interested in variation in leg sizes among grasshoppers. They catch five grasshoppers (n = 5) in a net and prepare to measure the left legs. As the scientists pull grasshoppers one at a time from the net, they have no way of knowing the leg lengths until they measure them all. In other words, all five leg lengths are free to vary within some general range for this particular species. The scientists measure all five leg lengths and then calculate the mean to be 𝑥̅ = 10 mm. They then place the grasshoppers back in the net and decide to pull them out one at a time to measure them again. This time, since the biologists already know the mean to be 10 mm, only the first four measurements are free to vary within a given range. If the first four measurements are 8, 9, 10 and 12 mm, then there is no freedom for the fifth measurement to vary; it has to be 11 mm, because the five values must sum to 5 × 10 = 50 mm. Thus, once they know the sample mean, the number of degrees of freedom is 1 less than the sample size, df = 4.

Two different sets of data can have the same mean and range, yet the distribution of data within the range can be quite different. In both the data sets pictured in the histograms below, 68% of the values lie within the range 𝑥̅ ± 1𝑠 and 95% of the values lie within 𝑥̅ ± 2𝑠. However, in B, the data values are more tightly clustered around the mean.

Calculating Standard Deviation: Set up a table like the one below to easily calculate standard deviation.
Calculating Standard Deviation Example:

Data: 2, 5, 9, 12, 15, 17

Calculate mean: 2 + 5 + 9 + 12 + 15 + 17 = 60; 60/6 = 10

𝑥 | 𝑥 − 𝑥̅ | (𝑥 − 𝑥̅)²
2 | 2 − 10 | (2 − 10)² = 64
5 | 5 − 10 | (5 − 10)² = 25
9 | 9 − 10 | (9 − 10)² = 1
12 | 12 − 10 | (12 − 10)² = 4
15 | 15 − 10 | (15 − 10)² = 25
17 | 17 − 10 | (17 − 10)² = 49
Sum | | 168

Use the value from the table to calculate s:

$s = \sqrt{\frac{168}{6 - 1}} = \sqrt{33.6} \approx 5.8$

𝑥̅ ± 𝑠: 10 ± 5.8

Practice Calculating Descriptive Statistics

1. A survey of the number of spores found on the fronds of a fern plant was conducted. The data are listed below:

Raw data: Number of spores per frond
64 60 64 62 68 66 63 69 70 63 70 70 63 62 71 69 59 70 66 61 70 67 64 63 64

Calculate each of the following. Show work for all!
a. Mean
b. Median
c. Mode
d. Range
e. Standard deviation

Reliability of the Mean or Measures of Confidence

You have already seen how to use the standard deviation (s) to quantify the spread or dispersion in your data. The variance (𝑠²) is another such measure of dispersion, but the standard deviation is usually the preferred of these two measures because it is expressed in the original units. Usually you will also want to know how good an estimate your sample mean (𝑥̅) is of the true population mean (µ). This can be indicated by the standard error of the mean (or just standard error, SE). SE is often used as an error measurement simply because it is small, rather than for any good statistical reason. However, it does allow you to calculate the 95% confidence interval (95% CI).

When we measure a particular attribute from a sample of a larger population and calculate a mean for that attribute, we can calculate how close our sample mean (the statistic) is to the true population mean for that attribute (the parameter).
For example: if we calculated the mean number of carapace spots from a sample of six ladybird beetles, how reliable is this statistic as an indicator of the mean number of carapace spots in the whole population? We can find out by calculating the 95% confidence interval.

Reliability of the Sample Mean: Standard Error of the Mean

When we take measurements from samples of a large population, we are using those samples as indicators of the trends in the whole population. Therefore, when we calculate a sample mean, it is useful to know how close that value is to the true population mean. This is not merely an academic exercise; it will enable you to make inferences about the aspect of the population in which you are interested. For this reason, statistics based on samples and used to estimate population parameters are called inferential statistics.

Example: Assume that there is a population of a species of anole lizards living on an island of the Caribbean. If you were able to measure the length of the hind limbs of every individual in this population and then calculate the mean, you would know the value of the population mean. However, there are thousands of individuals, so you take a sample of 10 anoles and calculate the mean hind limb length for that sample. Another researcher working on that island might catch another sample of 10 anoles and calculate the mean hind limb length for this sample, and so on. The sample means of many different samples would be normally distributed. The standard error of the mean (SEM or 𝑆𝐸𝑥̅) represents the standard deviation of such a distribution and estimates how close the sample mean is to the population mean. The greater each sample size, the more closely the sample mean will estimate the population mean, and therefore the smaller the standard error of the mean becomes.

Calculating Standard Error of the Mean

The standard error is simple to calculate and is usually a small value.
SE is given by:

$SE = \frac{s}{\sqrt{n}}$

where s = standard deviation and n = sample size. The standard error of the mean tells you that about 68% of the sample means would be within ±1 standard error of the population mean and 95% would be within ±2 standard errors.

95% Confidence Interval

Another more precise measure of the uncertainty in the mean is the 95% confidence interval (95% CI). This value is usually written as mean ± 95% CI. A 95% confidence limit tells you that, on average, 95 times out of 100, the limits will contain the true population mean.

Once researchers have developed a hypothesis, designed an experiment, collected data and applied a number of descriptive statistics that summarize the data visually, they can use the standard error as an inferential statistic to describe how confident they are that the sample means represent the true means.

Note about error bars: Many bar graphs include error bars, which may represent standard deviation, SEM or 95% CI. When the bars represent SEM, you know that if you took many samples only about 2/3 of the error bars would include the population mean. This is very different from standard deviation bars, which show how much variation there is among individual observations in a sample. When the error bars represent 95% CI in a graph, you know that in about 95% of the cases the error bars include the population mean. If a graph shows error bars that represent SEM, you can estimate the 95% CI by making the bars twice as big. This is a fairly accurate approximation for large sample sizes, but for small samples the 95% CI are actually more than twice as big as the SEMs.

Example: Seeds of many weed species germinate best in recently disturbed soil that lacks a light-blocking canopy of vegetation. Students in a biology class hypothesized that weed seeds germinate best when exposed to light.
To test this hypothesis, the students placed a seed from crofton weed (Ageratina adenophora, an invasive species on several continents) in each of 20 petri dishes and covered the seeds with distilled water. They placed half the petri dishes in the dark and half in the light. After one week, the students measured the combined lengths in millimeters of the radicles and shoots extending from the seeds in each dish. The table below shows the data.

Given the information in the table above, calculate the following. SHOW YOUR WORK!

1. Standard Deviation
2. Standard Error of the Mean
3. 95% CI
4. Graph the means with the SEM
5. Graph the means with the 95% CI
6. Based on the results shown in the table, do we know the actual mean combined radicle and shoot length of the entire population of crofton plants in the dark? _____ Justify your response.
7. Use the SEM values to explain what the data show for crofton plants.
8. Are the true population means of the light and dark treatments different from one another? ____ Justify your response.
9. Describe the difference between standard error and standard deviation. Include in your discussion the situations when you would use each.

Descriptive Statistics Practice

A student investigated the variation in the length of bivalve shells at two locations on a rocky shore. Show ALL WORK!!!
State the Explanatory Hypothesis:

Data Collected: Shell Length in mm

Group A | Group B
46 | 23
50 | 28
45 | 41
45 | 31
63 | 26
57 | 33
65 | 35
73 | 21
55 | 38
79 | 30
62 | 36
59 | 38
71 | 45
68 | 28
77 | 42

Complete the table below:

Group | Mean | Median | Mode | Range | SE | 95% CI
Group A | | | | | |
Group B | | | | | |

Based on the statistics calculated above, what can you conclude? What do these data and statistics tell us about the two sets of bivalves and their environment?
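For checking hand calculations, the chain from raw data to standard deviation, standard error and an approximate 95% CI can be sketched in Python. This is only an illustration under stated assumptions: it reuses the six values from the handout's worked standard deviation example rather than any practice data, and the t value of 2.571 (two-tailed, 95%, df = 5) is taken from a standard t table, which this handout does not itself provide.

```python
import math

# Values from the handout's worked standard deviation example
data = [2, 5, 9, 12, 15, 17]

n = len(data)
x_bar = sum(data) / n                          # mean = 60 / 6
sum_sq = sum((x - x_bar) ** 2 for x in data)   # sum of squared deviations
s = math.sqrt(sum_sq / (n - 1))                # divide by degrees of freedom
se = s / math.sqrt(n)                          # standard error of the mean

# Rule of thumb from the text: 95% CI is roughly twice the SE
ci_approx = 2 * se

# Assumption: t value for df = 5 from a standard t table (not in the handout)
T_CRIT_DF5 = 2.571
ci_t = T_CRIT_DF5 * se

print(f"mean = {x_bar:.1f}, s = {s:.2f}, SE = {se:.2f}")
print(f"95% CI half-width: about {ci_approx:.2f} (2 x SE), {ci_t:.2f} (t table)")
```

With only six observations the t-based interval comes out wider than twice the SE, which matches the note that for small samples the 95% CI is more than twice as big as the SEM.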
