The Data Collection and Statistical Analysis in IB Biology
Part II – Basic Stats, Standard Deviation and Variability
John Gasparini
The Munich International School

Remember our two species of butterflies?

Spot Celled Sister (Adelpha basiloides): http://www.zipcodezoo.com/hp350/Adelpha_basiloides_0.jpg
Smooth-Banded Sister (Adelpha cytherea): http://4.bp.blogspot.com/-M8r6KZeMas/TWWe7aOgiWI/AAAAAAAAAP8/ck9cmUdqfas/s1600/Adelpha_cytherea_ButterflyPhotography-BB_Blogspot_JGJ.jpg

Research Question
"Is there a significant difference in proboscis length and body mass between A. basiloides and A. cytherea?"

These are closely related species from the Nymphalidae family. Both are found in the tropics of Central America, and both feed on the nectar of flowers.

Imagine that you have collected data on the proboscis length and body mass of our two butterfly species. Record it properly. You must be neat to reduce problems later!

• Give the raw data tables proper titles.
• Include uncertainties!
• Be consistent in your number of decimal places.
• Don't record more decimal places than the sensitivity limits of your instrument allow.

What is the number of butterflies sampled for each species? n = 15
What is the total number of butterflies sampled? Total sampled in both species = 30

Now that we have recorded our raw data in an organized fashion, it is time to calculate some basic statistics for our datasets…

We'll start with three "Measurements of Central Tendency." Each is a summary score that tries in some way to represent a set of scores.
Each one is a single score, generated from a dataset, that is in some way typical of the distribution of scores.

1) Mode
2) Median
3) Mean (average)

"Measurements of Central Tendency" is a fancy name. Don't get caught up in it. These are easy stats and you know most of them already.

Mode: This is the score or value that occurs most frequently in a dataset.
What is the mode of this dataset? Answer: 23.5
Why? The value 23.5 occurs the most in the dataset, twice to be exact. Not very complicated…

Datasets can be amodal, monomodal, bimodal and multimodal. (You should be able to figure out what these terms mean.)
Note: this dataset is different from the one before, which was monomodal. Which of these terms would best describe the dataset to the left? Answer: amodal, as there is no repeating value.

Median: This is the middle point of the scores in a dataset. 50% of the scores are above the median, and 50% are below it. The median is a point, and it does not have to be an actual score in that distribution.
What is the median of this dataset? Answer: 23.2
Think about what the median would be for a dataset with an even number of samples, e.g. the median value of the dataset 10, 7, 8 and 6? Answer: 7.5, the mean of the two middle values (7 and 8) once the data are sorted.

Mean: This is the average value of the dataset, and all of you should be able to calculate this easily…

OK. So all of this is made terribly easy if you learn to use Excel properly. Follow the link below and watch the podcast on how to use Excel to calculate modes, medians and means within a spreadsheet. You need to master these skills.
http://www.youtube.com/watch?v=ziQcGGBvH00&feature=youtu.be

Now what we need to do is graph the data in Excel. This, too, is fairly easy. View the podcast below to see how this is done.
http://youtu.be/-WsEgIbfbug

Do not forget all of the rules that you have learned over the years on what is expected in terms of graphical presentation of data! (Remember these from 6th grade?)

For Graphs…
• Be neat, and make the graph large enough to be easily read.
• Use a pencil and a ruler, if constructing the graph by hand.
• Each axis should have a LABEL and the UNITS of measurement.
• The independent variable should be on the X-axis, and the dependent variable should be on the Y-axis.
• Scale the axes properly so that the data is effectively displayed.
• Use the appropriate type of graph: line graph, scatter plot, bar graph, etc.
• Data points should be properly positioned relative to the axes scales.

Using Excel, we've generated the graph shown below... Now what does it tell us? How would you analyze these results? What conclusions would you draw in viewing this graph?

What it tells us is that A. cytherea has a higher mean proboscis length than A. basiloides. But this is only part of the picture and is a 9th and 10th grade analysis of the datasets.

We need to go further in our statistical analysis because mean values are not always accurate representative scores! Why? Well… because the mean is a measure of the central tendency of the dataset, but it tells us NOTHING, NOTHING! about the spread of the data.
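For those who prefer code to a spreadsheet, the three measures of central tendency can also be sketched in Python with the standard library. This is only an illustration, not the method from the podcasts. The list `even_example` is the 10, 7, 8, 6 dataset from the median slide; `tight` and `spread` are hypothetical values invented here (not butterfly measurements) to show two datasets with the same mean but very different spread. The helper name `describe_modes` is also an assumption of this sketch, and it expects a non-empty list.

```python
from collections import Counter
from statistics import mean, median

def describe_modes(data):
    """Classify a non-empty dataset as amodal, monomodal, bimodal or multimodal."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        # No value repeats, so there is no mode (amodal).
        return "amodal", []
    modes = sorted(value for value, count in counts.items() if count == top)
    labels = {1: "monomodal", 2: "bimodal"}
    return labels.get(len(modes), "multimodal"), modes

# Dataset from the median slide (even number of samples).
even_example = [10, 7, 8, 6]
print(median(even_example))          # mean of the two middle values, 7 and 8
print(describe_modes(even_example))  # no repeating value

# Hypothetical datasets: same mean, different spread.
tight = [49, 50, 50, 51]
spread = [20, 50, 50, 80]
print(mean(tight), mean(spread))
print(describe_modes(tight))
```

Both hypothetical datasets report the same mean, which is exactly why the mean alone cannot describe how spread out the data are.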
The data points that we are analyzing could be tightly clustered around the mean, or they could have high variability.

Range is a simple and easy-to-compute measure of variability in a dataset:
(Max sample value - Min sample value) = RANGE

What is the RANGE of this small dataset?
54, 56, 67, 72, 19, 52, 56, 56, 66, 68, 57, 58, 63
(72 - 19) = 53 = RANGE

This large range value suggests that there is a great deal of variability in our dataset, but here we can see that RANGE is also limited, in that it tells us nothing about the variability within the distribution.

When we plot the dataset on a simple number line, we can see the flaw in relying just on the MEAN and RANGE values as measurements of central tendency and variability:
56, 67, 72, 11, 56, 56, 66, 19, 68, 57, 58, 63
The mean (X̄) of this dataset = 54.1

[Number line showing the values from 11 to 72, with the mean (X̄) marked]

The vast majority of values are clustered around the upper end of the distribution (56 to 72). The mean is not in the middle of this cluster, as it has been affected by the outliers, 11 and 19. This dataset has a skewed distribution!

Standard deviation (SD) measures how widely the data are spread around the mean. For normally distributed data, about 68% of the values fall within +/- 1 s.d. of the mean. The greater the SD value, the greater the variability!

How do you calculate the standard deviation of a dataset? We are going to leave the mathematics behind this measure of variability to your math teachers, but you have to be able to calculate S.D. values in Excel. Follow the link to a podcast tutorial on using Excel to calculate standard deviation: http://youtu.be/90YWFllx1EA

Error bars are a graphical representation of the variability of data. Error bars can be used to represent range, standard deviation or other measures of variability. In IB Biology, STANDARD DEVIATION ERROR BARS will be most useful.
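The range, mean and standard deviation calculations above can be checked with a short Python sketch, using the two small datasets from the slides. One assumption to flag: the sketch uses the sample standard deviation (n - 1 denominator) via `statistics.stdev`; the slides do not say which variant the Excel tutorial uses, so treat that choice as illustrative. The function names `data_range` and `sd_error_bar` are made up for this example.

```python
from statistics import mean, stdev

def data_range(values):
    """Range = max sample value - min sample value."""
    return max(values) - min(values)

def sd_error_bar(values):
    """Return the (lower, upper) limits of a mean +/- 1 SD error bar."""
    centre = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator): an assumption of this sketch
    return centre - sd, centre + sd

# Datasets taken from the slides.
range_data = [54, 56, 67, 72, 19, 52, 56, 56, 66, 68, 57, 58, 63]
skewed_data = [56, 67, 72, 11, 56, 56, 66, 19, 68, 57, 58, 63]

print(data_range(range_data))       # 72 - 19, as on the slide
print(round(mean(skewed_data), 1))  # the mean quoted on the slide
print(sd_error_bar(skewed_data))    # limits an SD error bar would span
```

The error-bar limits returned by `sd_error_bar` are the values a standard deviation error bar would cover above and below the mean on a bar graph.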
SET A: the bar (mean) for A is higher than for B.
SET B: the S.D. error bar is longer for B than for A.

How do you put standard deviation error bars on the graphs that you generate? Follow the link to a podcast tutorial on putting error bars on graphs in Excel: http://youtu.be/oV0vbQlp9AI

What do error bars tell us? The overlap of error bars gives us a clue as to the significance of the results!

Overlap! LOTS OF OVERLAP = LOTS OF SHARED DATA. Results are NOT LIKELY TO BE SIGNIFICANTLY DIFFERENT! The difference between means is most likely due to chance.

No overlap: NO OVERLAP = VERY LITTLE SHARED DATA. Results ARE LIKELY TO BE SIGNIFICANTLY DIFFERENT! The difference between means is most likely to be REAL.

Answers: a. SET B  b. SET B  c. SET A  d. SET B  e. SET A

Let's look back at our original data and try to answer the first half of our research question.
"Is there a significant difference in proboscis length between A. basiloides and A. cytherea?"
Now, given your knowledge about what the standard deviation of a dataset represents, what should your conclusion be in regards to the proboscis lengths of A. cytherea and A. basiloides?

Lots of overlap in SD error bars! NO! The two datasets contain too much shared data to conclusively state that a significant difference exists between the proboscis lengths of these butterflies.

But what about when we look at the mean body mass values for the two species? There is some overlap. This one is hard to call.
We need another statistical test to tell us if there is a difference in these data sets. Something more refined…
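Before moving on to that more refined test, the error-bar comparison used above can be sketched in Python. This is a rough version of the visual check only, not a significance test. The three groups are hypothetical numbers invented for illustration (they are not the butterfly data), the function names are assumptions of this sketch, and the sample standard deviation is assumed, as the slides do not specify a variant.

```python
from statistics import mean, stdev

def sd_interval(values):
    """Return (lower, upper) for a mean +/- 1 SD error bar (sample SD assumed)."""
    centre = mean(values)
    sd = stdev(values)
    return centre - sd, centre + sd

def bars_overlap(group_1, group_2):
    """True if the mean +/- 1 SD error bars of the two groups overlap."""
    low_1, high_1 = sd_interval(group_1)
    low_2, high_2 = sd_interval(group_2)
    # Two intervals overlap when each one starts before the other ends.
    return low_1 <= high_2 and low_2 <= high_1

# Hypothetical measurements, for illustration only.
group_a = [10, 12, 11, 13, 14]
group_b = [11, 13, 12, 14, 15]  # close to group_a
group_c = [30, 31, 29, 32, 28]  # far from group_a

print(bars_overlap(group_a, group_b))  # overlap: difference may be due to chance
print(bars_overlap(group_a, group_c))  # no overlap: difference is likely real
```

A result of overlap only tells us the call is uncertain; cases with partial overlap, like the body mass data, are where a further statistical test is needed.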