Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
LESSON 4.4 WORKBOOK Seeing through the static — How do we identify correlations in data? Number of Shark A4acks In this lesson we will learn more about some of the statistics that are used in scientific research. We will see how researchers sort through scientific data in search of potential factors that link diet to health outcomes. We will explore how primary data addresses questions like ‘what is a significant correlation?’ and ‘How might this data impact our ideas about factors that contribute to or protect against heart disease?’ Wo r k b o o k Lesson 4.4 Correlation does not mean causation! What is the difference between two variables that are correlated, and a cause and effect relationship? Ice Cream Sales Figure 1: Just because ice cream sales and shark attacks are correlated, it does not mean that ice cream sales cause shark attacks! Have you ever heard people declare that every time they wash their car, it rains? While this may seem to be true, it certainly is not likely that washing a car is what is causing rain to fall. There may be a correlation between timing of the car wash and the weather, but there is no causal relationship. This delineation may sound simplistic, but the confusion between correlation and causation can puzzle even the most well versed scientists. ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ 166 LESSON READINGS A correlation is simply an association between two factors. On the other hand, a causal relationship means that one factor depends on the other. As we saw in the last lesson, the type of study determines whether the results can show correlative or causative relationships; observational studies can only provide correlative data, whereas interventional studies can provide causative data. The type of study determines the type of results: The chicken definitely came first, I think? Observational studies can only provide correlative data, whereas interventional studies can provide causative data. Because many human studies in nutrition research are observational, they can only show correlations between a nutrient and a health outcome. Additionally, many causative relationships have only been demonstrated by an animal model or cell culture study. So how can we make educated decisions about what the results mean for humans trying to live a healthy life? In order to prove that one variable (the independent variable) causes another (the dependent variable), the timing has to be right. This means that the change in the dependent variable must follow the change in the independent variable. If your independent variable was a meal, for example, and the dependent variable was blood glucose concentrations, you would see that after someone consumes a meal their blood glucose concentrations will rise. Correlative data often gets misinterpreted as causative data Wo r k b o o k Lesson 4.4 Figure 2: Eating food coloring is correlated with hyperactivity but may not cause it. Some nutrients and foods get a reputation for being 'bad', while others are known as 'good'. Most of these assumptions are based on correlative data. There are many instances where consumption of a nutrient or food correlates with some health outcome. For instance, increased consumption of some artificial food colorings correlates with hyperactivity in children, but the mechanism that would explain how the food coloring is causing hyperactivity has not been established. Even so, the belief that the food colorings are the 'cause' for hyperactivity is strong enough to make parents avoid the chemicals, and has led the FDA to review the safety of food dyes. When reading a news article describing study results, ask yourself whether the results are truly demonstrating a cause and effect, or whether there might be alternative explanations for the results. ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ 167 LESSON READINGS Statistics are used to measure correlation, not causation Strong results come from good sample selection DEFINITIONS OF TERMS Mean — The mathematical average of all samples in a data set. Median — The numerical value that separates the higher half of data from the lower half. Population — An entire collection of people or animals of interest from which data is collected. Representative sample — A subset of the greater population of interest that accurately reflects the members of the entire population. Standard deviation (STD) — Indicates the variability or deviation for an experimental group. For a complete list of defined terms, see the Glossary. Wo r k b o o k Lesson 4.4 If we wanted to learn what the common blood glucose concentration is after eating a specific food, we could not possibly measure every single person’s blood glucose. How then do we gather generalizable research findings? One way to do this is by selecting a Figure 3: You can't measure representative sample. This sample should include a everyone so you need to select a random selection of people that represent the greater sample group to study. population. We use samples as estimates of the whole population, but we will never be able to know the results of an entire population without making the measurement on everybody. It is common to find different results when using different representative samples simply because of the variability among human participants. If the representative sample chosen was truly characteristic of the greater population, it should provide a reliable estimate of what results we can expect in the population. Let's say you pick a perfect sample group, how do you know if your findings are real rather than random? If a person flips a coin 100 times and gets 85 heads, what are the odds that that would happen? An important concept is that statistics can be used to measure the probability that your findings are sheer dumb luck! Common statistical methods Learning about statistics may seem daunting at first, but it can be easily grasped after we learn a few simple concepts. Some key ways we can summarize data include: Mean Family Income ■■ Average/Mean: The value that represents the calculated central value of a data set. Calculated by adding up all of the values and dividing by how many numbers there are. Median Family Income ■■ Standard Deviation (STD): A measure of the variability of the data around the mean. If you have ten observations that are identical the STD is small. If you have ten observations that are all different the STD will be large. 60,000 ≠ 44,000 Figure 4: If data is not normally distributed, mean and median can be very different. ■■ Median: The middle number in a set of data if you sort the numbers from smallest to largest. This helps you find the most common observation rather than the average one. For example, in the US the average family income is 60,000 dollars but the median family income is 44,000 dollars. Why are they so different? Think Bill Gates! 1. What is the difference between a representative sample and a population? aa. The sample is smaller. bb. The population includes every person of interest. cc. The sample is used in scientific studies. dd. All of the above. 2. What is the likely cause of a sample population not representing the entire population? aa. The sample size is too small. bb. Selection bias of the sample. cc. The sample was not randomly selected. dd. A & B only. ee. B & C only. ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ 168 LESSON READINGS DEFINITIONS OF TERMS Normal distribution — Also called a 'bell curve'. Represents the distribution of many random variables as a symmetrical bellshaped graph. For a complete list of defined terms, see the Glossary. Wo r k b o o k Lesson 4.4 When we plot a graph to show the distribuAverage$ tion of data they can make different shapes. Sometimes data can be skewed to one side, or have several peaks and valleys. Often the Bell$Curve$ distribution of data will make the shape of a hand bell, with no bias to the left or the right (as shown in Figure 5). When this occurs the majority of data will fall in the middle of the graph, with a symmetrical tapering as you move away from the center. This type of distribution is called a normal distribuFigure 5: The normal distribution looks like a tion, or a bell-shaped curve. Many simple bell, and is sometimes called a 'bell curve'. examples of a data set will make a normal distribution, such as the heights of a group of people, blood pressure, or the grades in a classroom. The shape of the distribution is an important thing to consider when you of compare data from two groups. When the distribution is normal the average will be close to the median, but when the distribution is skewed the average will differ from the median, just like the income example given above. Comparing groups As long as the distribution of data in both experimental groups is the same shape, we can use statistics to measure how different the two groups are. If the two groups of data do not differ from one another, they would overlap on a graph when plotted. The more different the data sets are, the further apart their distriControl Treatment butions will appear on a graph, as in Figure 6. After Group Mean Group Mean statistically determining whether the data distribution of the experimental groups is the same or different, you Figure 6: Group means can be will be able to either accept or reject your hypothesis. compared if both groups have similar Let's again use blood glucose concentrations after distributions. a meal as an example. An experimental hypothesis may be that people who are obese will have higher blood glucose concentrations two hours after a meal than people who are lean. Statistics will be able to determine how different the blood glucose levels are between the groups. But remember, statistics only measures how different groups are not whether the variables are correlated or causative! 3. The mean and the median are the same: aa. In all data sets. bb. If the data are distributed normally. cc. If the distribution of the data is skewed. dd. All of the above. ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ 169 LESSON READINGS DEFINITIONS OF TERMS Coefficient of determination — Also called 'r-squared'. Indicates how well data points fit a straight line. P-value — The probability of obtaining a test statistic at least as extreme as the one observed if the hypothesis were false. For a complete list of defined terms, see the Glossary. Wo r k b o o k Lesson 4.4 Figure 7: Statistical significance shows how related or different data sets are, not how likely the hypothesis is to be correct. A statistical measure called a p-value helps you determine the significance of your results. As a reader, you can look for the p-value to determine if the differences between groups are significant. The smaller a p-value is, the greater the difference is between the groups. For example, if you do the blood glucose concentration experiment above and find a p-value of 0.01, that means that there is a probability of 1 in 100 that these observation happened randomly. In other words, there is a 1 in a 100 chance that the blood glucose concentrations of the two groups are the same, so we can say with confidence that the groups are significantly different. For biological research, a p-value of 0.05 or less is considered significant. Statistics can be used to describe correlations We have talked a lot about correlations, but what does a Posi%ve No Nega%ve correlation look like? Variables Correla%on Correla%on Correla%on can be positively or negatively correlated. A positive correlation means that as the value of one variable increases, so does the value of the other factor. The temperature outside may be positively correlated with your Figure 8: When two variables are plotted they can somedesire to go to the beach, for times have a linear relationship, which demonstrates correlation. If there is no linear shape, there is no correlation. example, whereas the temperature is probably negatively (or inversely) correlated with your desire to cozy up next to a fire. The coefficient of determination, or r2 (r-squared), is used to describe the variability of the data from the linear correlation. In other words, if you plot the variables in against each other do you get a line or a blob? Data that is very scattered (blob like) when plotted will have an r2 close to zero, and data that makes an almost perfect line when plotted will have an r2 close to 1. You can think of r2 as representing the percent of the data that is closes to a line of best fit. For example, if the r2 = 0.75, then 75% of the data is in a linear relationship with one another. 4. A significant p-value means causation has been proved. aa. True. bb. False. 5. Two variables are correlated when: aa. One causes a change in the other. bb. A statistical difference isn't found. cc. They have a linear relationship when plotted. dd. Their coefficient of determination (r2) is less than 0.01. ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ ________________________________ 170 STUDENT RESPONSES Let's say you are observing two groups, one with a BMI over 30 and the other with a BMI between 20 and 25. You find that the BMI 30+ group has higher LDL cholesterol than the 20-25 group with a p-value of 0.001. What type of conclusions can you make based on this information? (Please draw graph(s) to help with to explanation.) __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ Remember to identify your sources __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ __________________________________________________________ _____________________________________________________________________________________________________ _____________________________________________________________________________________________________ _____________________________________________________________________________________________________ Wo r k b o o k Lesson 4.4 _____________________________________________________________________________________________________ ___________________________________________________________________________________________ 171 TERMS TERM For a complete list of defined terms, see the Glossary. Wo r k b o o k Lesson 4.4 DEFINITION Coefficient of Determination Also called 'r-squared'. Indicates how well data points fit a straight line. Mean The mathematical average of all samples in a data set. Median The numerical value that separates the higher half of data from the lower half. Normal Distribution Also called 'bell curve'. Represents the distribution of many random variables as a symmetrical bell-shaped graph. P-Value The probability of obtaining a test statistic at least as extreme as the one observed if the hypothesis were false. Population An entire collection of people or animals of interest from which data is collected. Representative Sample A subset of the greater population of interest that accurately reflects the members of the entire population. Standard Deviation Indicates the variability or deviation for an experimental group. 172