Download LESSON 4.4 WORKBOOK Seeing through the static — How do

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
LESSON 4.4 WORKBOOK
Seeing through the static — How do
we identify correlations in data?
Number of Shark A4acks In this lesson we will learn more about some of
the statistics that are used in scientific research.
We will see how researchers sort through scientific data in search of potential factors that link
diet to health outcomes. We will explore how
primary data addresses questions like ‘what is a
significant correlation?’ and ‘How might this data
impact our ideas about factors that contribute to
or protect against heart disease?’
Wo r k b o o k
Lesson 4.4
Correlation does not mean
causation!
What is the difference between two variables that are
correlated, and a cause and effect relationship?
Ice Cream Sales Figure 1: Just because ice cream
sales and shark attacks are correlated, it does not mean that ice
cream sales cause shark attacks!
Have you ever heard people declare that every time they
wash their car, it rains? While this may seem to be true, it
certainly is not likely that washing a car is what is causing
rain to fall. There may be a correlation between timing
of the car wash and the weather, but there is no causal
relationship. This delineation may sound simplistic, but the
confusion between correlation and causation can puzzle
even the most well versed scientists.
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
166
LESSON READINGS
A correlation is simply an association between two factors. On the other hand, a causal relationship
means that one factor depends on the other. As we saw in the last lesson, the type of study determines
whether the results can show correlative or causative relationships; observational studies can only
provide correlative data, whereas interventional studies can provide causative data.
The type of study determines the type of results:
The chicken definitely came first, I think? Observational studies can only provide correlative data,
whereas interventional studies can provide causative data.
Because many human studies in nutrition research are observational, they can only show correlations
between a nutrient and a health outcome. Additionally, many causative relationships have only been
demonstrated by an animal model or cell culture study. So how can we make educated decisions about
what the results mean for humans trying to live a healthy life?
In order to prove that one variable (the independent variable) causes another (the dependent variable), the
timing has to be right. This means that the change in the dependent variable must follow the change in the
independent variable. If your independent variable was a meal, for example, and the dependent variable
was blood glucose concentrations, you would see that after someone consumes a meal their blood
glucose concentrations will rise.
Correlative data often gets misinterpreted as causative data
Wo r k b o o k
Lesson 4.4
Figure 2: Eating food
coloring is correlated
with hyperactivity but
may not cause it.
Some nutrients and foods get a reputation for being 'bad', while others
are known as 'good'. Most of these assumptions are based on correlative data. There are many instances where consumption of a nutrient
or food correlates with some health outcome. For instance, increased
consumption of some artificial food colorings correlates with hyperactivity
in children, but the mechanism that would explain how the food coloring
is causing hyperactivity has not been established. Even so, the belief
that the food colorings are the 'cause' for hyperactivity is strong enough
to make parents avoid the chemicals, and has led the FDA to review the
safety of food dyes. When reading a news article describing study
results, ask yourself whether the results are truly demonstrating a cause and effect, or whether there might be alternative
explanations for the results.
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
167
LESSON READINGS
Statistics are used to measure correlation, not causation
Strong results come from good sample selection
DEFINITIONS OF TERMS
Mean — The mathematical average of all samples in a data set.
Median — The numerical value
that separates the higher half of
data from the lower half.
Population — An entire collection of people or animals of interest from which data is collected.
Representative sample — A
subset of the greater population of interest that accurately
reflects the members of the entire
population.
Standard deviation (STD) — Indicates the variability or deviation
for an experimental group.
For a complete list of defined
terms, see the Glossary.
Wo r k b o o k
Lesson 4.4
If we wanted to learn what the common blood glucose
concentration is after eating a specific food, we
could not possibly measure every single person’s
blood glucose. How then do we gather generalizable
research findings? One way to do this is by selecting a
Figure 3: You can't measure
representative sample. This sample should include a
everyone so you need to select a
random selection of people that represent the greater
sample group to study.
population. We use samples as estimates of the
whole population, but we will never be able to know
the results of an entire population without making the measurement on everybody. It is common to find
different results when using different representative samples simply because of the variability among
human participants. If the representative sample chosen was truly characteristic of the greater population,
it should provide a reliable estimate of what results we can expect in the population. Let's say you pick a
perfect sample group, how do you know if your findings are real rather than random? If a person flips a
coin 100 times and gets 85 heads, what are the odds that that would happen? An important concept is
that statistics can be used to measure the probability that your findings are sheer dumb luck!
Common statistical methods
Learning about statistics may seem daunting at first, but it can be
easily grasped after we learn a few simple concepts. Some key ways
we can summarize data include:
Mean Family Income ■■ Average/Mean: The value that represents the calculated central
value of a data set. Calculated by adding up all of the values and
dividing by how many numbers there are.
Median Family Income ■■ Standard Deviation (STD): A measure of the variability of the
data around the mean. If you have ten observations that are
identical the STD is small. If you have ten observations that are
all different the STD will be large.
60,000 ≠
44,000 Figure 4: If data is not
normally distributed,
mean and median can be
very different.
■■ Median: The middle number in a set of data if you sort the numbers from smallest to largest. This
helps you find the most common observation rather than the average one. For example, in the US
the average family income is 60,000 dollars but the median family income is 44,000 dollars. Why
are they so different? Think Bill Gates!
1. What is the difference between
a representative sample and a
population?
aa. The sample is smaller.
bb. The population includes every
person of interest.
cc. The sample is used in scientific
studies.
dd. All of the above.
2. What is the likely cause of a sample
population not representing the
entire population?
aa. The sample size is too small.
bb. Selection bias of the sample.
cc. The sample was not randomly
selected.
dd. A & B only.
ee. B & C only.
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
168
LESSON READINGS
DEFINITIONS OF TERMS
Normal distribution — Also
called a 'bell curve'. Represents
the distribution of many random
variables as a symmetrical bellshaped graph.
For a complete list of defined
terms, see the Glossary.
Wo r k b o o k
Lesson 4.4
When we plot a graph to show the distribuAverage$
tion of data they can make different shapes.
Sometimes data can be skewed to one side,
or have several peaks and valleys. Often the
Bell$Curve$
distribution of data will make the shape of a
hand bell, with no bias to the left or the right
(as shown in Figure 5). When this occurs
the majority of data will fall in the middle of
the graph, with a symmetrical tapering as
you move away from the center. This type
of distribution is called a normal distribuFigure 5: The normal distribution looks like a
tion, or a bell-shaped curve. Many simple
bell, and is sometimes called a 'bell curve'.
examples of a data set will make a normal
distribution, such as the heights of a group of
people, blood pressure, or the grades in a classroom. The shape of the distribution is an important thing to
consider when you of compare data from two groups. When the distribution is normal the average will be
close to the median, but when the distribution is skewed the average will differ from the median, just like
the income example given above.
Comparing groups
As long as the distribution of data in both experimental
groups is the same shape, we can use statistics to
measure how different the two groups are. If the two
groups of data do not differ from one another, they
would overlap on a graph when plotted. The more
different the data sets are, the further apart their distriControl Treatment butions will appear on a graph, as in Figure 6. After
Group Mean Group Mean statistically determining whether the data distribution of
the experimental groups is the same or different, you
Figure 6: Group means can be
will be able to either accept or reject your hypothesis.
compared if both groups have similar
Let's again use blood glucose concentrations after
distributions.
a meal as an example. An experimental hypothesis
may be that people who are obese will have higher
blood glucose concentrations two hours after a meal than people who are lean. Statistics will be able to
determine how different the blood glucose levels are between the groups. But remember, statistics only
measures how different groups are not whether the variables are correlated or causative!
3. The mean and the median are the
same:
aa. In all data sets.
bb. If the data are distributed
normally.
cc. If the distribution of the data is
skewed.
dd. All of the above.
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
169
LESSON READINGS
DEFINITIONS OF TERMS
Coefficient of determination —
Also called 'r-squared'. Indicates
how well data points fit a straight
line.
P-value — The probability of
obtaining a test statistic at least
as extreme as the one observed
if the hypothesis were false.
For a complete list of defined
terms, see the Glossary.
Wo r k b o o k
Lesson 4.4
Figure 7: Statistical significance shows
how related or different data sets are,
not how likely the hypothesis is to be
correct.
A statistical measure called a p-value helps you determine the significance of your results. As a reader, you
can look for the p-value to determine if the differences
between groups are significant. The smaller a p-value
is, the greater the difference is between the groups.
For example, if you do the blood glucose concentration experiment above and find a p-value of 0.01, that
means that there is a probability of 1 in 100 that these
observation happened randomly. In other words, there
is a 1 in a 100 chance that the blood glucose concentrations of the two groups are the same, so we can
say with confidence that the groups are significantly
different. For biological research, a p-value of 0.05 or
less is considered significant.
Statistics can be used to describe correlations
We have talked a lot about
correlations, but what does a
Posi%ve No Nega%ve correlation look like? Variables
Correla%on Correla%on Correla%on can be positively or negatively
correlated. A positive correlation
means that as the value of one
variable increases, so does the
value of the other factor. The
temperature outside may be
positively correlated with your
Figure 8: When two variables are plotted they can somedesire to go to the beach, for
times have a linear relationship, which demonstrates correlation. If there is no linear shape, there is no correlation.
example, whereas the temperature is probably negatively (or
inversely) correlated with your
desire to cozy up next to a fire. The coefficient of determination, or r2 (r-squared), is used to describe
the variability of the data from the linear correlation. In other words, if you plot the variables in against
each other do you get a line or a blob? Data that is very scattered (blob like) when plotted will have an r2
close to zero, and data that makes an almost perfect line when plotted will have an r2 close to 1. You can
think of r2 as representing the percent of the data that is closes to a line of best fit. For example, if the r2 =
0.75, then 75% of the data is in a linear relationship with one another.
4. A significant p-value means
causation has been proved.
aa. True.
bb. False.
5. Two variables are correlated when:
aa. One causes a change in the
other.
bb. A statistical difference isn't
found.
cc. They have a linear relationship
when plotted.
dd. Their coefficient of determination
(r2) is less than 0.01.
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
________________________________
170
STUDENT RESPONSES
Let's say you are observing two groups, one with a BMI over 30 and the other with a BMI between 20 and 25. You find that the
BMI 30+ group has higher LDL cholesterol than the 20-25 group with a p-value of 0.001. What type of conclusions can you
make based on this information? (Please draw graph(s) to help with to explanation.)
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
Remember to identify your
sources
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
_____________________________________________________________________________________________________
_____________________________________________________________________________________________________
_____________________________________________________________________________________________________
Wo r k b o o k
Lesson 4.4
_____________________________________________________________________________________________________
___________________________________________________________________________________________
171
TERMS
TERM
For a complete list of defined
terms, see the Glossary.
Wo r k b o o k
Lesson 4.4
DEFINITION
Coefficient of
Determination
Also called 'r-squared'. Indicates how well data points fit a straight line.
Mean
The mathematical average of all samples in a data set.
Median
The numerical value that separates the higher half of data from the lower half.
Normal Distribution
Also called 'bell curve'. Represents the distribution of many random variables as a symmetrical bell-shaped
graph.
P-Value
The probability of obtaining a test statistic at least as extreme as the one observed if the hypothesis were
false.
Population
An entire collection of people or animals of interest from which data is collected.
Representative Sample
A subset of the greater population of interest that accurately reflects the members of the entire population.
Standard Deviation
Indicates the variability or deviation for an experimental group.
172