Chapter 3 Review
Two Variable Statistics
Veronica Wright
Christy Treekhem
River Brooks
The Big Idea
• This chapter explains how scatterplots can be used to represent the relationship
between two quantitative variables. A scatterplot makes it easy to spot the
direction, form, and strength of a relationship and helps to isolate outliers.
Residual plots are a further tool for checking how well a model describes the
data. The LSRL can be used to predict results based on the data, while the
correlation coefficient and the coefficient of determination measure how strong
the linear relationship is and how well the line fits.
• We use this all the time in statistics and just about everywhere else. Just by
looking at a scatterplot, a person already uses a number of these principles to
infer information from it. The most obvious is a sense of how things are likely
to develop, based on the data collected so far. Almost every field uses this:
economics, politics, manufacturing, and even sports.
Important Vocabulary
• Direction – whether the response tends to increase (positive) or decrease (negative)
as the explanatory variable increases, as seen on a scatterplot
• Scatterplot – a graph that shows the relationship between 2 quantitative variables
that are measured on the same individuals
• Response variable – a variable that measures the outcome of a study, i.e. the
dependent variable
• Explanatory variable – a variable that influences the response variable, i.e. the
independent variable
• Form – the shape that the data resembles when displayed on a scatterplot, e.g.
curved, linear, exponential, etc.
• Strength – how closely the data points follow the form
• Outlier – a data point that doesn’t follow the form as closely as all the others; a
data point that seems significantly out of place on a scatterplot
• Correlation coefficient – a measure of the direction and strength of the linear
relationship between two quantitative variables, usually represented as r
Important Vocabulary
• Regression line – a line that describes how the response variable changes when the
explanatory variable changes
• Extrapolation – using the regression line to predict results beyond the scope of the
actual data
• LSRL, the least-squares regression line – the line that makes the sum of the squared
vertical distances from the data points to the line as small as possible: ŷ = a + bx
• Residual – the difference between an actual data point and where the regression line
says that particular data point should fall: residual = observed y − predicted ŷ
• Residual plot – a scatterplot of the data’s residuals against the explanatory variable
• Coefficient of determination – the fraction of the variability in the response
variable that is accounted for by the LSRL; the higher the coefficient, the better the
LSRL fits the data. It is usually shown as r^2 and is always between 0 and 1
• Lurking variable – a variable other than the response and explanatory variables that
may influence the relationship between them
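To make the LSRL, residual, and coefficient of determination definitions concrete, here is a minimal Python sketch. The data set and the function names (`fit_lsrl`, `residuals`, `r_squared`) are made up for illustration and are not from the chapter; the slope and intercept use the standard least-squares formulas.

```python
# Made-up data for illustration only (not from the chapter).
xs = [1, 2, 3, 4, 5]   # explanatory variable
ys = [2, 4, 5, 4, 5]   # response variable


def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed y minus predicted y-hat."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def r_squared(ys, resids):
    """Fraction of the variation in y accounted for by the line."""
    y_bar = sum(ys) / len(ys)
    sse = sum(e ** 2 for e in resids)          # variation left over
    sst = sum((y - y_bar) ** 2 for y in ys)    # total variation in y
    return 1 - sse / sst


a, b = fit_lsrl(xs, ys)
resids = residuals(xs, ys, a, b)
r2 = r_squared(ys, resids)
print(f"y-hat = {a:.2f} + {b:.2f}x, r^2 = {r2:.2f}")
print("residuals:", [round(e, 2) for e in resids])
```

Plotting `resids` against `xs` would give the residual plot. The residuals of an LSRL always add up to zero (apart from rounding), which is a quick check that the line was fitted correctly.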
Key Topics Covered in the Chapter
• How to graph and determine the
relationship between independent
(explanatory) and dependent (response)
variables
• Correlation – how to find it, and what it
means
• Regression line (Best fit, LSRL) – how to
find it, what it means, and how well it fits
the data
Formulas You Ought to Know
• The regression line formula (LSRL):
– ŷ = a + bx
• With ŷ being the predicted response, a being the y-intercept, b
being the slope, and x being the explanatory variable.
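• The formula for the mean:
– x̄ = (x1 + x2 + x3 + ... + xn)/n
• The formula for standard deviation:
– s_x = sqrt( [(x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²] / (n − 1) )
• The formula for r (correlation coefficient):
– r = [1/(n − 1)] × Σ [(xi − x̄)/s_x] × [(yi − ȳ)/s_y]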
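The mean, standard deviation, and r named above can also be worked out step by step. The sketch below assumes the usual sample versions (dividing by n − 1), which is what a calculator reports as Sx and Sy; the data and function names are invented for illustration.

```python
from math import sqrt


def mean(values):
    """Add the values and divide by how many there are."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = average product of the standardized x and y values (n - 1 divisor)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

print("mean of x:", mean(hours))
print("Sx:", round(sample_sd(hours), 4))
print("r:", round(correlation(hours, scores), 4))
```

Because r is built from standardized values, it has no units and does not change if x or y is rescaled (for example, hours converted to minutes).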
Calculator Key Strokes
• In this unit, we use the calculator to find Sx, Sy, the mean of x, the mean of y,
r, r^2, and the LSRL, as well as to graph the scatterplot and the residual plot.
• To find r^2, r, and the LSRL, do the following:
– Enter the data sets into L1 and L2.
– Press STAT, move to CALC, and press “8” (LinReg(a+bx)).
– Insert your lists in the order (Explanatory List, Response List).
• To find Sx, Sy, the mean of x, or the mean of y, do all of the above, except press
“2” (2-Var Stats) instead of “8”.
Calculator Key Strokes
• To plot the scatterplot, do the following:
– Enter the data set.
– Make sure the plot is On.
– Choose the Scatterplot.
• To find the residual plot, do all of the above, except change “Ylist” to “Resid”.
• If you cannot find RESID in your Statlist, do the following: scroll down to
DiagnosticOn and select it. Now it should work, but MAKE SURE that you have already
calculated the LSRL.
Example Problems
• A study shows that there is a positive correlation between the size of
the hospital and the median number of days patients remain in the
hospital. Does this mean you can shorten a stay by choosing a small
hospital? Explain.
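• No, correlation is not causation. A likely lurking variable is how serious the
illness or injury is: patients with serious conditions tend to go to larger
hospitals and also need longer stays, while patients with minor injuries may
not feel the need to go to a larger hospital.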
Example Problems
• The Standard and Poor 500 index is an average of the price of 500
stocks. There is a moderately strong correlation (r equals
approximately 0.6) between how much this index changes in
January and how much it changes during the entire year. If we
looked instead at data on all 500 individual stocks, we would find a
very different correlation. Would the correlation be higher or lower?
Why?
• The correlation would be lower; the individual stock performances
will be more variable, weakening the relationship.
Example Problems
• A study of elementary school children ages 6-11 finds a high positive
correlation between shoe size x and score y on a test of reading
comprehension. What explains this correlation?
• Age is a lurking variable. We would expect both quantities to
increase with age.
Example Problems
• A college newspaper interviews a psychologist about student ratings
of the teaching of faculty members. The psychologist says, “The
evidence indicates that the correlation between research
productivity and teaching rating of faculty members is close to zero.”
The paper reports this as “Professor McDaniel said that good
researchers tend to be poor teachers, and vice versa.” Explain why
this is wrong, and explain the psychologist's meaning.
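• Professor McDaniel did not say that good researchers make poor teachers;
that would be a negative correlation. A correlation close to zero means there
is little or no linear relationship between research productivity and teaching
rating, so knowing how productive a researcher is tells us almost nothing
about how well that person teaches.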
Example Problems
• Explain why this is wrong: “There is a high correlation between
gender of American workers and their income.”
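• Gender is categorical, not quantitative. Correlation is only defined for two
quantitative variables, so the statement should talk about an association
between gender and income instead.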
Example Problems
• Explain the error: “We found a high correlation (r=1.09) between
students' ratings of teaching and ratings made by other faculty
members.”
• r must be between −1 and 1, so a value of 1.09 is impossible.
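The limits on r can be checked with a quick sketch: data that fall exactly on a rising line give r = 1, data exactly on a falling line give r = −1, and anything less perfect lands in between. The data are made up, and the function uses the shortcut form r = Sxy / sqrt(Sxx × Syy), which is algebraically the same as the usual formula for r.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r via Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data for illustration only.
xs = [1, 2, 3, 4]
rising = [3 + 2 * x for x in xs]     # exactly on a line with positive slope
falling = [10 - 5 * x for x in xs]   # exactly on a line with negative slope
scattered = [1, 3, 2, 5]             # roughly increasing, but not a perfect line

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
r_scattered = correlation(xs, scattered)

print("perfect rising line: ", round(r_rising, 4))
print("perfect falling line:", round(r_falling, 4))
print("scattered points:    ", round(r_scattered, 4))
```

A reported value such as r = 1.09 therefore signals a calculation or reporting mistake, not an unusually strong relationship.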
Helpful Hints
• Some people can’t find the RESID button to
get the residual plot, or they get an error. That’s
because you need to find the LSRL first
before that is even possible.
• If you can’t see anything when you plot your
scatterplot, press Zoom -> 9.
• If the RESID plot has any type of pattern, you
don’t want an LSRL. A different model –
perhaps a power or exponential one, if it is
curved – would suit the data better.
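The last hint can be seen in a small sketch. The data below are made up to grow exactly exponentially (y = 2^x). A straight line leaves residuals with a clear curved pattern, while a line fitted to log(y), which is one common way to fit an exponential model, leaves essentially none. The function names are hypothetical.

```python
from math import log


def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals(xs, ys, a, b):
    """Residual = observed y minus predicted y-hat."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sign_pattern(resids):
    """Show each residual as + (above the line) or - (below the line)."""
    return "".join("+" if e > 0 else "-" for e in resids)


# Made-up data that grow exponentially.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 ** x for x in xs]

# Straight-line fit to the raw data: residuals run high, low, high.
a, b = fit_lsrl(xs, ys)
line_resids = residuals(xs, ys, a, b)
pattern = sign_pattern(line_resids)
print("line residual signs:", pattern)

# Straight-line fit to log(y): the curve is gone, residuals are about zero.
log_ys = [log(y) for y in ys]
a_log, b_log = fit_lsrl(xs, log_ys)
log_resids = residuals(xs, log_ys, a_log, b_log)
print("largest log-model residual:", max(abs(e) for e in log_resids))
```

A run of positive residuals, then negative, then positive again is the kind of pattern that says a line is the wrong model for the data.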
The End