Note-taking Guide - Germantown School District

Tuesday, September 25: 5.1-5.2 Describing Bivariate Data with Scatterplots (review ch. 3-4)
The first step when analyzing the relationship between two _______________ variables is to graph the
data using a _______________.
In a scatterplot, the ________________ variable should be on the x-axis and the ___________ variable
should be on the y-axis. The explanatory variable seeks to explain or predict changes in the response
variable. Usually the explanatory variable comes first chronologically.
Each axis should be clearly __________ with the variable’s name and unit. It should also have a well-marked and uniform ___________ on each axis; however, the scales do not need to be the same.
The axes often intersect at (0, 0) but this can change depending on the range of the data sets. Patterns
are often more visible when there is less “empty space” and the data is more spread out.
In general, when describing a scatterplot,
1. Look for the ______________ of the association
 __________________________: higher values of one variable are associated with higher values of the other variable
 __________________________: higher values of one variable are associated with lower values of the other variable
 __________________________: higher values of one variable do not give any information about the values of the other variable
2. Look for the _______ of the data
3. Look for the ______________.

If there is not much scatter, we say there is a “strong” association between the variables.
4. Look for ____________________: _____________ that fall outside the pattern of the rest of the data
and __________ of points that are isolated from the rest of the data. Always investigate these values!
For the following sets of bivariate data, which would be the explanatory variable?
1. Scuba diving: depth and visibility
2. World population vs. year
3. Amount of rain vs. crop growth
4. Height vs. GPA
Using the TI-83 to make scatterplots
 Zoomstat
 Window
 Note: to sort bivariate data and keep the ordered pairs together, enter SortA(L1, L2). This will
sort the data by L1 and keep the pairs together.
HW #1 5.1-5.9 odd
Thursday, September 27: 5.3 Fitting a line to Bivariate data (chapter 3-4 test)
When the form in a scatterplot is linear, we can use an equation in the form ŷ = a + bx to model the relationship between the explanatory variable (x) and the response variable (y).
 a = y-intercept (constant)
 b = slope
 ŷ (“y hat”) signifies that the value of ŷ is an estimate or prediction
 Statisticians prefer the form ŷ = a + bx instead of ŷ = mx + b, but they are equivalent.
 Some books use the notation ŷ = b₀ + b₁x
How can we find the best linear model?
Since our goal is to make good predictions, we want to minimize the vertical deviations from the
observations to the line. These vertical deviations are called __________________.
residual = observed y value - predicted y value = y - ŷ
The best fitting line is the line which minimizes the sum of the squared residuals, Σ(y − ŷ)².
This line is called the __________________________________________________ (LSRL).
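The "least squares" idea can be sketched in a few lines of Python (not part of the course, which uses the TI-83; the data here is made up for illustration): after fitting by the usual formulas, nudging the line in any direction only increases the sum of squared residuals.

```python
# Minimal sketch: fit a least-squares line, then check that moving the
# line (shifting the intercept or tilting the slope) only increases
# the sum of squared residuals.  Illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope b = Sxy / Sxx, intercept a = y_bar - b * x_bar
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

def ss_resid(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = ss_resid(a, b)
# Any other line does worse:
assert best < ss_resid(a + 0.1, b)
assert best < ss_resid(a, b + 0.1)
print(round(a, 3), round(b, 3), round(best, 3))
```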
JMP-Intro Script: LeastSquaresDemo
Applet: http://mathforum.org/dynamic/java_gsp/squares.html (try this at home)
Using the TI-83 to calculate the LSRL:
 enter the data in L1 and L2
 stat: calc: 8: LinReg (a+bx) L1,L2 (Note: 4 and 8 are the same, just different forms)
 You should always use the TI83 to find the LSRL. Ignore any directions that say otherwise.
Consider the following data describing the age (in months) vs. height of infants (in inches):
a. sketch the scatterplot and describe what you see
age  height
1    20.5
1    19
2    21
4    22
4    23.5
6    22.5
7    23
7    24
9    26
b. since the scatterplot shows a linear form, calculate the LSRL and graph it on the plot
c. interpret the slope in the context of the problem
d. interpret the y-intercept in the context of the problem
e. if a child is 5 months old, how tall should he be, based on the model? In other words, how tall is
an average 5 month old?
f. would you be willing to predict the height of a 10 year old child with this model? Why not?
Def: _____________________ is using your model to make predictions outside of the range of the data.
It is very unreliable since the pattern of the data may not stay the same.
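For an off-calculator check of the age/height exercise above, here is a rough Python sketch (assuming the data pairs as listed). It also shows why part f is extrapolation: 10 years is 120 months, far outside the 1-to-9-month range of the data.

```python
# Quick check of the age/height exercise, assuming the data as listed.
# The course uses the TI-83 (LinReg(a+bx)); this is just a sketch.
ages = [1, 1, 2, 4, 4, 6, 7, 7, 9]                    # months
heights = [20.5, 19, 21, 22, 23.5, 22.5, 23, 24, 26]  # inches

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(heights) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights)) / \
    sum((x - x_bar) ** 2 for x in ages)
a = y_bar - b * x_bar

pred_5 = a + b * 5      # part e: predicted height of a 5-month-old (about 22.7 in)
pred_120 = a + b * 120  # part f: 10 years = 120 months -- extrapolation!
print(f"height-hat = {a:.2f} + {b:.3f}(age)")
print(f"5 months: {pred_5:.1f} in;  120 months: {pred_120:.1f} in (not trustworthy)")
```

The 120-month prediction comes out near 8 feet, which is exactly why extrapolation is unreliable.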
HW #2 5.26, 27, 29, 30
Monday, October 1: 5.4 Assessing the Fit of a Line
After we find the Least Squares line, we should examine how well the model fits the data.
Important questions to consider are:
1. Is a linear model really appropriate, or would a curved model be better?
2. If we make predictions with the model, how accurate will our predictions be?
3. Are there any unusual aspects of the data set we need to consider before we make predictions
with the model?
Question 1: Is the linear model appropriate?
Remember that a residual is the vertical distance from the point to the line: y – ŷ
Def: a ____________________ gives us a closer look at the pattern of the residuals. It plots the x-values on the x-axis and the residuals on the y-axis.
What kind of pattern will we see in a residual plot if we fit a line to linear data?
Making residual plots on the TI-83:
L1 = x
L2 = y
L3 = ŷ
L4 = y – ŷ
Scatterplot L1, L4
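The same TI-83 recipe (L3 = ŷ, L4 = y − ŷ) can be sketched in Python, using the first practice data set below as an example; it also checks a fact about the LSRL, that its residuals sum to zero.

```python
# The TI-83 residual-plot steps in Python: compute y-hats ("L3"),
# then residuals ("L4").  Data: the first practice set below.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1, 5, 7, 8, 12, 12, 17]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

y_hats = [a + b * x for x in xs]                    # "L3"
residuals = [y - yh for y, yh in zip(ys, y_hats)]   # "L4"

# Residuals from the LSRL always sum to (essentially) zero:
assert abs(sum(residuals)) < 1e-9
print([round(r, 2) for r in residuals])
```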
For the following data sets, sketch the original data with the least squares line. Then, sketch the residual
plot and use it to decide if the linear model is appropriate.
x y
1 1
2 5
3 7
4 8
5 12
6 12
7 17
x y
1 .1
2 .8
3 2
4 3.3
5 5
6 7.3
7 9.9
For the second data set, the data is close to the line (not much scatter) even though there is an obvious curve in the residual plot. The residual plot indicates that a line is not the best way to model this data.
However, the lack of scatter means that the predictions using the linear model will still be fairly accurate
within the range of our data, though not as accurate as with a curved model.
In conclusion, a residual plot will tell you if a linear model is the right type of model (has the right form)
or if we should consider fitting a non-linear model.
NOTE: Sometimes a residual plot is made with the predicted values ( ŷ ) on the x-axis and the residuals
on the y-axis. This is because computer software packages are built for multiple regression, which uses
many different x-variables to predict y. Instead of using 1 of the x-variables and ignoring the others,
they use the predicted values since they are a function of all the x’s. However, the plot will still show
the same characteristics and should be interpreted in the same way.
HW #3: 5.47, 5.48
Tuesday, October 2: 5.4 How accurate will our predictions be?
Suppose that I randomly selected 10 students from this class and recorded their weight (in pounds):
{103, 201, 125, 179, 150, 138, 181, 220, 113, 126}
If I were to randomly select one more student, what would be a good prediction for his or her weight?
Of course, this prediction is not likely to be correct.
Typically, how far are the observations from the mean? In other words, how far off should we expect to
be?
Is there any way to improve our prediction? In other words, is there a way I can reduce the standard
deviation?
Here are the heights (in inches) for the original 10 students:
{61, 68, 65, 69, 65, 61, 64, 72, 63, 62}
Sketch the scatterplot and calculate the LSRL.
Of course, the predictions using the regression line aren’t perfect either.
Standard Deviation about the Least Squares Regression Line:
To get a sense of how close the points are to the line, we can calculate the standard deviation about the
least squares regression line, which gives an estimate of the average distance each observation is from
the line (in other words, the average residual).
se = √( Σ(y − ŷ)² / (n − 2) ) = √( SSError / (n − 2) ) = √( SSResid / (n − 2) )
Note: “SS” = “Sum of Squares” so SSResid is the sum of squared residuals
Note: se is also called “root mean square error” (RMSE) or simply s.
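As a check on the definition, here is a rough Python sketch computing se for the height/weight data above (assuming the two lists are paired in the order given; the course uses the TI-83 for this).

```python
import math

# se for the height/weight data, assuming the lists pair up in order.
heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]            # x, inches
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]  # y, pounds

n = len(heights)
x_bar, y_bar = sum(heights) / n, sum(weights) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)) / \
    sum((x - x_bar) ** 2 for x in heights)
a = y_bar - b * x_bar

ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
se = math.sqrt(ss_resid / (n - 2))   # divide by n - 2, not n - 1
print(f"y-hat = {a:.2f} + {b:.2f}x,  SSResid = {ss_resid:.1f},  se = {se:.1f}")
```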
Calculate the standard deviation about the regression line for this data:
The coefficient of determination, r², is a measure of the proportion of variability in the y variable that can be “explained” by the linear relationship between x and y.
For example, suppose we open a pizza parlor, selling pizzas for $8 plus $1.50 per topping. If we were to
plot the points (0, 8.00), (1, 9.50), (2, 11.00) they would fall exactly on a line. In this case, the number
of toppings explains 100% (all) of the variability in price. Thus, r² = 1, or 100%.
Calculate the coefficient of determination for the height and weight data:
Look at the r-squared program again
To measure the total variability in the y variable (weight), we measure the variability of y from its mean:
The total sum of squares = SSTotal = Σ(y − ȳ)² =
Note: We do not consider the x variable at all when we calculate SSTotal.
Note: This is the same quantity that we use when we calculate s, the sample standard deviation
for one variable.
We can also consider the variability in y (weight) that still remains after we factor in x (height):
This is called the residual sum of squares: SSResid = SSError = Σ(y − ŷ)² =
Note: this is the same quantity that we used when we calculated se , the standard deviation about
the LSRL.
The difference between SSTotal and SSError is called SSModel. SSModel is the variability in y
(weight) that is explained by x (height).
[Diagram: a point above the line ŷ = a + bx, showing the total deviation y − ȳ split into the explained part ŷ − ȳ and the residual y − ŷ.]
SSTotal = SSModel + SSResid:  Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²
SSTotal is the variability in the response variable (considered by itself)
SSModel is the variability in the response variable that is accounted for by the explanatory
variable
SSResid is the variability in the response variable that is not accounted for by the explanatory
variable
Thus, the coefficient of determination can be computed as:
r² = SSModel/SSTotal = (SSTotal − SSResid)/SSTotal = 1 − SSResid/SSTotal = 1 − (variability of residuals)/(variability of weights)
Thus, we can say that ____% of the variability in a weight can be explained by height (the total
variability in weight has been reduced by ____%). This also means that ____% of the variability in
weight remains unexplained (it is due to other factors).
We can also say that height accounts for ____% of the variation in weight.
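The sums-of-squares identity and r² can be verified with a short Python sketch on the same height/weight data (again assuming the pairing given earlier).

```python
# The sums-of-squares identity and r^2 for the height/weight data.
heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]

n = len(heights)
x_bar, y_bar = sum(heights) / n, sum(weights) / n
b = sum((x - x_bar) * (w - y_bar) for x, w in zip(heights, weights)) / \
    sum((x - x_bar) ** 2 for x in heights)
a = y_bar - b * x_bar

ss_total = sum((w - y_bar) ** 2 for w in weights)
ss_resid = sum((w - (a + b * x)) ** 2 for x, w in zip(heights, weights))
ss_model = sum(((a + b * x) - y_bar) ** 2 for x in heights)

# SSTotal = SSModel + SSResid
assert abs(ss_total - (ss_model + ss_resid)) < 1e-6
r_sq = 1 - ss_resid / ss_total
print(f"r^2 = {r_sq:.3f}")   # fraction of weight variability explained by height
```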
Using the TI-83 to calculate r².
What is the relationship between r² and se?
 Both measure how well a line models the data
 r² has no unit and is usually expressed as a percent between 0% and 100%
 se is expressed in the same units as the response (y) variable
If n = 10, se = 8, and sy = 25, calculate r².
AP Question
HW #4: 5.38, 39, 40, 44, 49
Thursday, October 4: 5.2 Correlation / Understanding Regression Output
Another way to help us quantify the amount of scatter in a scatterplot is to calculate the
_____________________________________, which measures the strength of the linear relationship
between two quantitative variables.
For example, consider the following scatterplot showing fat (grams) vs. calories in 5 ounces of various
types of pizza (Statistics in Action, page 134).
How would the relationship change if I converted the values on the x-axis into milligrams instead of
grams?
Since the units of measure do not matter, we can standardize them by calculating the z-score for each
observation.
x  x x  12.17

For fat, z x 
sx
3.70
y  y y  331.92
For calories, z y 

sy
29.15


Now, we can look at a scatterplot of the z x , z y ordered pairs.
Which points give evidence that there is a positive association?
Which points count against a positive association?
Since the points in quadrants I and III indicate a positive association, we will use the products zx zy
which are always positive in QI and QIII and negative in QII and QIV.
To get an overall sense of the relationship, we add up these products: Σzxzy. If the sum is positive, we
have a positive association. If the sum is negative, we have a negative association. If there is no
association, the sum should be close to 0.
Also, since the size of this sum will get bigger the more data we have, we divide the sum by n – 1 to find
the correlation coefficient:
r = Σzxzy / (n − 1) = 18.95 / (24 − 1) = .82        Note: r = ±√r²
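Here is a rough Python sketch of that recipe on made-up data (not the pizza data): standardize each variable, multiply the z-scores, sum, and divide by n − 1. It also confirms that swapping x and y leaves r unchanged.

```python
import math

# Correlation via z-scores: r = sum(zx * zy) / (n - 1).
# Illustrative data (not the pizza data from the text).
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 4, 7, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

zx = [(x - x_bar) / sx for x in xs]
zy = [(y - y_bar) / sy for y in ys]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Swapping x and y gives the same r:
r_swapped = sum(b * a for b, a in zip(zy, zx)) / (n - 1)
assert abs(r - r_swapped) < 1e-12
print(round(r, 4))
```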
Note: When we use the word “correlation” in statistics, we are referring to the correlation coefficient. If
you want to describe a relationship in a more casual way, use the word “association”.
Note: r is often called Pearson’s correlation coefficient
Properties of the Correlation Coefficient:
1. The value of r does not depend on the unit of measure since r is based on z-scores, which have no
units. For example, the relationship between height and weight is equally strong if we use inches and
pounds or centimeters and kilograms.
2. r has no units.
3. The value of r does not depend on which variable is x and which is y. The product zx·zy is the same as zy·zx. Note: this isn’t true for the slope, intercept, or SD.
4. -1 ≤ r ≤ 1.
 When r > 0, the relationship is positive.
 When r < 0, the relationship is negative.
 As r --> ±1, the relationship is stronger and has less scatter
 As r --> 0, the relationship is weaker and has more scatter
5. r = ±1 only when the data are in a perfect line. This is the only case where the values of one variable
can be completely determined by the values of the other variable.
6. The value of r is a measure of the strength of a linear relationship. It measures how closely the data
fall to a straight line. An r value near 0, however, does not imply that there is no relationship, only no
linear relationship. For example, quadratic or sinusoidal data have an r close to 0, even though there
may be a strong relationship present.
Also, even though r measures the strength of a linear relationship, it does NOT tell us if a linear model is
appropriate. Only ___________________ can do that. The correlation coefficient just measures how
much scatter there is from the line on a scale from -1 to 1.
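Property 6 is easy to demonstrate: perfectly quadratic, symmetric data gives r = 0 even though y is completely determined by x. A minimal Python sketch:

```python
import math

# Symmetric quadratic data: a perfect relationship, but not linear,
# so the correlation coefficient is exactly 0.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # y is completely determined by x

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)
print(r)   # 0.0 -- no *linear* relationship, despite a perfect curve
```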
Don’t confuse correlation with causation:
For example, in the last 10 years, the number of students taking the AP Statistics exam has grown from
7500 to over 75,000. At the same time, the national murder rate has been decreasing, so there is a
negative correlation. Does this mean that we should make more people take AP Statistics so there will
be fewer murders?
Also, there is a strong positive association between monthly ice cream sales at Baskin Robbins and
monthly drowning deaths. Should we close Baskin Robbins to save people from drowning?
You can never prove cause-and-effect from a scatterplot!!
Understanding Computer Output: On the AP Exam, questions about regression frequently include
output from computer software, such as JMP-Intro.
Average High Temperature   Annual Precipitation
70   15
71   13
73    9
74   12
76    8
76   10
77    7
72   11
72   12
In the “Analyze” menu, choose “Fit Y by X” and enter annual precipitation for Y and average high
temperature for X. Click OK. You will see a scatterplot of the data. Click on the red arrow and choose
“fit line.” You should see the following output:
Linear Fit
Annual Precipitation = 76.437788 - 0.8940092 Average High Temp
Summary of Fit
RSquare                     0.747577
RSquare Adj                 0.711516
Root Mean Square Error      1.363495
Mean of Response            10.77778
Observations (or Sum Wgts)  9
Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      1       38.541731      38.5417  20.7312    0.0026
Error      7       13.013825       1.8591
C. Total   8       51.555556
Parameter Estimates
Term               Estimate    Std Error  t Ratio  Prob>|t|
Intercept          76.437788   14.42794     5.30     0.0011
Average High Temp  -0.894009    0.19635    -4.55     0.0026
The least squares regression line is given at the top, but on many AP questions, they will only give you
the bottom table. Make sure you can find the equation from the parameter estimates table only!
Other notes:
 To find the correlation coefficient (r), take the square root of RSquare (positive or negative, matching the sign of the slope)
 You can ignore RSquare Adj (this is a multiple regression topic)
 Root Mean Square Error is the standard deviation about the regression line, se . Other software
use the symbol “S” for this quantity.
 Mean of response is the average y-value ( y ).
 The Analysis of Variance table has all of the Sums of Squares needed to calculate r² and se.
 To make a residual plot, click on the red arrow by “linear fit” and choose “plot residuals”
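As a check, the "Summary of Fit" numbers can be recomputed from the Analysis of Variance and Parameter Estimates tables alone, using only the relationships covered above. A Python sketch with the numbers from the output:

```python
import math

# Recovering RSquare and Root Mean Square Error from the ANOVA table
# in the JMP output above, using only the sums of squares.
ss_model = 38.541731
ss_error = 13.013825
ss_total = 51.555556
n = 9

r_sq = ss_model / ss_total          # = 1 - ss_error / ss_total
se = math.sqrt(ss_error / (n - 2))  # standard deviation about the LSRL
print(round(r_sq, 6), round(se, 6))
# These agree with the "Summary of Fit" values (up to rounding):
# RSquare 0.747577, Root Mean Square Error 1.363495

# And the equation from the Parameter Estimates table:
intercept, slope = 76.437788, -0.894009
print(f"Annual Precipitation = {intercept} + ({slope}) * Average High Temp")
```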
HW #5 5.10-12, 14, 16-18, JMP problem (use the TI-83 to calculate r)
JMP: A random sample of cars was selected and the horsepower (HP) and miles per gallon (MPG)
were calculated. Let x = HP and y = MPG.
a. Make a scatterplot with the least squares regression line.
b. State the equation of the line.
c. Interpret the slope in the context of the problem.
d. Make a residual plot and comment.
e. State and interpret the value of the coefficient of determination.
f. State and interpret the value of the standard deviation about the regression line.
g. Show how parts e and f could be calculated without using the Summary of Fit section.
h. Calculate the correlation coefficient.
i. Predict the MPG for a car with 175 HP.
j. Suppose that we added a new point to the data set (HP = 250, MPG = 35). Add this point to your scatterplot by hand. How do you think this will affect the slope, r, r², and se?
k. Add this point in JMP to check your answers. Include the new output.
HP    MPG
110   36
170   23
165   20
93    29
142   21
214   28
114   30
124   29
225   23
255   25
155   31
200   27
70    43
81    33
168   28
Monday, October 15: 5.3 Regression
Earlier we learned why the Least Squares Regression Line includes the words “Least Squares.” Today,
we will learn why it includes the word “Regression.”
Thinking about the pizza example from yesterday: if we had a pizza with an average amount of fat, it would be reasonable to predict that it would also have an average number of calories.
This illustrates an important property of the LSRL: It will always go through the point ________.
If we think about the scatterplot of (zx, zy), this means that the line will go through (0, 0). In other words, the y-intercept will be 0. Furthermore, it can be proven that the slope of the LSRL on the standardized plane is equal to the correlation coefficient (r). Thus, on the standardized plane, the LSRL is: ẑy = 0 + r·zx
Note: Regression slopes don’t tell us the strength of an association since they are dependent on units.
For example, if y = height in meters, converting to cm increases the slope 100 times. Correlation is the
standardized version of the slope.
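That note can be demonstrated with a short Python sketch on made-up data: converting y from meters to centimeters multiplies the slope by 100 but leaves r unchanged.

```python
import math

# Rescaling a variable rescales the slope but not the correlation.
# Illustrative data: heights in meters, then the same heights in cm.
xs = [20, 25, 30, 35, 40]               # e.g. age in months
y_m = [0.80, 0.88, 0.93, 1.01, 1.06]    # height in meters
y_cm = [y * 100 for y in y_m]           # height in centimeters

def slope_and_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

b_m, r_m = slope_and_r(xs, y_m)
b_cm, r_cm = slope_and_r(xs, y_cm)
assert abs(b_cm - 100 * b_m) < 1e-9   # slope scaled by 100
assert abs(r_cm - r_m) < 1e-9         # correlation unchanged
```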
Here is the scatterplot of z-scores for the pizza data along with the line ẑy = .82·zx
So, if an x-value is one standard deviation above the mean of x (zx = 1), the predicted y-value will only be r standard deviations above the mean of y (ẑy = r(1) = r). If an x-value is two standard deviations below the mean of x (zx = −2), the predicted y-value will only be 2r standard deviations below the mean of y (ẑy = r(−2) = −2r).
Thus, since −1 ≤ r ≤ 1, the predicted value of y will almost always be closer to its mean than the x-value is to its mean (in terms of standard deviations). This illustrates the concept of “regression to the mean.”
[Diagram: the line ẑy = r·zx, with ẑy = r marked at zx = 1 and ẑy = −2r marked at zx = −2.]
This concept was first publicized by Francis Galton, who noticed that tall fathers had tall sons, but not quite as tall on average, and that short fathers had short sons, but not as short on average.
For example, suppose we were using the height of a father (x) to predict the height of his son (y) and
that r = 0.7. Then, if a father was 2 standard deviations below average in height, we would predict his
son’s height to be only ________ standard deviations below average.
Also, as the correlation gets closer to 0, our predicted values will be closer to the mean value of y. This seems reasonable since as the correlation gets weaker, we will have less confidence making predictions that are far from the mean.
So far, we have been working with standardized values. What if we wanted to use the original units?
ẑy = r·zx
(ŷ − ȳ)/sy = r·(x − x̄)/sx
ŷ − ȳ = r(sy/sx)(x − x̄)
ŷ = r(sy/sx)·x − r(sy/sx)·x̄ + ȳ
Since the slope of the regression line is the coefficient of x, b = r(sy/sx).
Once we know the slope, we can use the fact that (x̄, ȳ) is on the line to find the y-intercept: ȳ = a + b·x̄
For example, find the LSRL if x̄ = 65, sx = 4, ȳ = 150, sy = 30, r = .62.
Note: Unlike the correlation coefficient, x and y are not interchangeable when calculating the LSRL. Therefore, we should never use a y-value to try to predict an x-value.
Note: if r = 0, then the slope = 0 and the LSRL is horizontal: ŷ = ȳ. If there is no linear association, knowing x won’t help us predict y!
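The summary-statistics example above can be checked with a quick Python sketch of b = r·(sy/sx) followed by a = ȳ − b·x̄.

```python
# Checking the example: x-bar = 65, sx = 4, y-bar = 150, sy = 30, r = .62.
x_bar, sx = 65, 4
y_bar, sy = 150, 30
r = 0.62

b = r * sy / sx          # slope
a = y_bar - b * x_bar    # intercept, since (x_bar, y_bar) is on the line
print(f"y-hat = {a:.2f} + {b:.2f}x")   # y-hat = -152.25 + 4.65x
```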
HW #6 5.31-35, 42
Tuesday, October 16: 5.4 Unusual and Influential Points
The following data show the weight (in pounds) and cost (in dollars) of a sample of 11 stand mixers
(from Consumer Reports 11-05).
1. Sketch a scatterplot of the data and find the least squares regression line and correlation.
Weight  Price
23      180
28      250
19      300
17      150
25      300
26      370
21      400
32      350
16      200
17      150
8       30
2. What do you think will happen if we remove the outlier (the Walmart brand)? Sketch a scatterplot of
the remaining data and find the least squares regression line and correlation. How do they compare?
3. What if the outlier was made from an expensive lightweight alloy so that the observation was (8,
700)?
4. What if the first mixer in the data set went on clearance sale for $25? How would this change the regression line?
Summary: Any point that stands apart from the others is called an ____________. Since the LSRL must pass through the point (x̄, ȳ), points that are separated in the x-direction can be particularly ________________. We say they have high _______________.
When a point with high leverage lines up with the rest of the data, such as #1 above, the line won’t
change very much, but the correlation will be stronger.
When a point with high leverage does not line up with the rest of the data, such as #3 above, it can have
a large effect on both the line and the correlation. Note: Points with high leverage often do not have
large residuals, since they pull the line close to them.
Points that are near x̄ will usually not be influential, such as #4 above. Why not?
Regression Wisdom: Always graph the data and investigate outliers!
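The effect of the high-leverage Walmart mixer can be sketched in Python using the data above (assuming the weight/price pairing as listed): because the point lines up with the overall trend, removing it noticeably weakens the correlation.

```python
import math

# Effect of the high-leverage point (8 lb, $30) on r, using the
# stand-mixer data; the outlier is the last pair in each list.
weights = [23, 28, 19, 17, 25, 26, 21, 32, 16, 17, 8]
prices = [180, 250, 300, 150, 300, 370, 400, 350, 200, 150, 30]

def corr(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_all = corr(weights, prices)
r_no_outlier = corr(weights[:-1], prices[:-1])
print(round(r_all, 3), round(r_no_outlier, 3))
# The high-leverage point lines up with the trend, so including it
# makes the correlation stronger.
```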
Applet: http://www.stat.uiuc.edu/courses/stat100//java/guess/PPApplet.html
This allows you to add/subtract points and see the effect on the correlation, LSRL, and RMSE.
Applet: http://statweb.calpoly.edu/chance/applets/LRApplet.html
Allows you to move points around to see changes in r and the LSRL. Dynamic!
AP Question:
HW #7: See next page
HW #7: When will the Cherry Blossoms Appear? (from Dan Teague, NCSSM)
The anticipation of the first blooms of spring flowers is one of the joys of April. One of the most
beautiful is that of the Japanese cherry tree. Experience has taught us that, if the spring has been a warm
one, the trees will bloom early, but if the spring has been cool, then the blossoms will appear later. Mr.
Yamada is a gardener who has observed the date in April when the blossoms first appear for the last 24
years. His son, Hiro, went on the internet and found the average March temperature (in degrees Celsius)
in his area for those years. The data is below. To verify that you entered the data correctly, the mean
temperature is 4.321 and the mean days is 12.875.
1. Why should temperature be the explanatory variable? Explain.
2. Draw a scatterplot and discuss the noticeable features. Is one variable completely dependent on the other?
3. Calculate the least squares line and graph it on the scatterplot.
4. Interpret the slope in the context of the problem.
5. Interpret the x- and y-intercepts in the context of the problem.
6. Find the value of the correlation coefficient.
7. If the temperature was measured in degrees Fahrenheit, how would this value change?
8. If r is high, can we conclude that a change in temperature causes the blooms to appear at different times? Explain.
9. Calculate and interpret the residual for the first point in the data set.
10. Sketch the residual plot. What does it tell you?
11. Calculate and interpret the values of r² and se in the context of the problem.
12. If you were to use number of hours instead of number of days, how would the values of r² and se change?
13. Predict the date of first bloom for an average March temperature of 3.5˚.
14. Which observation do you think is most influential? Explain.
15. Which observation had the biggest residual? Is it unusually large?
Temp  Days
4.0   14
5.4    8
3.2   11
2.6   19
4.2   14
4.7   14
4.9   14
4.0   21
4.9    9
3.8   14
4.0   13
5.1   11
4.3   13
1.5   28
3.7   17
3.8   19
4.5   10
4.1   17
6.1    3
6.2    3
5.1   11
5.0    6
4.6    9
4.0   11
Thursday, October 18: Review chapter 5
Distribute Projects: Proposal due Thursday
Review for test
AP Question
Monday, October 22: Test Chapter 5
Tuesday, October 23: Review for Midterm
AP Questions
Work on proposals
Thursday, October 25: Review for Midterm
Data exploration: The Cootie problem
Proposals due!
Monday, October 29: Midterm