CH 3 Student Notes - Princeton High School

AP Statistics
Chapter 3
Chapter 3 – Examining Relationships
Scatterplots, Correlation, and Least – Squares Regression
Section 3.1: Variables, Scatterplots, Interpreting Scatterplots, Correlation
• Define vocabulary and types of variables
• Construct a scatterplot using Excel and/or your graphing calculator
• Interpret various scatterplots
• Add categorical data to a scatterplot on your calculator
Scatterplot = displays the relationship between two quantitative variables measured
on the same individuals.
• Can examine the nature of a relationship between two variables
• Examine if a variable can explain changes in the other variable.
• If predicting an outcome, an explanatory and response variable must be
identified.
• Naming variables in this way does not mean one variable causes the changes
in the other. Remember: causation is very difficult to prove.
Explanatory Variable
• Attempts to explain the observed outcomes
• Any factor that can influence the response
• The variable that attempts to explain the relationship with the other variable.
• If there is one, it will be the variable on the x - axis
Response Variable
• Measures an outcome of a study
• Variable that responds to the other variable being used.
• If there is one, it will be the variable on the y - axis
Each individual in the data appears as the point in the plot fixed by
the values of both variables for that particular individual.
Explanatory/Response Variable Process
• Look at the data you have been given with a graph
• Calculate any numerical summary you can for the data
o 2-Var Stats
• Summarize the shape, overall patterns, outliers, etc. (see correct statistical
vocabulary)
• If conditions are met, find a mathematical model for the data.
To construct a scatterplot:
• Use graphing calculator.
• Use Excel – not all that useful
Interpreting Scatterplots
• Describing the overall pattern of the data
• Look for any deviations from the pattern.
• Form (clusters, linear, curved)
• Direction (negatively, positively associated, no association)
• Strength
o How closely the points follow a clear form (How closely
together the points are and how closely they resemble a line)
o Strong, moderate, weak, or a combination of these words.
• Outlier (observation that falls outside the overall pattern of the
relationship)
• Clusters (often when looking at the subject of the data, a possible
explanation can come about)
Positively Associated
• High values of one tend to accompany high values of the other variable
• Low values of one tend to accompany low values of the other variable
Negatively Associated
• High values of one tend to accompany low values of the other variable.
• Low values of one tend to accompany high values of the other variable
Nonlinear data
• Data does not have to have a linear relationship
• Certain data can be transformed to show a linear scatterplot (Chapter
12)
Activity - Height vs. Shoe Size
Is there a relationship between your height and your shoe size?
• On the slip of paper provided, label and complete like the following using
your OWN scores:
o Height: 67 inches
o Shoe size: 8 (men’s)
• Go to L1 and L2 on your calculator, copy the data from my calculator
• Create a scatterplot on your calculator using the data.
Is there a relationship between your height and your shoe size? Interpret the
scatterplot.
Correlation
• Calculate the correlation using your graphing calculator.
• Interpret data using a scatterplot and the numeric summary
1. Interpreting Scatterplots – Graph Analysis (Visual)
• Form – clusters, linear, curved
• Direction – positive, negative, no association
• Strength – weak, moderate, strong relationship
• Discuss outliers (if present)
• If no outliers, STATE there are no outliers.
2. Numeric Analysis
• Correlation
o Numerical measure of the direction and strength of a linear
relationship between two variables.
o “r”
where $r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
o Knowing the formula helps in understanding the important
properties of correlation.
o Correlation is calculated on your graphing calculator.
o Turn on “DiagnosticOn” (found under Catalog) in order to see it.
o Essentially the average of the products of the standardized values (the sum
is divided by n − 1 rather than n).
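The formula for r can be traced step by step in a few lines of Python. This is only a sketch for checking calculator output: the function name `correlation` and the height and shoe size numbers are made up for illustration and are not class data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r: the sum of the products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, invented for illustration only
heights = [62, 64, 66, 67, 70, 72]   # inches
shoe_sizes = [6, 7, 8, 8, 10, 11]

r = correlation(heights, shoe_sizes)
print(round(r, 4))
```

Each term in `products` is positive when a point is on the same side of both means and negative otherwise, which is why the sign of r matches the direction of the association.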
Facts about Correlation
1. Makes no distinction between explanatory and response variables
2. Requires both variables be quantitative (no categorical)
3. It has no unit of measurement
4. A positive value indicates a positive association and a negative value
indicates a negative association
5. The value will always be a number between “-1” and “1”, inclusive
6. ONLY measures strength of linear relationships
7. It is NOT a resistant measure; therefore, it is strongly affected by outliers
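Facts 1, 3, and 7 can be checked numerically. The sketch below uses invented data and a hypothetical helper named `correlation`; it computes r from deviations about the means, which is algebraically the same as the standardized-value formula because the n − 1 terms cancel.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r computed from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, invented for illustration only
heights_in = [62, 64, 66, 67, 70, 72]
shoe_sizes = [6, 7, 8, 8, 10, 11]

r = correlation(heights_in, shoe_sizes)

# Fact 1: swapping explanatory and response does not change r
r_swapped = correlation(shoe_sizes, heights_in)

# Fact 3: changing units (inches to centimeters) does not change r
heights_cm = [2.54 * h for h in heights_in]
r_cm = correlation(heights_cm, shoe_sizes)

# Fact 7: a single outlier (a tall person with a very small shoe size) changes r a lot
r_outlier = correlation(heights_in + [75], shoe_sizes + [4])

print(round(r, 4), round(r_swapped, 4), round(r_cm, 4), round(r_outlier, 4))
```

The first three printed values agree, while the fourth drops sharply after adding just one unusual point.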
Correlation is NOT a complete summary of two data sets.
• When reporting a numerical analysis, ALSO report the following:
§ Mean of data set #1
§ Standard Deviation of data set #1
§ Mean of data set #2
§ Standard Deviation of data set #2
Interpreting Correlation
Correct Vocabulary to use when interpreting Correlation (the
strength):
• Very Weak
• Weak
• Moderately Weak
• Moderate
• Moderately Strong
• Strong
• Very Strong
Section 3.2: Least Squares Regression and $r^2$
• Calculate a least squares regression line
• Calculate the coefficient of determination.
• Interpret your findings.
Least Squares regression (LSRL)
• A mathematical model of the data.
• Method for finding a line that summarizes the relationship between two
variables.
• A line that is as close as possible to the points in the vertical direction.
• The line should make the vertical distances of the points as small as
possible since we are predicting “y”.
• Prediction error occurs in the “y” value.
• Round values to 4 decimal places when writing it down.
Least – Squares Regression Line
• The least – squares regression line of “y” on “x” is the line that makes the
sum of the squares of the vertical distances of the data points from the line
as small as possible.
Regression Line
• Straight line that describes how a response variable “y” changes as an
explanatory variable “x” changes.
• Used to make predictions of “y” values given certain “x” values
• Requires explanatory and response variables to be assigned.
• Calculate a linear regression on your calculator using “LinReg(a+bx)”
(number 8 in the list under “Stat”, “Calc”)
o Keeps consistent with the notation used on the AP and in our
textbook.
Equation for Least – Squares Regression Line
$\hat{y} = a + bx$
where
slope => $b = r \frac{s_y}{s_x}$
intercept => $a = \bar{y} - b\bar{x}$
• Important note:
y = observed value of the variable
ŷ (y – hat) = predicted value of the variable
• Slope
o Rate of Change
o The amount of change in ŷ when “x” increases by 1.
• Intercept
o Needed for drawing the LSRL
o Only statistically meaningful when the explanatory variable “x” can take a
value close to zero.
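The slope and intercept formulas can be followed in code as a check on “LinReg(a+bx)”. This is a minimal sketch, not the calculator's actual routine; the function name `least_squares_line` and the data are invented for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: sum of products of standardized values, divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, invented for illustration only
heights = [62, 64, 66, 67, 70, 72]   # explanatory variable (inches)
shoe_sizes = [6, 7, 8, 8, 10, 11]    # response variable

a, b = least_squares_line(heights, shoe_sizes)
print(f"y-hat = {a:.4f} + {b:.4f}x")

# Predicted shoe size for someone 68 inches tall
predicted = a + b * 68
print(round(predicted, 4))
```

With this made-up data the intercept comes out negative. It is the predicted shoe size at a height of 0 inches, which illustrates why the intercept has no practical meaning when “x” cannot be near zero.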
Coefficient of Determination - $r^2$
• The fraction of variation in the values of “y” that is explained by
the least – squares regression of “y” on “x”
• The proportion of the total sample variability that is explained by
the least – squares regression of “y” on “x”.
• The percent variation in “y” that can be explained by the least –
squares regression of “y” on “x”
• To have a large percentage, the SSE (sum of squares for error) =>
$\sum (y - \hat{y})^2$ must be small.
• Example Interpretation: “83% of the variation in (response variable)
is explained by the least – squares regression of (response variable)
on (explanatory variable).”
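To see why a small SSE gives a large percentage, the coefficient of determination can be computed as 1 − SSE/SST, where SST (total sum of squares, a label not used elsewhere in these notes) is the sum of squared deviations of “y” from its mean. The sketch below uses invented data and a hypothetical function name `r_squared`.

```python
def r_squared(xs, ys):
    """Coefficient of determination computed as 1 - SSE/SST."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope and intercept (equivalent to b = r*s_y/s_x, a = y_bar - b*x_bar)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # variation left over
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation in y
    return 1 - sse / sst

# Hypothetical data, invented for illustration only
heights = [62, 64, 66, 67, 70, 72]
shoe_sizes = [6, 7, 8, 8, 10, 11]

r2 = r_squared(heights, shoe_sizes)
print(f"{100 * r2:.1f}% of the variation in shoe size is explained by "
      "the least-squares regression of shoe size on height")
```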
Important concepts about the Least Squares Regression/Coefficient
of Determination
1. The distinction between the explanatory and response variables is
essential in regression.
• A line that is as close as possible to the points in the vertical direction.
• Switching which variable is treated as the response (“y”) will change the
least – squares regression line.
2. There is a close connection between correlation and slope of a LSRL.
• A change of one standard deviation in “x” corresponds to a change of r
standard deviations in “y”.
• $b = r \frac{s_y}{s_x}$ or $b = \frac{r \cdot s_y}{s_x}$
• “r” is rarely, if ever, “1” or “-1”. Therefore, the
standard deviation of “y” will be multiplied by a fraction between “1”
and “-1”.
3. The least – squares regression line will always pass through the point
$(\bar{x}, \bar{y})$ on the graph of “y” against “x”.
4. The coefficient of determination ($r^2$) is the fraction of the variation in
values of “y” that can be explained by the least – squares regression of
“y” on “x”.
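Concepts 2 and 3 can be verified numerically. In this sketch the slope is found directly by least squares and then compared with r times the ratio of standard deviations; the data and variable names are invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only
xs = [62, 64, 66, 67, 70, 72]
ys = [6, 7, 8, 8, 10, 11]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Slope and intercept found directly by least squares
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Correlation from standardized values
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

# Concept 2: the least-squares slope equals r * s_y / s_x
slope_matches = abs(b - r * s_y / s_x) < 1e-9

# Concept 2 again: moving one standard deviation in x changes the
# predicted y by r standard deviations of y
change_in_sd_units = ((a + b * (x_bar + s_x)) - (a + b * x_bar)) / s_y

# Concept 3: the line passes through (x_bar, y_bar)
passes_through_means = abs((a + b * x_bar) - y_bar) < 1e-9

print(slope_matches, passes_through_means, round(change_in_sd_units, 4), round(r, 4))
```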
Section 3.2 (again) – Residuals and Residual Plots
• Calculate residuals using a graphing calculator
• Construct a residual plot using your graphing calculator.
• Interpret your findings.
Coefficient of Determination - $r^2$
• Percent or proportion of the variation in the response variable that is
explained by the least – squares regression on the explanatory variable.
Linear Regression Line
• Mathematical model for the overall pattern of a linear relationship
between an explanatory variable and a response variable.
• Least – squares regression line makes the vertical distance from
observations to the LSRL as small as possible.
Residuals:
• The small vertical distances between observations and the LSRL represent
left-over variation in the response variable.
• Difference between the response variable’s observed value and the
predicted value from the regression line.
• Residual = y − ŷ for each data point
Residuals:
• Show how far the data fall from the regression line.
• Positive values mean the data point lies above the regression line
• Negative values mean the data point lies below the regression line
Residuals: The mean of the least – squares residuals is always zero. (You will
find it is almost equal to zero, due to rounding errors.)
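A short sketch of the residual calculation, using invented data and a hypothetical helper `fit_line`, shows both the sign convention and the near-zero mean:

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, invented for illustration only
heights = [62, 64, 66, 67, 70, 72]
shoe_sizes = [6, 7, 8, 8, 10, 11]

a, b = fit_line(heights, shoe_sizes)

# Residual = observed y minus predicted y-hat, one per data point
residuals = [y - (a + b * x) for x, y in zip(heights, shoe_sizes)]

for x, e in zip(heights, residuals):
    position = "above" if e > 0 else "below"
    print(f"x = {x}: residual = {e:.4f} ({position} the line)")

# The mean is zero up to floating-point rounding
mean_residual = sum(residuals) / len(residuals)
print(f"mean residual = {mean_residual:.10f}")
```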
Residual Plots:
• A scatterplot of the regression residuals against the explanatory variable.
• The straight line at zero corresponds to the regression line.
• Assesses the fit of a regression line
Interpretation of Residual Plots
• IF the Residual plot shows no systematic pattern,
o THEN the LSRL captures the overall relationship between the
explanatory and the response
Other Interpretations of Residual Plots
• A curved pattern shows that the relationship is NOT linear.
• This means that an LSRL is not the correct method for this data.
• Increasing or decreasing spread about the line as “x” increases indicates
the prediction of “y” will be less accurate for those “x” where the increased
spread is occurring.
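A curved pattern is easy to produce on purpose. The sketch below fits a line to data that follow y = x² exactly (an invented example, with a hypothetical helper `fit_line`) and lists the residuals in order of “x”.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Invented, deliberately nonlinear data: y = x squared
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [round(y - (a + b * x), 4) for x, y in zip(xs, ys)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals start positive, dip negative in the middle, and turn positive again. That U shape in a residual plot is the systematic pattern that signals a line is the wrong model.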
NECESSARY STEP:
• Prior to calculating residuals and a residual plot, you need to complete a
scatterplot and a least – squares regression
• If you do not complete the above step, there will be NO residual plot for
the data you are using.
Residuals and Residual Plots
Example:
Does the age at which a child begins to talk predict a later score on a test
of mental ability?
a) Enter the data into your graphing calculator and construct a scatterplot
ONLY. Interpret your findings from the scatterplot.
b) In the context of the problem, describe the relationship you observe from
the scatterplot.
c) Calculate the correlation and interpret your findings.
d) Find the least – squares regression line for the data (round to 4 decimal
places)
e) Calculate the coefficient of determination and interpret your findings.
f) Use the least – squares regression equation to predict Child #7’s, 13’s, and
20’s scores.
g) Using your graphing calculator, find all the residuals for the data set.
(Done under the “Edit List” feature) Write down the first 2 residual values.
h) Make a residual plot. Interpret your residual plot. What does it tell you
about your least – squares regression line for the data?
3.69 The mean height of American women in their early twenties is about 64.5
inches and the standard deviation is about 2.5 inches. The mean height of men
the same age is about 68.5 inches, with standard deviation about 2.7 inches. If the
correlation between the heights of husbands and wives is about r = 0.5, what is
the slope of the regression line of the husband’s height on the wife’s height in
young couples? Use your calculator to get a regression line. Predict the height of
the husband of a woman who is 67 inches tall.
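When only means, standard deviations, and r are given, the line comes straight from the slope and intercept formulas. The sketch below is one way to check a hand calculation for 3.69; the variable names are chosen here for illustration.

```python
# Summary statistics stated in problem 3.69
wife_mean, wife_sd = 64.5, 2.5         # explanatory variable x (inches)
husband_mean, husband_sd = 68.5, 2.7   # response variable y (inches)
r = 0.5

# b = r * s_y / s_x and a = y_bar - b * x_bar
slope = r * husband_sd / wife_sd
intercept = husband_mean - slope * wife_mean

# Predicted husband's height for a wife who is 67 inches tall
predicted_husband = intercept + slope * 67

print(round(slope, 4), round(intercept, 4), round(predicted_husband, 4))
# 0.54 33.67 69.85
```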
3.57 In Professor Friedman’s economics course the correlation between the
students’ total scores prior to the final examination and their final examination
scores is r = 0.6. The pre-exam totals for all students in the course have mean 280
and standard deviation 30. The final exam scores have mean 75 and standard
deviation 8. Professor Friedman has lost Julie’s final exam but knows that her
total before the exam was 300. He decides to predict her final exam score from
her pre-exam total.
(a) What is the slope of the least-squares regression line of final exam scores on
pre-exam total scores in this course? What is the intercept?
(b) Use the regression line to predict Julie’s final exam score.
(c) Julie doesn’t think this method accurately predicts how well she did on the
final exam. What is $r^2$? Use the value you get to argue that her actual score
could have been much higher (or much lower) than the predicted value.
Section 3.2 (still) – Influential Observations
• Define influential points
• Determine if a data point is an outlier or an influential point
• Interpret your findings.
Influential Points
• An observation that, when removed from the data, makes a distinct
change in the result of the calculation(s).
§ Slope
§ Correlation
§ Coefficient of Determination
• Influential Point - an outlier with a small residual value.
• Often, not always, they are outliers in the “x” direction of a scatterplot.
• Located in an extreme position on the explanatory scale.
Is it an Influential Point?
To Test Influential Points
• Calculate correlation, least – squares regression, etc. WITH the
observation (potential influential point) included in the data.
• Then, REMOVE the potential influential point and run the calculations
again.
• COMPARE results to see if it had a strong influence on the calculations.
• Graph the data without the potential influential point on a scatterplot with
the LSRL. Is it a better fit of the data?
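The with/without comparison can be sketched as follows. The data are invented (they are not the Gesell data), and the helper names `summarize` and `compare_without` are hypothetical.

```python
from math import sqrt

def summarize(xs, ys):
    """Return slope, intercept, and correlation for the least-squares line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    return {"slope": slope, "intercept": y_bar - slope * x_bar, "r": s_xy / sqrt(s_xx * s_yy)}

def compare_without(xs, ys, index):
    """Run the calculations WITH all points, then again with one point REMOVED."""
    with_point = summarize(xs, ys)
    xs_removed = xs[:index] + xs[index + 1:]
    ys_removed = ys[:index] + ys[index + 1:]
    return with_point, summarize(xs_removed, ys_removed)

# Invented data: the last point is extreme in the x direction
xs = [1, 2, 3, 4, 5, 15]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 3.0]

with_point, without_point = compare_without(xs, ys, index=5)
for key in ("slope", "intercept", "r"):
    print(f"{key}: with = {with_point[key]:.4f}, without = {without_point[key]:.4f}")
```

Here removing the single extreme point moves the slope from about 0 to about 1 and r from about 0 to above 0.99, so that point would be judged influential.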
Continuation From Section 3.3 (again) – Residuals (from yesterday)
• Use the example about the age at which children speak and their test score
on the Gesell exam to complete the following.
• You may have to retype the data or rerun the information if you deleted it.
j) ONLY Remove Child 19 from the data. Construct:
• A scatterplot
• Least – squares regression (into Y2) (keep original LSRL in
Y1)
• Correlation.
On the scatterplot, compare your original least – squares regression (Y1) and the
new one (Y2). Write down the new regression line and correlation.
k) Put Child 19 back into your data lists.
l) ONLY Remove Child 18 from the data. Construct:
• A scatterplot
• Least – squares regression (into Y2) (keep original LSRL in
Y1)
• Correlation.
On the scatterplot, compare your original least – squares regression (Y1) and the
new one (Y2). Write down the new regression line and correlation.
m) Compare the regression line in part j (When Child 19 is removed) with the
original regression line from yesterday. Does Child 19 appear to be an influential
point? If so, why?
n) Compare the regression line in part l (When Child 18 is removed) with the
original regression line from yesterday. Does Child 18 appear to be an influential
point? If so, why?
Minitab
• We will not use the program, but we will interpret information and outputs
provided in the book.
• For the AP, you need to be able to read the information provided in a
Minitab output
• This output is on Page 156 in the textbook.