Lesson 1: Section 12.1 (part 1)
Objectives
• Check conditions for performing inference about the
slope β of the population (true) regression line.
• Interpret computer output from a least-squares
regression analysis.
ACTIVITY: Does Seat Location Matter?
• Many people believe that students learn better if they
sit closer to the front of the classroom. Does sitting
closer cause higher achievement, or do better students
simply choose to sit in the front?
• To investigate, an AP Statistics teacher randomly
assigned students to seat locations in the classroom for
a particular chapter and recorded each student's test
score at the end of the chapter.
• The explanatory variable in this experiment is the row
to which the student was assigned (Row 1 is closest to
the front and Row 7 is farthest away).
• Here are the results, by row (the slide also shows a scatterplot with the
least-squares regression line):
Row 1: 76, 77, 94, 99
Row 2: 83, 85, 74, 79
Row 3: 90, 88, 68, 78
Row 4: 94, 72, 101, 70, 79
Row 5: 76, 65, 90, 67, 96
Row 6: 88, 79, 90, 83
Row 7: 79, 76, 77, 63
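The least-squares line for these results can be recomputed directly from the scores. The sketch below is one way to do it in plain Python; the names `scores_by_row` and `least_squares` are made up for this illustration and are not part of the lesson.

```python
# Seating-chart data from the slide: row number -> test scores.
scores_by_row = {
    1: [76, 77, 94, 99],
    2: [83, 85, 74, 79],
    3: [90, 88, 68, 78],
    4: [94, 72, 101, 70, 79],
    5: [76, 65, 90, 67, 96],
    6: [88, 79, 90, 83],
    7: [79, 76, 77, 63],
}

def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# One (row, score) pair per student.
x = [row for row, scores in scores_by_row.items() for _ in scores]
y = [score for scores in scores_by_row.values() for score in scores]

a, b = least_squares(x, y)
print(f"y-hat = {a:.3f} + ({b:.4f}) x")
```

The slope and intercept it prints should agree, up to rounding, with the coefficients in the computer output shown later in the lesson.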
1.) Identify and interpret the slope of the least-squares regression line in
this context.
2.) Explain why it is important to randomly assign the students to seats
rather than letting each student choose his or her own seat.
3.) Does the negative slope provide convincing evidence that sitting closer
causes higher achievement, or is it plausible that the association is due to
chance variation in the random assignment? Complete the
simulation and find out!
• Share the “P-values” from each group. Can we draw
any conclusions from these results?
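The classroom simulation can be approximated in code. The sketch below assumes the simulation works by re-dealing the 30 observed scores to the 30 seats at random and recording the slope each time; the slide does not spell out the procedure, so treat this as one plausible version. The seed and the number of trials are arbitrary choices.

```python
import random

# Row number for each of the 30 students, followed by their scores (from the slide).
rows = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 5 + [5] * 5 + [6] * 4 + [7] * 4
scores = [76, 77, 94, 99, 83, 85, 74, 79, 90, 88, 68, 78,
          94, 72, 101, 70, 79, 76, 65, 90, 67, 96,
          88, 79, 90, 83, 79, 76, 77, 63]

def slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

observed = slope(rows, scores)

rng = random.Random(12)   # arbitrary seed so the run is repeatable
n_trials = 5000           # arbitrary number of re-randomizations
shuffled = scores[:]
count = 0
for _ in range(n_trials):
    rng.shuffle(shuffled)                    # deal the scores to seats at random
    if slope(rows, shuffled) <= observed:    # slope at least as negative as observed
        count += 1

p_sim = count / n_trials
print(f"observed slope = {observed:.4f}, simulated P-value = {p_sim:.3f}")
```

The simulated P-value estimates how often chance alone in the random assignment produces a slope as negative as the one observed. A value that is not small means chance variation remains a plausible explanation.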
Inference for Linear Regression
• The least-squares regression line for the population is
called the population regression line (or true
regression line) and is written in the form µy = α + βx
(PARAMETER)
• The least-squares regression line from a sample is called
the sample regression line (or estimated regression
line) and can be written in the form ŷ = a + bx
or ŷ = b0 + b1x
(STATISTIC)
• Each sample will give a slightly different slope and
y-intercept because of sampling variation.
• www.rossmanchance.com/applets
Sampling Distribution of b
We will perform inference about the slope using
confidence intervals and significance tests. Both are based
on the sampling distribution of b (the slope of the
sample regression line).
• Like any distribution, it can be described by its shape, center, and
spread.
• SHAPE: Is it roughly symmetric and unimodal? Does a
Normal probability plot appear linear?
• CENTER: The mean of the sampling distribution of b equals the
true slope β, so b is an unbiased estimator of β.
• SPREAD: Measured by the standard deviation of b. We will discuss this later.
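To see a sampling distribution of b, one can simulate many samples from a population whose true line is known. In the sketch below the values of `ALPHA`, `BETA`, and `SIGMA` are hypothetical, chosen only for illustration, and the x-values reuse the seat layout from the activity.

```python
import random
import statistics

# Hypothetical population values (assumptions, not taken from the lesson).
ALPHA, BETA, SIGMA = 85.0, -1.0, 10.0

# Same layout as the activity: 4, 4, 4, 5, 5, 4, 4 students in Rows 1-7.
rows = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 5 + [5] * 5 + [6] * 4 + [7] * 4

def slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(1)    # arbitrary seed so the run is repeatable
slopes = []
for _ in range(2000):
    # Each response is Normal with mean ALPHA + BETA*x and standard deviation SIGMA.
    y = [ALPHA + BETA * xi + rng.gauss(0, SIGMA) for xi in rows]
    slopes.append(slope(rows, y))

center = statistics.mean(slopes)    # should land close to BETA
spread = statistics.stdev(slopes)   # standard deviation of the sample slopes
print(f"mean of slopes = {center:.3f}, SD of slopes = {spread:.3f}")
```

Under these assumptions the mean of the simulated slopes should sit close to `BETA`, which illustrates the CENTER statement. A histogram of `slopes` would show the shape.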

CONDITIONS for Regression Inference
• REMEMBER: LINER
• Linear: The actual relationship between x and y is
linear. For any fixed value of x, the mean response µy
falls on the population (true) regression line µy = α + βx.
(α and β are the unknown parameters.)
• Independent: Individual observations are
independent of each other.
• Normal: For any fixed value of x, the response y varies
according to a Normal distribution.
• Equal variance: The standard deviation of y (call it σ)
is the same for all values of x. (σ is usually unknown.)
• Random: The data come from a well-designed random
sample or randomized experiment.
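Taken together, the Linear, Independent, Normal, and Equal variance conditions describe the model below. This is a standard way of writing it rather than something shown on the slide. The notation N(0, σ) is assumed here to give the standard deviation, not the variance, as its second argument.

```latex
y_i = \alpha + \beta x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma) \ \text{independently},
\qquad \text{so that} \quad \mu_y = \alpha + \beta x .
```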
How to check conditions
• Linear: Scatterplot: see if the overall pattern is linear.
Residual plot: see if the residuals center on the “residual = 0”
line at each x-value in the residual plot.
• Independent: Look at how the data were produced and
make sure each observation is independent of the
others. If the sampling is done without replacement, check
the 10% rule.
• Normal: Make a stemplot, histogram, or Normal probability
plot of the residuals and check for skewness or other major
departures from Normality.
• Equal variance: Look at the scatter in the residual plot:
the spread of the residuals above and below the “residual = 0”
line should be about the same from the smallest to the largest x-value.
• Random: See if the data were produced by random
sampling or a randomized experiment.
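The residual-based checks can be backed up with a few numbers. The sketch below uses the seating-chart data from the activity and prints the mean and standard deviation of the residuals within each row. It is a rough numeric companion to the plots, not a replacement for them, and the variable names are made up for this illustration.

```python
import statistics

# Seating-chart data from the activity: row number -> test scores.
scores_by_row = {
    1: [76, 77, 94, 99],
    2: [83, 85, 74, 79],
    3: [90, 88, 68, 78],
    4: [94, 72, 101, 70, 79],
    5: [76, 65, 90, 67, 96],
    6: [88, 79, 90, 83],
    7: [79, 76, 77, 63],
}
x = [row for row, scores in scores_by_row.items() for _ in scores]
y = [score for scores in scores_by_row.values() for score in scores]

# Fit the least-squares line y-hat = a + b*x.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Residual = observed score minus predicted score.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Linear: the mean residual in each row should be near 0.
# Equal variance: the residual SD should be similar from row to row.
for row in scores_by_row:
    res = [r for xi, r in zip(x, residuals) if xi == row]
    print(f"Row {row}: mean residual = {statistics.mean(res):6.2f}, "
          f"SD = {statistics.stdev(res):5.2f}")

# Normal: the sorted residuals can be used to build a stemplot or histogram by hand.
print([round(r, 1) for r in sorted(residuals)])
```

With only four or five students per row, the row-by-row numbers are noisy, so modest differences between rows are not by themselves evidence of a violation.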
EXAMPLE
• Check whether the conditions for
performing inference
about the regression
model are met.
EXAMPLE: Answer
• LINEAR: The scatterplot shows a weak linear relationship.
The residual plot does not show any obvious leftover patterns
indicating that this condition has been violated.
• INDEPENDENT: Students were randomly assigned to seats
and were monitored for cheating, so knowing one student’s
score should give no additional information about another
student’s score.
• NORMAL: The histogram of the residuals is roughly
unimodal and symmetric, and the Normal probability plot is
roughly linear.
• EQUAL VARIANCE: Although there is a different amount of
variability in each row in the residual plot, the differences
aren’t large and there is no systematic pattern.
• RANDOM: The students were assigned to seats at random.
Because there are no serious violations of the
conditions, we should be safe performing inference
about the regression model in this setting.
Back to ACTIVITY: Does Seat Location Matter?
• Here is the computer output for the least-squares regression
analysis on the seating-chart data from the previous
Alternate Activity:

Predictor   Coef      SE Coef   T       P
Constant    85.706    4.239     20.22   0.000
Row         -1.1171   0.9472    -1.18   0.248

• Problem: (a) State the equation of the least-squares
regression line. Define any variables you use.
• (b) Interpret the slope, y-intercept (if possible), and
standard deviation of the residuals.
• (c) Preview: If we performed a significance test, would we
find convincing evidence of a negative
relationship between row number and test score?
Predictor   Coef      SE Coef   T       P
Constant    85.706    4.239     20.22   0.000
Row         -1.1171   0.9472    -1.18   0.248
S = 10.0673   R-sq = 4.7%   R-sq (adj) = 1.3%

• ANSWER: (a) ŷ = 85.706 - 1.1171x, where ŷ = predicted test score
and x = row number.
• (b) Slope: For each additional row from the front of the class, the
test score is predicted to go down by 1.1171 points, on average.
y-intercept: A value of x = 0 does not make sense because there is
no Row 0, so the y-intercept has no practical interpretation here.
Standard deviation of the residuals: When we use the least-squares
regression line to predict test score from a student’s row number,
our predictions will typically be off by about 10.0673 points.
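The remaining entries in the computer output (S, SE Coef, T, P, and R-sq) can be reproduced from the raw scores. The sketch below uses the usual formulas with n - 2 degrees of freedom. It finds the two-sided P-value by numerically integrating the t density with Simpson's rule, a choice made only so the example needs nothing beyond the standard library. The function names are hypothetical.

```python
import math

rows = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 5 + [5] * 5 + [6] * 4 + [7] * 4
scores = [76, 77, 94, 99, 83, 85, 74, 79, 90, 88, 68, 78,
          94, 72, 101, 70, 79, 76, 65, 90, 67, 96,
          88, 79, 90, 83, 79, 76, 77, 63]

n = len(rows)
x_bar, y_bar = sum(rows) / n, sum(scores) / n
sxx = sum((xi - x_bar) ** 2 for xi in rows)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(rows, scores))
syy = sum((yi - y_bar) ** 2 for yi in scores)

b = sxy / sxx                 # slope ("Coef" for Row)
a = y_bar - b * x_bar         # y-intercept ("Coef" for Constant)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(rows, scores))
df = n - 2
s = math.sqrt(sse / df)       # standard deviation of the residuals ("S")
se_b = s / math.sqrt(sxx)     # standard error of the slope ("SE Coef" for Row)
t = b / se_b                  # test statistic for H0: beta = 0
r_sq = 1 - sse / syy          # coefficient of determination ("R-sq")

def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P-value via Simpson's rule on [0, |t|] (steps must be even)."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3   # P(-|t| < T < |t|)
    return 1 - central_area

p_two_sided = two_sided_p(t, df)
print(f"S = {s:.4f}, SE of slope = {se_b:.4f}, t = {t:.2f}, "
      f"two-sided P = {p_two_sided:.3f}, R-sq = {100 * r_sq:.1f}%")
```

The P column in the output is two-sided. For the one-sided question in part (c), and because the observed t is negative, the P-value for a negative relationship would be half of the two-sided value.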
Homework
• Assigned reading: pp. 739-744
• Complete HW problems: p. 759 #1-4
• Check your answers to the odd-numbered problems.