Download unit three - KSU Web Home

UNIT THREE
Chi-square test of independence; correlation coefficient; regression analysis
CHI-SQUARE TEST OF INDEPENDENCE
A chi-square test of independence is designed to assess whether, for some population of entities,
two categorical variables are independent of one another. (If they are not independent, then they are
related.) For example, you might wonder whether, for U.S. undergraduates, gender (with the two
categories Male and Female) and college major (with the six categories Humanities or Social Sciences,
Arts, Math or Science, Business, Education, and Other) are independent. If the variables were
independent, that would mean that--for each possible major--the proportion of males choosing that major
equals the proportion of females choosing that major. Alternatively one could say that if the variables
were independent, the percentage breakdown of males by major would be identical to the percentage
breakdown of females by major. One could test H0: gender and college major are independent vs. H1:
gender and college major are related by gathering gender and college major data on a random sample of
U.S. undergraduates. The hypothesis testing procedure is outlined on the last page of this packet.
Incidentally, even quantitative variables can be “cast” as categorical variables. For example, the variable
Income can have as possible categories: “below $30,000,” “$30,000-$60,000,” and “above $60,000.”
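The test statistic can be computed directly from a contingency table of observed counts. Below is a minimal sketch with made-up gender-by-major counts (only three majors, purely for brevity); it uses the closed-form tail probability that happens to hold for 2 degrees of freedom, where a chi-square table would be used in general:

```python
import math

# Hypothetical 2x3 table of observed counts (made-up numbers, for illustration):
# rows = gender (Male, Female), columns = three college majors
observed = [[30, 20, 10],
            [25, 35, 15]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected count for each cell under H0 (independence): e = (row total)(column total)/n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Test statistic: chi-square = sum of (f - e)^2 / e over all cells
chi_sq = sum((f - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for f, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows - 1)(columns - 1) = 2

# For exactly 2 d.f. the chi-square upper-tail probability has the closed form
# exp(-x/2); in general, use a chi-square table or software
p_value = math.exp(-chi_sq / 2)
print(round(chi_sq, 2), df, round(p_value, 3))
```

A large chi-square (equivalently, a small p-value) is evidence against independence.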
Because the test statistic in a chi-square test of independence has approximately a chi-square
distribution (hence the term chi-square in the name of the test), here is a brief introduction to that
distribution. There is a family of chi-square distributions; each member of the family has a certain
number (v, where v is a positive integer) degrees of freedom. The chi-square distribution with v degrees
of freedom is the distribution of the sum of the squares of v independent standard normal variables, and
has a mean value equal to v. The figure below depicts five chi-square distributions.
Figure. Distributions depicted, from “tallest” to “shortest”: chi-square distributions with 1, 2, 3, 5, and
10 degrees of freedom, respectively.
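The defining property just stated (a chi-square variable with v degrees of freedom is the sum of v squared standard normal variables, and has mean v) can be checked by simulation; a minimal sketch:

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def chi_square_draw(v):
    """One draw from the chi-square distribution with v degrees of freedom:
    the sum of squares of v independent standard normal variables."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(v))

# The chi-square distribution with v d.f. has mean v; check by simulation
means = {}
for v in (1, 2, 3, 5, 10):
    draws = [chi_square_draw(v) for _ in range(20000)]
    means[v] = sum(draws) / len(draws)
    print(v, round(means[v], 2))
```

Each printed sample mean lands close to its v, as the theory predicts.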
PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT
(typically simply called correlation coefficient)
The correlation coefficient is a measure of the strength and direction of the linear association
between two quantitative variables X and Y. A correlation coefficient can be determined for an entire
population of paired (x,y) observations or for a sample of paired (x,y) observations. In essence, the
correlation coefficient measures the extent to which a unit increase in the value of X is associated with a
specific change (whether an increase or decrease) in the value of Y. ρ denotes the population correlation
coefficient and r denotes the sample correlation coefficient. Each of ρ and r falls in the interval [-1,1].
ρ = σxy / (σxσy), where σxy, called the population covariance of X and Y, is defined for a finite population
as σxy = Σ (xi - μx)(yi - μy) / N (sum over i = 1 to N),
with (xi,yi) denoting the ith paired observation in the population.
r = sxy / (sxsy), where sxy, called the sample covariance of X and Y, is defined
as sxy = Σ (xi - x̄)(yi - ȳ) / (n - 1) (sum over i = 1 to n),
with (xi,yi) denoting the ith paired observation in the sample.
Unlike the correlation coefficient, the covariance will change when the unit of measurement for X
or for Y is changed. For example, a change in unit from thousands of dollars to dollars would increase
the magnitude of (absolute value of) the covariance but leave the correlation coefficient unchanged;
conversely, a change in unit from inches to feet would decrease the magnitude of (absolute value of) the
covariance but leave the correlation coefficient unchanged.
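A short sketch of these definitions with made-up (x,y) data: rescaling X changes the covariance by the scale factor but leaves r untouched.

```python
import math

# Made-up paired sample (for illustration only)
x = [65.0, 70.0, 75.0, 80.0, 85.0]   # e.g., % flights on time
y = [1.2, 1.0, 0.7, 0.5, 0.4]        # e.g., complaint rate

def sample_cov(a, b):
    """Sample covariance: sum of (a_i - a_bar)(b_i - b_bar), divided by n - 1."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)

def sample_corr(a, b):
    """Sample correlation r = s_ab / (s_a * s_b)."""
    s_a = math.sqrt(sample_cov(a, a))   # sample standard deviation of a
    s_b = math.sqrt(sample_cov(b, b))
    return sample_cov(a, b) / (s_a * s_b)

r = sample_corr(x, y)
x_scaled = [xi / 100 for xi in x]       # change of unit: percent -> proportion
print(round(sample_cov(x, y), 4), round(sample_cov(x_scaled, y), 6))
print(round(r, 4), round(sample_corr(x_scaled, y), 4))
```

The two covariances differ by the factor of 100; the two correlations are identical.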
As an example of a sample correlation coefficient, consider, with respect to a sample of nine U.S.
airlines, 1998 data on the variables X (% flights on time) and Y (complaint rate). The sample of nine
(x,y) observations pictured in the scatter diagram below yielded r = -.88, which would be referred to as a
high negative correlation, and indicates (for the sample data) a strong inverse (or negative) linear
association between % flights on time and the complaint rate.
[Scatter diagram: 1998 data on a sample of U.S. airlines (r = -.88); horizontal axis: % of flights arriving
on time (65 to 85); vertical axis: no. of complaints (per 100,000 passengers), 0 to 1.4.]
As another example, consider, with respect to states in the mid-1990s, the variables X (spending per
pupil) and Y (mean composite score of students on the NAEP test). For a sample of 35 states, the (x,y)
observations pictured below yielded r = .34, a low positive correlation, indicating (for the sample data) a
weak direct (or positive) linear association between spending per pupil and average student performance.
[Scatter diagram: Mid-90s data on Spending & Student Achievement for 35 States (r = .34); horizontal
axis: Spending per Pupil ($3,000 to $9,000); vertical axis: Composite Score on NAEP test (550 to 700).]
REGRESSION ANALYSIS
Regression analysis deals with developing a statistical model relating some quantitative variable (typically
labeled Y, and called the dependent variable or criterion variable) to one or more other variables
(typically labeled X1, X2, etc., and called the independent variables or the predictor variables or the
explanatory variables), at least one of which is quantitative. Regression analysis has various purposes,
including:
(1) prediction, that is, to predict the value of some dependent variable given the value(s) of the
independent variable(s). For example, a chain of portrait studios located in medium-sized cities
and specializing in children’s portraits may want to predict annual sales of a studio in a city from
the number of people in the city 16 years of age or less and the per capita disposable personal
income of the city’s residents.
(2) estimating effects, that is, to estimate the impact of changes in the value of an independent
variable on the value of the dependent variable. For example, a real estate broker may wish to
estimate, for houses in a particular community, the impact of each additional bathroom (while
controlling for size of house, size of lot, number of bedrooms, and age of house) on the sales price.
(3) testing theories postulating particular relationships between a dependent variable and one or more
independent variables.
In simple regression, there is exactly one independent variable. In multiple regression, there are two or
more independent variables.
SIMPLE LINEAR REGRESSION MODEL
The (classical normal) simple linear regression model is expressed in the form
Y = ß0 + ß1X + ε,
where:
* Y and X are quantitative variables
* ß0 and ß1 are real numbers
* ε is called the residual term or error term
The model is referred to as linear in the parameters ß0 and ß1. Assumptions of the model include (with x
denoting any value in the presumed domain of X):
Across all entities with X = x:
1. E() = 0 ( E(Y) = ß0 + ß1x). This may be called the linearity assumption.
E(Y) = ß0 + ß1X is called the population regression equation (or population regression line). It
follows from this assumption that an entity with  > 0 has an above average value for Y given its
value of X, and an entity with  < 0 has a below average value for Y given its value of X.
2.  is normally distributed (Y is normally distributed). This may be called the normality
assumption.
3. VAR() = 2 (VAR(Y) = 2). This may be called the equal variances assumption. (Another
name for equal variances is homoskedasticity. A name for unequal variances is heteroskedasticity.)
For any two entities:
4. the respective ε’s are independent (⇒ the respective Y’s are independent). This may be called the
independence assumption.
The mnemonic LINE (L for linearity, I for independence, N for normality, and E for equal variances) can
assist in remembering these assumptions. It is also implicitly assumed that the model is correctly
specified (e.g., correct functional form; no missing independent variables).
SIMPLE LINEAR REGRESSION ANALYSIS
(delegating calculations to Excel)
Once a random sample of n entities has been drawn, the values of X and Y for those entities have been
determined, and an Excel printout containing the results of applying Excel’s REGRESSION tool to the
sample data has been obtained, all of the following activities related to a regression analysis can be
performed.
(a) Constructing a scatter diagram.
A scatter diagram is a plot of the (x,y) data points in the sample.
(b) Graphing (by hand) the sample regression equation on the scatter diagram.
The sample regression equation (which with only one independent variable may be called the least
squares line) is given by est(Y) = b0 + b1X, where est(Y) is verbalized as "estimated Y" and where b0
(the y-intercept) and b1 (the slope) are constants chosen so as to minimize Σ [y - est(y)]², where the
sum is taken across the n data points in the sample. Thus, the least squares line is the line for which
the sum of the squared vertical distances between the data points in the sample and the line is
minimized. The sample regression equation is an estimate of the population regression equation, E(Y)
= ß0 + ß1X.
Where do you find the values for b0 (the y-intercept) and b1 (the slope) on the Excel printout? In the
Coefficients column of the long narrow table, with the value of b0 provided next to the term Intercept
and the value of b1 provided next to the name of X. How do you graph the line by hand? After
writing down the equation of the line, pick any value for X and solve for est(Y). Then pick any other
value for X and solve for est(Y). Plot the two points and then draw a line through them.
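Although the packet delegates the calculations to Excel, b0 and b1 follow directly from the least-squares formulas b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1x̄. A sketch using the output/cost data from Practice Problem 2:

```python
# Output (tons) and cost ($1000's) data from Practice Problem 2
x = [1, 2, 4, 8, 6, 5, 8, 9, 7]
y = [2, 3, 4, 7, 6, 5, 8, 8, 6]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope: b1 = sum of (x - x_bar)(y - y_bar) over sum of (x - x_bar)^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar   # the least squares line passes through (x_bar, y_bar)

print(round(b0, 4), round(b1, 4))  # compare with the printout's Intercept and slope
```

The printed values agree with the Intercept and Output coefficients on the Excel printout for that problem.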
(c) Interpreting the standard error of the estimate (se).
The standard error of the estimate, se, is an estimate of the standard deviation in Y across all entities
having the same value of X. To interpret an se of 2.7, for example, one could say: Across all entities
having the same value of X, the standard deviation in Y is estimated to be 2.7 units. Where do you
find the standard error of the estimate on the Excel printout? In the SUMMARY OUTPUT table next
to the term Standard Error.
(d) Constructing a residual plot to informally assess whether model assumptions appear met.
Each entity in the sample has a sample residual score of e = y - est(y), where est(y) is determined by
substituting the entity's value for X into the sample regression equation and solving for est(y). e is an
estimate of the true residual term ε associated with an entity. The sample residuals (residual scores)
are provided on the Excel printout. A residual plot is a graph of all the (x, e) points for the entities in
the sample. This plot appears on the Excel printout.
The linearity assumption will appear to be met if, when you visually scan the plot from left to right,
the points appear to fluctuate about the horizontal 0 line.
The normality assumption will appear to be met if roughly 68% of the residuals are between -se and
se, and roughly 95% of the residuals are between –2se and 2se.
The equal variances assumption will appear to be met if, when you visually scan the plot from left to
right, the vertical scatter of the points about the horizontal 0 line remains roughly the same.
Checking the independence assumption requires a natural ordering of the data (e.g., by time, or on the
basis of a variable not included in the model). The independence assumption will appear to be met if,
in a residual plot with the residuals plotted from left to right in that natural ordering, there is no
systematic pattern in the points.
The confidence intervals and test statistics referred to below apply exactly only when all the
model assumptions are met. Should one or more assumptions appear violated, there are ways to
transform the data to try to correct the problem (but we will not be discussing those ways).
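The 68%/95% normality check described in (d) can be sketched numerically with the Practice Problem 2 data, taking the (rounded) least-squares estimates and standard error of the estimate from that problem's printout:

```python
# Data and printout estimates from Practice Problem 2 (values taken from the packet)
x = [1, 2, 4, 8, 6, 5, 8, 9, 7]
y = [2, 3, 4, 7, 6, 5, 8, 8, 6]
b0, b1 = 1.2679, 0.7518   # least-squares intercept and slope (rounded)
s_e = 0.3883              # standard error of the estimate (rounded)

# Sample residuals: e = y - est(y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Informal normality check: roughly 68% of residuals within +/- s_e,
# roughly 95% within +/- 2 s_e
within_1 = sum(abs(e) <= s_e for e in residuals) / len(residuals)
within_2 = sum(abs(e) <= 2 * s_e for e in residuals) / len(residuals)
print(round(within_1, 2), round(within_2, 2))
```

With only nine observations the proportions are coarse (7/9 and 9/9 here), which is why the check is informal.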
(e) Interpreting SST, SSR, SSE, and r2.
Each sum below is taken over all the data points in the sample. Thus, the four measures are sample
measures.
SST, called sum of squares total, measures the total variation in the y-values. SST = Σ (y - ȳ)²
SSR, called sum of squares regression, measures the variation in the y-values explained by the
sample regression equation. SSR = Σ [est(y) - ȳ]²
SSE, called sum of squares error, measures the variation in the y-values not explained by the
sample regression equation. SSE = Σ [y - est(y)]². Note: the values for b0 and b1 in the sample
regression equation are chosen so as to minimize SSE.
SST = SSR + SSE.
r2, called the sample coefficient of determination, measures the proportion (percentage) of the total
variation in the y-values that is explained by the sample regression equation (i.e., explained by
the variation in X). To interpret an r2 of .87, for example, one could say: 87% of the variation in the
y-values is explained by the sample regression equation. A high r2 is necessary, but not sufficient, for
being able to use the sample regression equation to predict Y from X with reasonable precision and
certainty. Where do you find r2 on the Excel printout? In the SUMMARY OUTPUT table next to the
term R Square.
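The four measures in (e) can be computed directly from their definitions; a sketch using the Practice Problem 2 data and the least-squares line from its printout:

```python
# Practice Problem 2 data and the printout's least-squares line
x = [1, 2, 4, 8, 6, 5, 8, 9, 7]
y = [2, 3, 4, 7, 6, 5, 8, 8, 6]
b0, b1 = 1.267857, 0.751786

n = len(y)
y_bar = sum(y) / n
est = [b0 + b1 * xi for xi in x]   # est(y) for each data point

sst = sum((yi - y_bar) ** 2 for yi in y)             # total variation
ssr = sum((ei - y_bar) ** 2 for ei in est)           # explained variation
sse = sum((yi - ei) ** 2 for yi, ei in zip(y, est))  # unexplained variation
r_sq = ssr / sst

# SST = SSR + SSE (up to rounding in b0 and b1)
print(round(sst, 3), round(ssr, 3), round(sse, 3), round(r_sq, 3))
```

The printed values match the SS column of the problem's ANOVA table and its R Square entry.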
(f) Testing for a linear relationship between Y and X.
contextual note: There is a family of what are called F-distributions, each with a certain number of
numerator degrees of freedom (d.f.) and a certain number of denominator degrees of freedom.
F-distributions are positively skewed. The definition of an F-distribution and a graph of four members
of the family may be found at http://www.itl.nist.gov/div898/handbook/eda/section3/eda3665.htm.
If the model assumptions are met, an appropriate test statistic for testing
H0: Y is not linearly related to X (i.e., ß1 = 0), versus
H1: Y is linearly related to X (i.e., ß1 ≠ 0)
is F = MSR/MSE, where MSR = SSR/1 and MSE = SSE/(n-2). MSE is called mean square error. It is
an estimate of σ², the variance of ε (same as variance of Y) across all entities having the same value of
X. MSR is called mean square regression. If H0 is true, the expected value of MSR is σ². However,
if H1 is true, the expected value of MSR will be greater than σ². For that reason, if Fcalculated is large
(what’s considered large depends on the sample size), H1 is supported. Consequently, the p-value is
calculated as p = P(F ≥ Fcalculated), where Fcalculated is the value of F calculated from the sample of (x,y)
data points. Where do you find Fcalculated and its associated p-value on the Excel printout? In the
ANOVA table. Fcalculated is given in the F column. The p-value is given in the Significance [of] F
column.
The theorem underlying the above choice of test statistic is: If the model assumptions are met and H0
is true (i.e., there is no linear relationship between Y and X, i.e., ß1 = 0), then the sampling distribution
of F = MSR/MSE is the F-distribution with 1 numerator d.f. and n-2 denominator d.f.
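A sketch of the F computation from the ANOVA sums of squares on the Practice Problem 2 printout:

```python
# ANOVA figures from the Practice Problem 2 printout
ssr, sse, n = 35.16687, 1.055357, 9

msr = ssr / 1         # mean square regression: SSR / 1 numerator d.f.
mse = sse / (n - 2)   # mean square error: SSE / (n - 2) denominator d.f.
f_calc = msr / mse    # compare with the F column of the ANOVA table
print(round(f_calc, 4))
# The p-value P(F >= f_calc) with (1, n-2) d.f. is the printout's Significance F
```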
(g) Obtaining a point estimate for the mean value of Y across all entities having X = x*.
To get a point estimate (single-number “best guess”) for the mean value of Y across all entities having
X = x*, substitute x* for X in the sample regression equation and solve for est(Y). (One should
also—though due to time constraints, we will not—obtain a confidence interval for the mean Y across
all entities having X = x*.)
(h) Obtaining a point estimate for the value of Y for a single entity having X = x*.
To get a point estimate (single-number “best guess”) for the value of Y for a single entity having X =
x*, substitute x* for X in the sample regression equation and solve for est(Y). (One should also—
though due to time constraints, we will not—obtain a confidence interval for the value of Y for a
single entity having X = x*.)
(i) Interpreting b1, the sample regression coefficient of X.
The sample regression coefficient of X--which in a simple regression context is the slope b1 of the
least squares line--is an estimate of the change in the mean Y with each additional unit increase in X.
More specifically, it is (the mean Y across all possible entities having X = x + 1) - (the mean Y across
all possible entities having X = x) for any x and x +1 in the domain of X under examination. To
interpret a coefficient of +3, for example, one could say “We estimate that as X increases by 1 unit,
the mean Y increases by 3 units.” To interpret a coefficient of -3, for example, one could say “We
estimate that as X increases by 1 unit, the mean Y decreases by 3 units.”
(j) Determining a confidence interval for ß1, the (true) change in the mean Y with each additional
unit increase in X.
When the model assumptions are met, this confidence interval is given by b1 ± t·sb1, where b1 is
the sample regression coefficient of X (i.e., the slope of the least squares line), t is from the
t-distribution with n-2 df, and sb1 (called the standard error of the coefficient) is an estimate (derived
from the sample data) of the standard deviation in the sample regression coefficient b1 (slope of least
squares line) over repeated sampling of n entities. A 95% confidence interval for ß1 is provided on the
Excel printout in the “Lower 95%” and “Upper 95%” columns and second row of the long narrow
table. To interpret a 95% confidence interval of (.5,.8), for example, one could say: We are 95%
confident that as X increases by 1 unit, the mean Y increases by between .5 and .8 units.
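A sketch of the interval b1 ± t·sb1 using the Practice Problem 2 printout values (the t critical value is taken from a t table for 7 d.f.):

```python
# Values from the Practice Problem 2 printout
b1 = 0.751786    # sample slope (regression coefficient of Output)
s_b1 = 0.049224  # standard error of the coefficient
t_crit = 2.3646  # upper .025 point of the t-distribution with n - 2 = 7 d.f.

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 4), round(upper, 4))  # compare with Excel's Lower/Upper 95%
```

The endpoints agree (to rounding) with the Lower 95% and Upper 95% columns of that printout.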
(k) Suggesting two additional independent variables which might enable one to better predict or
estimate Y.
This is a judgment call on your part. Any reasonable answer will be accepted.
Precautions associated with regression analysis
(1) Check the validity of the model assumptions before relying on procedures predicated on those
assumptions.
(2) Don't extrapolate, that is, do not attempt to predict or estimate the dependent variable for values of
the independent variables outside the region of values of the independent variables encompassed by
the sample data.
(3) Don't infer cause and effect from evidence of a linear relationship.
(4) Don't discard an outlier unless there is justification for doing so. (Outliers for any distribution are
"quite far"--say, 2.5 standard deviations or more--from the mean.)
Practice Problems for Test #3
1. A financial consultant wishes to determine whether, for firms in a given industry, there is a relationship
between firm asset size and capital structure. A random sample of 134 firms in the industry was
selected, and cross-classified as follows:
                                    firm asset size
capital structure                low    medium    high
debt less than or = equity        14      20       16
debt greater than equity          30      36       18
Does this sample data provide sufficient evidence to conclude that firm asset size and capital structure
are related?
2. An operations manager at a textile mill would like to be able to predict total monthly production costs
from the monthly output of textile. For a random sample of 9 months, the production costs
(PRODCOST, in $1000's) and output of textile (OUTPUT, in tons) were determined. On the next page
is a scatter diagram and results of an Excel-generated regression analysis on this data. In a testing
situation, you will be given such Excel-generated output.
(a) Superimpose on the scatter diagram a graph of the least squares line (i.e., the sample regression
equation) and label two points belonging to the least squares line.
(b) Assess from the residual plot whether each of the linearity, normality, and equal variances
assumptions appears to be met.
(c) Interpret each of SST, SSR, SSE, and r2 in the context of this problem.
Assume for the remaining questions that all model assumptions are satisfied.
(d) Test for a linear relationship between production cost and output.
(e) Interpret the sample regression coefficient of output (slope of the least squares line) in the context of
this problem.
(f) Interpret the standard error of the estimate in the context of this problem.
(g) What would you predict the total production cost to be for a month during which 5 tons of textile is
produced?
(h) Suggest two additional variables that might help in predicting monthly production costs.
3. For problem #2, state the assumptions of the simple linear regression model in the context of the
problem.
4. (a) What does a correlation coefficient measure? (b) What type of correlation would you expect
between, for operating managers, number of years of managerial experience and most recent job
performance rating?
Printout to accompany Practice Problem 2.
note: X = output (in tons) and Y = production costs (in $1000's)
Output   Cost
  1        2
  2        3
  4        4
  8        7
  6        6
  5        5
  8        8
  9        8
  7        6

[Scatter diagram: Production Cost (in $1000's) vs. Output (in tons); horizontal axis 0 to 10, vertical axis
0 to 8.]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.985324502
R Square            0.970864373
Adjusted R Square   0.966702141
Standard Error      0.388285084
Observations        9

ANOVA
             df    SS         MS         F          Significance F
Regression    1    35.16687   35.16687   233.2557   1.24E-06
Residual      7    1.055357   0.150765
Total         8    36.22222

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   1.267857143    0.302549         4.19058    0.004083   0.552442    1.983272
Output      0.751785714    0.049224         15.27271   1.24E-06   0.635389    0.868182
RESIDUAL OUTPUT

Observation   Predicted Cost   Residuals
1             2.019642857      -0.019643
2             2.771428571       0.228571
3             4.275            -0.275
4             7.282142857      -0.282143
5             5.778571429       0.221429
6             5.026785714      -0.026786
7             7.282142857       0.717857
8             8.033928571      -0.033929
9             6.530357143      -0.530357

[Residual plot: residuals vs. Output, vertical axis from -1.17 to 1.17.]
answers:
1.
Ho: firm asset size and capital structure are independent
H1: firm asset size and capital structure are not independent
2 =  (f-e)2/e = (14-16.4)²/16.4 + (20-20.9)²/20.9 + (16-12.7)²/12.7 +
(30-27.6)²/27.6 + (36-35.1)²/35.1 + (18-21.3)²/21.3 = 2.00
p = P(2  2.00); .10 < p < .90 [note: use the ²-distribution with 2 d.f.]
conclude firm asset size and capital structure are independent
2.
(a) The least squares line is est(Cost) = 1.268 + .752(Output). Two points belonging to that line are (2,2.8)
and (8,7.3). Plot, label, and draw a line through the 2 points you used.
(b) The linearity, normality, and equal variances assumptions appear met.
(c) SST represents the total variation in the production costs.
SSR represents the variation in the production costs explained by the sample regression equation.
SSE represents the variation in the production costs not explained by the sample regression equation.
Interpretation of r2 = .971: approximately 97.1% of the variation in the production costs is explained by
the sample regression equation
(d) Ho: production cost is NOT linearly related to output
H1: production cost is linearly related to output
F = MSR/MSE = 233.256
p = P(F ≥ 233.256) ≈ 0
very strong evidence that production cost is linearly related to output
(e) Interpretation of b1 = .752: We estimate that as output increases by 1 ton, the mean production cost
increases by .752 thousands of dollars (i.e., $752).
(f) Interpretation of se = .388: Across all months with the same output level, the standard deviation in the
production costs is estimated to be .388 thousands of dollars (i.e., $388)
(g) 5.028 thousands of dollars (i.e., $5,028)
(h) the type of production technology; the experience level of the workers
3. Across all months with the same output: E(ε) = 0, ε is normally distributed, and VAR(ε) = σ². For any
two months, the respective ε’s are independent.
4. (a) A correlation coefficient measures the strength and direction of linear association between two
variables. (b) a low to moderate positive correlation
Answers to Homework Problems in Chapters 11 and 3
Chapter 11
21. H0: The type of flight and type of ticket of persons traveling for business are independent.
H1: The type of flight and type of ticket of persons traveling for business are not independent
(which is the same thing as saying they are related).
χ² = Σ (f - e)²/e = 100.43. [The e’s are 35.59, 15.41, 150.73, 65.27, 455.68, and 197.32]
p = P(χ² ≥ 100.43)
p < .005 [Look at chi-square table with df = (3-1)(2-1) = 2.]
Very strong evidence that the type of flight and type of ticket of persons traveling for business are not
independent, i.e., are related.
22. H0: Brand loyalty and manufacturer are independent.
H1: Brand loyalty and manufacturer are not independent (which is the same thing as saying
they are related).
χ² = Σ (f - e)²/e = 7.36 [The e’s are 109.53, 66.13, 72.33, 155.47, 93.87, and 102.67]
p = P(χ² ≥ 7.36) ≈ .025 [Look at chi-square table with df = (2-1)(3-1) = 2.]
Evidence that brand loyalty and manufacturer are not independent, i.e., are related.
40. H0: Part quality and production shift are independent.
H1: Part quality and production shift are not independent
(which is the same thing as saying they are related).
χ² = Σ (f - e)²/e = 8.11. [The e’s are 368.44, 31.56, 276.33, 23.67, 184.22, and 15.78]
p = P(χ² ≥ 8.11) [Look at chi-square table with df = (3-1)(2-1) = 2.]
.01 < p < .025
Evidence that part quality and production shift are not independent, i.e., are related.
Chapter 3
50. r = -.91. For this sample of data on 10 midsize automobiles, there is a strong inverse linear
association between the driving speed and mileage (in miles per gallon).
[Scatter diagram: Data on Sample of 10 midsize automobiles; horizontal axis: speed (0 to 80); vertical
axis: mpg (0 to 40).]
52. r = .92. For this sample of 10 weeks, there is a strong direct linear association between the closing
price for the DJIA and the S&P 500.
[Scatter diagram: Closing Prices for Sample of Weeks in Feb, March, and April of 2000; horizontal axis:
DJIA (9500 to 11500); vertical axis: S&P 500 (1300 to 1550).]
69b. r = .93 which indicates that, for this sample of 20 U.S. cities, there is a strong direct linear
association between typical (median?) household income and typical (median?) home price.
[Scatter diagram: Data on Sample of 20 U.S. Cities (from Places Rated Almanac, 2000); horizontal axis:
Household Income (in $1000s), 0 to 150; vertical axis: Home Price (in $1000s), 0 to 250.]
Regression Homework Assignment (followed by answer key)
1. A clothes manufacturer wanted to assess the relationship between the annual maintenance cost (Y, in
dollars) and age (X, in years) of a particular variety of sewing machine. For a random sample of 14
machines of this type, maintenance records were examined to determine, for each machine, its
maintenance cost last year and its age at the beginning of last year. The data is given below:
Age   Cost
 8    118
 3     55
 1     21
 9    135
 5     75
 7    104
 5     83
 2     40
 1     29
 3     48
 6     85
 2     33
 6     95
 8    130
note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the
points on that line.
(c) Interpret—in the context of the problem—the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal
variances assumptions appear met.
(e) Interpret—in the context of the problem—each of SST, SSR, SSE, and r2.
(f) Test for a linear relationship between annual maintenance cost and age.
(g) Get a point estimate for the mean maintenance cost of all 7 year old machines.
(h) Get a point estimate for the annual maintenance cost of a 3 year old machine.
(i) Interpret—in the context of the problem—the sample regression coefficient of age (which is the slope
of the least squares line).
(j) Suggest two additional independent variables that could aid in predicting the annual maintenance cost
of a sewing machine.
Homework Problem #1. Sample data and Excel-generated output.
Age   Cost
 8    118
 3     55
 1     21
 9    135
 5     75
 7    104
 5     83
 2     40
 1     29
 3     48
 6     85
 2     33
 6     95
 8    130

[Scatter diagram: Cost (in dollars) vs. Age (in years); horizontal axis 0 to 10, vertical axis 0 to 160.]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9923999
R Square            0.9848576
Adjusted R Square   0.9835957
Standard Error      4.9002087
Observations        14

ANOVA
             df    SS         MS         F          Significance F
Regression    1    18740.78   18740.78   780.4743   2.74E-12
Residual     12    288.1445   24.01205
Total        13    19028.93

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   9.4955752      2.687911         3.532698   0.004126   3.639121    15.35203
Age         13.910029      0.497908         27.93697   2.74E-12   12.82518    14.99488
RESIDUAL OUTPUT

Observation   Predicted Cost   Residuals
 1            120.77581        -2.775811
 2             51.225664        3.774336
 3             23.405605       -2.405605
 4            134.68584         0.314159
 5             79.045723       -4.045723
 6            106.86578        -2.865782
 7             79.045723        3.954277
 8             37.315634        2.684366
 9             23.405605        5.594395
10             51.225664       -3.225664
11             92.955752       -7.955752
12             37.315634       -4.315634
13             92.955752        2.044248
14            120.77581         9.224189

[Residual plot: residuals vs. Age, vertical axis from -14.7 to 14.7.]
2. The owner of a large firm that manufactures furniture wishes to assess the relationship in the U.S.
between annual national expenditures on furniture (Y) and national personal disposable income (X).
Below are data, for 10 years and in billions of dollars, from the Economic Report of the President.
[source: Kenkel (1996)]
Expenditures
on Furniture   PDI
    20         350
    18         364
    22         385
    24         404
    26         438
    30         473
    30         511
    30         546
    38         591
    40         634
note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the
points on that line.
(c) Interpret—in the context of the problem—the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal
variances assumptions appear met.
(e) Interpret—in the context of the problem—each of SST, SSR, SSE, and r2.
(f) Test for a linear relationship between furniture expenditures and PDI.
(g) Get a point estimate for the mean furniture expenditure over years with a PDI of 425 billion dollars.
(h) Get a point estimate for the furniture expenditure in a single year for which the PDI is anticipated to be
500 billion dollars.
(i) Interpret—in the context of the problem—the sample regression coefficient of PDI, i.e., the slope of
the least squares line.
(j) Suggest two additional independent variables that could aid in predicting furniture expenditures.
Homework Problem #2. Sample data and Excel-generated output.
PDI   Expend
350     20
364     18
385     22
404     24
438     26
473     30
511     30
546     30
591     38
634     40

[Scatter diagram: Expenditures on Furniture (in $ billions), 0 to 50, vs. PDI (in $ billions), 0 to 800.]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9741882
R Square            0.9490426
Adjusted R Square   0.9426729
Standard Error      1.7405227
Observations        10

ANOVA
             df    SS         MS         F          Significance F
Regression    1    451.3646   451.3646   148.9938   1.88E-06
Residual      8    24.23535   3.029419
Total         9    475.6

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept   -5.977543      2.821428         -2.118623   0.066969   -12.48377   0.528687
PDI          0.0719283     0.005893         12.2063     1.88E-06    0.05834    0.085517
RESIDUAL OUTPUT

Observation   Predicted Expend   Residuals
 1            19.197372           0.802628
 2            20.204369          -2.204369
 3            21.714863           0.285137
 4            23.081502           0.918498
 5            25.527065           0.472935
 6            28.044556           1.955444
 7            30.777833          -0.777833
 8            33.295324          -3.295324
 9            36.532099           1.467901
10            39.625017           0.374983

[Residual plot: residuals vs. PDI, vertical axis from -5.22 to 5.22.]
3. “The Jones Rustproofing Company operates a chain of outlets in Chicago. The company rustproofs
automobiles. Management believes that the number of customers [Y] in a quarter of the year can be
predicted relatively accurately by using a linear regression model in which the explanatory variable is the
number of new automobile registrations [X] in Chicago in the previous quarter. The following data show
the number of customers in hundreds during the last eight quarters and the number of new car
registrations in thousands for each previous quarter.” (Kenkel, 1996)
# Customers   # New autos registered
7.1           14.4
8.2           17.1
6.3           11.9
9.1           20.2
8.7           17.0
6.4           14.0
5.2           11.1
8.1           15.2
note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the
points on that line.
(c) Interpret—in the context of the problem—the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal
variances assumptions appear met.
(e) Interpret—in the context of the problem—each of SST, SSR, SSE, and r2.
(f) Test for a linear relationship between number of customers and number of new autos registered.
(g) Get a point estimate for the mean number of customers across all quarters where 18,500 new autos
were registered the previous quarter.
(h) Get a point estimate for the number of customers next quarter if there were 16,800 new autos
registered this quarter.
(i) Interpret—in the context of the problem—the sample regression coefficient of number of new autos
registered, i.e., the slope of the least squares line.
(j) Suggest two additional independent variables which could aid in predicting the number of customers.
NewAutoRegs   Customers
14.4          7.1
17.1          8.2
11.9          6.3
20.2          9.1
17.0          8.7
14.0          6.4
11.1          5.2
15.2          8.1

[Scatter diagram: Number of Customers (x100) during quarter vs. New Auto Registrations (x1000) previous quarter]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9400942
R Square            0.88377711
Adjusted R Square   0.86440663
Standard Error      0.49888523
Observations        8

ANOVA
             df   SS         MS         F          Significance F
Regression    1   11.35543   11.35543   45.62494   0.000514
Residual      6   1.493319   0.248886
Total         7   12.84875

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     0.8973018      0.976908         0.918512   0.393776   -1.493107   3.287711
NewAutoRegs   0.42945894     0.06358          6.754624   0.000514   0.273884    0.585034
RESIDUAL OUTPUT

Observation   Predicted Customers   Residuals
1             7.08151051             0.018489
2             8.24104964            -0.04105
3             6.00786316             0.292137
4             9.57237235            -0.472372
5             8.19810375             0.501896
6             6.90972693            -0.509727
7             5.66429601            -0.464296
8             7.42507766             0.674922

[Residual plot: Residuals vs. NewAutoRegs]
4. To assess the effect of an organic fertilizer on tomato yield, differing amounts of organic fertilizer
were applied to 10 similar plots of land. The same number and variety of tomato seedlings were grown
on each plot under similar growing conditions. For each plot the amount of fertilizer (in pounds) and
yield (in pounds) of tomatoes throughout the growing season are given below:
Fertilizer   Yield
0            6
0            8
10           11
10           14
20           18
20           23
30           25
30           28
40           30
40           34
note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the
points on that line.
(c) Interpret—in the context of the problem—the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal
variances assumptions appear met.
(e) Interpret—in the context of the problem—each of SST, SSR, SSE, and r2.
(f) Test for a linear relationship between yield and amount of fertilizer.
(g) Get a point estimate for the yield of a plot where 35 pounds of fertilizer is to be used.
(h) Get a point estimate for the mean yield across all plots where 15 pounds of fertilizer is to be used.
(i) Interpret—in the context of the problem— the sample regression coefficient of fertilizer (which is the
slope of the least squares line).
Homework Problem #4. Sample data and Excel-generated output.

Fertilizer   Yield
0            6
0            8
10           11
10           14
20           18
20           23
30           25
30           28
40           30
40           34

[Scatter diagram: Yield (in pounds) vs. Fertilizer (in pounds)]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.97935605
R Square            0.95913827
Adjusted R Square   0.95403056
Standard Error      2.08865986
Observations        10

ANOVA
             df   SS      MS       F            Significance F
Regression    1   819.2   819.2    187.782235   7.75E-07
Residual      8   34.9    4.3625
Total         9   854.1

             Coefficients   Standard Error   t Stat     P-value       Lower 95%   Upper 95%
Intercept    6.9            1.14400612       6.031436   0.000312279   4.261915    9.538085
Fertilizer   0.64           0.04670385       13.70337   7.75086E-07   0.532301    0.747699

RESIDUAL OUTPUT

Observation   Predicted Yield   Residuals
1             6.9               -0.9
2             6.9                1.1
3             13.3              -2.3
4             13.3               0.7
5             19.7              -1.7
6             19.7               3.3
7             26.1              -1.1
8             26.1               1.9
9             32.5              -2.5
10            32.5               1.5

[Residual plot: Residuals vs. Fertilizer]
Answers to Regression Homework
note: Your answers may differ slightly from ours due to rounding.
1.
(a) The sample regression equation is est(Cost) = 9.4956 + 13.9100(Age).
(b) Two of the many points belonging to the line are (3,51.23) and (8,120.78).
(c) Across all machines of the same age, the standard deviation in the annual maintenance cost is
estimated to be $4.90.
(d) The linearity, normality, and equal variances assumptions appear met.
(e) SST represents the total variation in the maintenance costs.
SSR represents the variation in the maintenance costs explained by the sample regression equation.
SSE represents the variation in the maintenance costs not explained by the sample regression
equation.
Interpretation of r2 = .985: 98.5% of the variation in the maintenance costs is explained by the
sample regression equation.
(f) H0: annual maintenance cost is not linearly related to age (or β1 = 0)
H1: annual maintenance cost is linearly related to age (or β1 ≠ 0)
F = MSR/MSE = 780.4743
p = P(F ≥ 780.4743) ≈ 0
Very strong evidence that annual maintenance cost is linearly related to age.
(g) point estimate is 9.4956 + 13.9100(7) = $106.87
(h) point estimate is 9.4956 + 13.9100(3) = $51.23
(i) We estimate that as machine age increases by 1 year, the mean annual maintenance cost increases
by $13.91.
(j) Hours of usage; percentage of time used on heavy fabric.
2.
(a) The sample regression equation is est(Expend) = -5.9775 + .0719(PDI).
(b) Two of the many points belonging to the line are (400,22.78) and (600,37.16).
(c) We estimate that across all years with the same PDI level, the standard deviation in national expenditures
on furniture is 1.74 billion dollars.
(d) Based on the residual plot, the linearity, normality, and equal variances assumptions appear met.
(e) SST represents the total variation in the annual national furniture expenditures.
SSR represents the variation in the annual national furniture expenditures explained by the sample
regression equation.
SSE represents the variation in the annual national furniture expenditures not explained by the sample
regression equation.
Interpretation of r2 = .949: 94.9% of the variation in the annual national expenditures on furniture is
explained by the sample regression equation.
(f) H0: annual national expenditure on furniture is not linearly related to national PDI (or β1 = 0)
H1: annual national expenditure on furniture is linearly related to national PDI (or β1 ≠ 0)
F = MSR/MSE = 148.9938
p = P(F ≥ 148.9938) ≈ 0
Very strong evidence that annual national expenditure on furniture is linearly related to
national PDI.
(g) point estimate is -5.9775 + .0719(425) = 24.6 billion dollars
(h) point estimate is -5.9775 + .0719(500) = 30.0 billion dollars
(i) We estimate that as PDI increases by 1 billion dollars, the mean expenditure on furniture increases by
.072 billion dollars (or 72 million dollars).
(j) consumer price index (at the beginning of the year);
level of consumer confidence (at the beginning of the year)
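As a numerical check on answer 2, the slope, intercept, correlation coefficient, and r2 reported in the Excel output can be reproduced directly from the PDI/Expend data. The following is an illustrative Python sketch of that arithmetic, not part of the packet's Excel workflow:

```python
# Reproduce the Problem 2 regression "by hand" from the PDI/Expend data.
pdi    = [350, 364, 385, 404, 438, 473, 511, 546, 591, 634]  # x, in $ billions
expend = [20, 18, 22, 24, 26, 30, 30, 30, 38, 40]            # y, in $ billions

n = len(pdi)
xbar = sum(pdi) / n
ybar = sum(expend) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(pdi, expend))
sxx = sum((x - xbar) ** 2 for x in pdi)
syy = sum((y - ybar) ** 2 for y in expend)   # this is SST = 475.6

b1 = sxy / sxx            # slope, about .0719 (Excel: 0.0719283)
b0 = ybar - b1 * xbar     # intercept, about -5.98 (Excel: -5.977543)

r = sxy / (sxx * syy) ** 0.5   # about .974 (Excel "Multiple R")
r2 = r ** 2                    # about .949 (Excel "R Square")
```

The same arithmetic underlies every SUMMARY OUTPUT table in this unit; only the data change.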
3.
(a) The sample regression equation is est(Customers) = .8973 + .4295(NewAutoRegs).
(b) Two of the many points belonging to the line are (6.0,3.47) and (9.0,4.76).
(c) We estimate that across all quarters having the same number of new auto registrations the previous
quarter, the standard deviation in the number of customers is 50 (.50 hundred) customers.
(d) Based on the residual plot, the linearity, normality, and equal variances assumptions appear met.
(e) SST represents the total variation in the numbers of customers.
SSR represents the variation in the number of customers explained by the sample regression equation.
SSE represents the variation in the number of customers not explained by the sample regression
equation.
Interpretation of r2 = .884: 88.4% of the variation in the number of customers is explained by the
sample regression equation.
(f) H0: quarterly number of customers is not linearly related to the number of new auto registrations
the previous quarter (or β1 = 0)
H1: quarterly number of customers is linearly related to the number of new auto registrations the
previous quarter (or β1 ≠ 0)
F = MSR/MSE = 45.6249
p = P(F ≥ 45.6249) = .0005
Very strong evidence that the quarterly number of customers is linearly related to the
number of new auto registrations the previous quarter.
(g) point estimate is .8973 + .4295(18.5) = 8.8 hundred or 880 customers
(h) point estimate is .8973 + .4295(16.8) = 8.1 hundred or 810 customers
(i) We estimate that as the number of new auto registrations the previous quarter increases by 1 thousand,
the mean number of customers in a quarter increases by .43 hundred (i.e., 43) customers.
(j) mean dollar value of new autos registered the previous quarter;
whether or not it is a fourth or first quarter (and thus overlaps with winter)
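The point estimates in (g) and (h) are simple plug-ins into the sample regression equation. A minimal Python sketch, using the rounded coefficients from the answer above:

```python
# Point estimates for Problem 3: plug x* into the least squares line.
b0, b1 = 0.8973, 0.4295   # from est(Customers) = .8973 + .4295(NewAutoRegs)

def est_customers(new_auto_regs_thousands):
    """Predicted customers (in hundreds) for a quarter, given new auto
    registrations (in thousands) the previous quarter."""
    return b0 + b1 * new_auto_regs_thousands

g = est_customers(18.5)   # about 8.8 hundred, i.e. roughly 880 customers
h = est_customers(16.8)   # about 8.1 hundred, i.e. roughly 810 customers
```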
4.
(a) The sample regression equation is est(Yield) = 6.9 + .64(Fertilizer)
(b) Two points belonging to the equation/line are (0,6.9) and (40,32.5); if you plot those two points (or
any other two points belonging to the line) and draw a line through them, you will have a graph of the
least squares line.
(c) Across all plots with the same amount of fertilizer, the standard deviation in yield is estimated to be
2.1 pounds.
(d) All three assumptions appear met.
(e) SST: total variation in yield
SSR: variation in yield explained by the sample regression equation
SSE: variation in yield not explained by the sample regression equation
Interpretation of r2 = .959: ≈96% of the variation in yield is explained by the sample regression
equation.
(f) H0: yield is not linearly related to amount of fertilizer (or β1 = 0)
H1: yield is linearly related to amount of fertilizer (or β1 ≠ 0)
F = MSR/MSE = 187.782
p = P(F ≥ 187.782) = .0000007751 ≈ 0
Very strong evidence that yield is linearly related to amount of fertilizer.
(g) point estimate is 6.9 + .64(35) = 29.3 pounds
(h) point estimate is 6.9 + .64(15) = 16.5 pounds
(i) We estimate that as the amount of organic fertilizer increases by 1 pound, the mean yield increases by
.64 pounds.
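The ANOVA quantities in this answer (SST, SSR, SSE, F, and the standard error of the estimate) all follow mechanically from the Problem 4 data. A Python sketch of the sum-of-squares decomposition, offered as an illustration alongside the Excel output:

```python
# Sum-of-squares decomposition for Problem 4: SST = SSR + SSE.
fertilizer   = [0, 0, 10, 10, 20, 20, 30, 30, 40, 40]   # pounds applied
tomato_yield = [6, 8, 11, 14, 18, 23, 25, 28, 30, 34]   # pounds harvested

n = len(fertilizer)
xbar = sum(fertilizer) / n
ybar = sum(tomato_yield) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(fertilizer, tomato_yield))
sxx = sum((x - xbar) ** 2 for x in fertilizer)
b1 = sxy / sxx           # slope: .64
b0 = ybar - b1 * xbar    # intercept: 6.9

pred = [b0 + b1 * x for x in fertilizer]
sst = sum((y - ybar) ** 2 for y in tomato_yield)             # 854.1
sse = sum((y - p) ** 2 for y, p in zip(tomato_yield, pred))  # 34.9
ssr = sst - sse                                              # 819.2

f_stat = (ssr / 1) / (sse / (n - 2))   # MSR/MSE, about 187.78
se = (sse / (n - 2)) ** 0.5            # standard error of the estimate, about 2.09
```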
Unit Three Formula Sheet
Chi-square test of independence (p-value approach):

Hypotheses:     H0: X and Y are independent
                H1: X and Y are related

Test statistic: χ2 = Σ (f − e)2 / e
                note: f denotes the observed frequencies (the counts within the cross-
                classification, or contingency, table) and e denotes the expected [should H0
                be true] frequencies. For each “cell” of the table,
                e = (row total)(column total)/n, where n is the grand total.

p-value:        p = P(χ2 ≥ χ2calculated GIVEN H0 is true). Reference the chi-square
                distribution with (r − 1)(c − 1) df.

Conclusion:     p < .005           very strong evidence that H1 is true
                .005 ≤ p < .01     strong evidence that H1 is true
                .01 ≤ p < .05      evidence that H1 is true
                .05 ≤ p < .10      marginal evidence that H1 is true
                p ≥ .10            H0 may be true
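The test statistic on this formula sheet can be computed mechanically from any table of observed frequencies. A Python sketch using a small made-up 2x2 contingency table (the frequencies are hypothetical, chosen only to illustrate the arithmetic):

```python
# Chi-square test-of-independence statistic, following the formula sheet.
f = [[10, 20],   # observed frequencies (hypothetical 2x2 contingency table)
     [30, 40]]

row_totals = [sum(row) for row in f]
col_totals = [sum(col) for col in zip(*f)]
n = sum(row_totals)   # grand total

chi2 = 0.0
for i, row in enumerate(f):
    for j, observed in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected frequency under H0
        chi2 += (observed - e) ** 2 / e

df = (len(f) - 1) * (len(f[0]) - 1)   # (r - 1)(c - 1) degrees of freedom
```

The p-value is then the area to the right of chi2 under the chi-square distribution with df degrees of freedom, from a chi-square table or software.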
Covariance and Correlation Coefficient

Sample covariance:      sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1), summing over i = 1 to n
Population covariance:  σxy = Σ(xi − μx)(yi − μy) / N, summing over i = 1 to N
Correlation:            r = sxy / (sx sy)        ρ = σxy / (σx σy)
Simple Linear Regression Analysis:
Testing for a linear relationship between Y and X (p-value approach):
Hypotheses: H0: Y is not linearly related to X (or β1 = 0)
H1: Y is linearly related to X (or β1 ≠ 0)
Test statistic: F = MSR/MSE
p-value:
p = P(F ≥ Fcalculated GIVEN H0 is true)
Conclusion: see above
For your information (not responsible for):
Confidence Interval for the mean value of Y across all entities with X = x*:
est(y) ± t(se) √[ 1/n + (x* − x̄)2 / (Σx2 − n x̄2) ]
note: reference the t-distribution with n − 2 df

Confidence Interval for the value of Y for a single entity with X = x*:
est(y) ± t(se) √[ 1 + 1/n + (x* − x̄)2 / (Σx2 − n x̄2) ]
note: reference the t-distribution with n − 2 df
__________
note: In any hypothesis testing situation, reject H0 at significance level α if and only if
the p-value is < α.
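As an illustration of the two interval formulas above (not required material), here is a Python sketch applied to the Problem 4 fertilizer data at x* = 15. The t critical value 2.306 (t distribution with 8 df, 95% confidence) is supplied by hand:

```python
# 95% intervals for Problem 4 at x* = 15, using the formula-sheet expressions.
fertilizer = [0, 0, 10, 10, 20, 20, 30, 30, 40, 40]
n = len(fertilizer)
xbar = sum(fertilizer) / n
sxx = sum(x * x for x in fertilizer) - n * xbar ** 2   # = sum((x - xbar)^2)

x_star = 15
est_y = 6.9 + 0.64 * x_star   # point estimate: 16.5 pounds
se = 2.08865986               # standard error of the estimate (Excel output)
t = 2.306                     # t critical value with n - 2 = 8 df

# half-width of the interval for the mean yield across all plots with X = x*
hw_mean = t * se * (1 / n + (x_star - xbar) ** 2 / sxx) ** 0.5
# half-width of the interval for the yield of a single plot with X = x*
hw_single = t * se * (1 + 1 / n + (x_star - xbar) ** 2 / sxx) ** 0.5
```

Note, as the formulas suggest, that the interval for a single plot is much wider than the interval for the mean.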