(1.1) Yi=0+1Xi + i i=1,2,...n
(1) Yi is the value of the DV for the ith observation (2) 0 & 1
are parameters (or coeffs) (3) Xi is the value of the IV (predictor V) for the ith observation, and can be
viewed as a known constant (4) i is a random error term with mean E{i}=0 & variance 2{i}=2
; i & j are uncorrelated so that their covariance is 0 (i.e., {i, j}=0 for all i,j; i<j)
The normal error regr model is:
(1.24) Yi = 0 + 1Xi + i i=1,2,...n
 Yi is the value of the dependent var for the ith observation

0 and 1 are parameters (or coefficients)
 Xi is the value of the independent (predictor) var for the ith observation, and can be
viewed as a known constant
 i are independent N(0, 2) [normally distributed with mean 0 and variance 2]
 This is the same model as (1.1) except that it assumes i are normally distributed. As a
consequence, assumption that i are uncorrelated becomes assumption of independence.
 The assumption of normality of the errors is justified for 2 principal reasons: (1) if the
error terms i are the sum of many independent unmeasured factors, the central limit
theorem implies that the distribution of i approaches normality as the number of
factors becomes large. (2) Normality of i implies tests based on the Student t
distribution. These tests are not sensitive to small departures from normality. Tests
based on the normality assumption are therefore robust.

MAXIMUM LIKELIHOOD ESTIMATION:
MODULE 1 - SIMPLE LINEAR REGR MODEL (SLRM)
(1) Regr analysis: "a statistical methodology that utilizes the relation between 2 or more quantitative
vars. so that one var. [the dependent or response var.] can be predicted from the other, or others [the
independent or predictor vars.]."(2) A functional relation is exact, - a single value of the DV Y
corresponds to a value of X. (3) A statistical relation is not exact, so that Y is not uniquely determined
as a func of X. There is random scatter around a trend representing the relationship of Y w/ X. (4) The
regr func of Y on X represents the relationship of the mean of the probability distribution of Y as a
func of X. (The regr func need not be linear.) (5) In the most general form, the regr model formalizes
the two ingredients of a statistical relationship: (A) there is a probability distribution of Y for each
level of X (representing the scatter of points around the main trend) (B) the means of these probability
distributions vary in some systematic fashion as a func of X
SLRM - DISTRIBUTION OF ERRORS UNSPECIFIED
(1.1) Yi = β0 + β1Xi + εi    i = 1, 2, ..., n
(1) Yi is the value of the DV for the ith observation (2) β0 & β1 are parameters (or coeffs) (3) Xi is the value of the IV (predictor var) for the ith observation, and can be viewed as a known constant (4) εi is a random error term with mean E{εi} = 0 & variance σ²{εi} = σ²; εi & εj are uncorrelated so that their covariance is 0 (i.e., σ{εi, εj} = 0 for all i, j; i < j)
This model is (1) simple [only 1 independent var] (2) linear in the parameters (3) linear in the predictor var [X appears only as first power].
Properties of model 1.1: (1) Yi is a random var (RV), since it is a func of the random term εi (2) E{Yi} = β0 + β1Xi. Therefore the regr func is E{Y} = β0 + β1X (3) σ²{Yi} = σ²{β0 + β1Xi + εi} = σ²{εi} = σ², since the variance of the error εi is assumed the same regardless of the value of X (4) Yi & Yj are uncorrelated, since εi & εj are uncorrelated
The parameters β0 and β1 are the regr coefficients: (1) the slope β1 indicates the change in the mean of the probability distribution of Y (i.e., E{Y}) per unit increase in X (2) the intercept β0 corresponds to the mean of the probability distribution of Y (E{Y}) at X = 0 (only meaningful when the scope of the model includes X = 0)
ESTIMATION OF REGR FUNC BY THE METHOD OF OLS (Ordinary Least Squares)

The OLS method consists in finding estimates b0 and b1 of β0 and β1 which minimize the quantity Q, the sum of the squared deviations ei = Yi - (β0 + β1Xi) of Yi from its expected value:
(1.8) Q = Σi=1..n (Yi - β0 - β1Xi)²
• Q can be minimized by: (1) a numerical search using a "grid" of values for b0 and b1 (2) deriving an analytical solution, which is possible in this case. It can be shown that the b0 and b1 that minimize Q are the solutions of the simultaneous equations called the normal equations:
(1.9a) ΣYi = nb0 + b1ΣXi
(1.9b) ΣXiYi = b0ΣXi + b1ΣXi²
• One can calculate ΣYi, ΣXi, ΣXiYi, and ΣXi² from sample obs., and then solve the normal equations for b0 and b1 using the formulas below. Calculate b1 first, then b0:
b1 = Σ(Xi - X.)(Yi - Y.) / Σ(Xi - X.)²
b0 = Y. - b1X.
b0 & b1 are called "point estimates" of β0 & β1 (in contrast to "interval estimates" in connection w/ statistical inference)
• Properties of LS estimators: The Gauss-Markov theorem states that under the assumptions of the regr model 1.1, b0 and b1 are the BLUE. The G-M theorem holds even though the shape of the distribution of the errors is not specified (e.g., it need not be normal). (1) Best: the sampling distributions of b0 & b1 have minimum variance among all unbiased linear estimators (2) Linear: it can be shown that b0 & b1 are linear functions of the observations Yi (3) Unbiased: i.e., E{b0} = β0 & E{b1} = β1 (4) Estimator
ESTIMATED MEAN RESPONSE: The regr func E{Y} = β0 + β1X is estimated as
^Y = b0 + b1X
where ^Y (Y hat) is the estimated regr func at level X of the IV. (^Y is also called the "predictor" or "estimate" of Y.) One can show as an extension of the G-M theorem that ^Y is also the BLUE of E{Y}. ^Yi is called the "fitted value" of Y for observation i.
RESIDUALS: The ith residual ei is the difference between Yi & ^Yi: ei = Yi - ^Yi = Yi - b0 - b1Xi
ei is not the same as εi:
ei = Yi - ^Yi (where ^Yi is a known fitted value)
εi = Yi - E{Yi} (where E{Yi} is the unknown true expected value of Y)
Properties of fitted values & residuals:
• Σei = 0
• Σei² is minimum (by LS)
• ΣXiei = 0
• Σ^Yiei = 0
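To make the properties concrete, a short Python check (same hypothetical data as above; each sum should be 0 up to floating-point error, except Σei², which is the minimized criterion Q):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Yhat = b0 + b1 * X   # fitted values
e = Y - Yhat         # residuals
# sum(e), sum(X*e), sum(Yhat*e) should all be ~0; sum(e**2) is the minimized Q
print(np.sum(e), np.sum(e**2), np.sum(X * e), np.sum(Yhat * e))
```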
ANALYSIS OF VARIANCE (ANOVA) INTERPRETATION

Decomposition of the total deviation: the total deviation of Yi from the mean (the total variation in Yi) can be decomposed into two components:
Yi - Y.  =  (^Yi - Y.)  +  (Yi - ^Yi)
total deviation of Yi from mean = deviation of fitted value from mean + deviation of Yi from fitted value
One defines sums of squares corresponding to each deviation:
Symbol:   SSTO                 =   SSR                  +   SSE
Formula:  Σ(Yi - Y.)²          =   Σ(^Yi - Y.)²         +   Σ(Yi - ^Yi)²
Name:     total sum of squares     regr sum of squares      error (or residual) sum of squares
Meaning:  total variation in Y     variation in Y           variation in Y around regr line
                                   "accounted for" by
                                   regr line
This is actually a remarkable and non-obvious property that must be proven!
Alternative computational formulas are:
SSTO = ΣYi² - nY.²
SSR = b1²Σ(Xi - X.)²

Degrees of Freedom (df): To each sum of squares correspond df. Df are additive:
(n - 1)       =   1            +   (n - 2)
df for SSTO       df for SSR       df for SSE
For SSTO, 1 df is lost estimating Y. (cf. the sample variance s²); SSR has 1 df, corresponding to b1; for SSE, 2 df are lost estimating b0 & b1.

Mean squares are the sums of squares divided by their respective df (mean squares are not additive):
s²(Y) = SSTO/(n - 1)   sample variance of Y
MSR = SSR/1            regr mean square
MSE = SSE/(n - 2)      error mean square
MSE is an estimator of σ², the variance of ε. (This makes sense since MSE is the sum of the squared residuals divided by the df of this sum, n - 2.) It can be shown that E{MSE} = σ².
SQRT(MSE), the standard error of estimate, is an estimator of σ, the standard deviation of ε.
ANOVA Table for Simple Linear Regr:
Source of variation   Sum of Squares        df     Mean Square        E{MS}
Regr                  SSR = Σ(^Yi - Y.)²    1      MSR = SSR/1        σ² + β1²Σ(Xi - X.)²
Error                 SSE = Σ(Yi - ^Yi)²    n-2    MSE = SSE/(n-2)    σ²
Total                 SSTO = Σ(Yi - Y.)²    n-1
DESCRIPTIVE MEASURES OF ASSOCIATION

Coeff of (Simple or Multiple) Determination or "R-square":
r² = (SSTO - SSE)/SSTO = SSR/SSTO = 1 - SSE/SSTO, where 0 <= r² <= 1
Limiting cases: (1) if all observations lie on the regr line, then SSE = 0 & r² = 1 (2) if the slope b1 = 0 then SSR = 0 & r² = 0 (3) r² can be interpreted as the proportion of the variation in Y "explained" by the regr model. (But "variation" refers to the sum of squared deviations, so variation is not measured in linear units; the interpretation of r² is not as intuitive as it may seem.)
• Coeff of (Simple or Multiple) Correlation:
r = ± SQRT(r²) = Σ(Xi - X.)(Yi - Y.) / SQRT(Σ(Xi - X.)² Σ(Yi - Y.)²)
The sign of r is that of b1. When r² is not 0 or 1, |r| > r², so that the psychological impact (or propaganda value) of r is stronger than that of r².
• The Standardized Regr Coeff b1* is calculated as b1* = b1(sX/sY) (= r, in simple linear regr only). Conversely: b1 = b1*(sY/sX) (= r(sY/sX), in simple linear regr only), where sX = SQRT(s²{X}) & sY = SQRT(s²{Y}) are the sample standard deviations of X & Y respectively. b1* indicates the change in the mean of the probability distribution of Y (i.e., E{Y}), measured in standard deviations of Y, associated w/ an increase in X of one standard deviation of X. (In multiple regr models, the standardized coeffs permit comparisons of the relative effects of different IVs.)
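A short Python illustration of r², r, and the standardized coeff b1* (hypothetical data; in simple linear regr b1* should equal r):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
Yhat = (Y.mean() - b1 * X.mean()) + b1 * X

SSTO = np.sum((Y - Y.mean()) ** 2)
SSE = np.sum((Y - Yhat) ** 2)
r2 = 1 - SSE / SSTO                            # coeff of determination
r = np.sign(b1) * np.sqrt(r2)                  # sign of r is the sign of b1
b1_star = b1 * X.std(ddof=1) / Y.std(ddof=1)   # standardized coeff (= r here)
print(r2, r, b1_star)
```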
[M2] NORMAL ERROR REGR MODEL & STAT. INFERENCE
NORMAL ERROR REGR MODEL
• The normal error regr model is:
(1.24) Yi = β0 + β1Xi + εi    i = 1, 2, ..., n
• Yi is the value of the dependent var for the ith observation
• β0 and β1 are parameters (or coefficients)
• Xi is the value of the independent (predictor) var for the ith observation, and can be viewed as a known constant
• the εi are independent N(0, σ²) [normally distributed with mean 0 and variance σ²]
• This is the same model as (1.1) except that it assumes that the εi are normally distributed. As a consequence, the assumption that the εi are uncorrelated becomes an assumption of independence.
• The assumption of normality of the errors is justified for 2 principal reasons: (1) if the error terms εi are the sum of many independent unmeasured factors, the central limit theorem implies that the distribution of εi approaches normality as the number of factors becomes large. (2) Normality of εi implies tests based on the Student t distribution. These tests are not sensitive to small departures from normality; tests based on the normality assumption are therefore robust.
MAXIMUM LIKELIHOOD ESTIMATION
• When the functional form of the probability distribution of εi is specified (in this case it is assumed normal), one can estimate the parameters β0, β1 and σ² by the maximum likelihood method. In essence, the ML method "chooses as estimates those values of the parameters that are most consistent with the sample data".
• ML Estimation of the Normal Regr Model: The ML estimates of β0, β1 are the values that maximize the likelihood func L. (NOTE: maximum likelihood derivations often use the logarithm (base e) of L rather than L itself.)
Parameter   ML Estimator     Remark
β0          ^β0 = b0         same as OLS
β1          ^β1 = b1         same as OLS
σ²          ^σ² = SSE/n      smaller than the OLS-based estimator (MSE = SSE/(n-2))
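A numerical sketch of ML estimation for the normal regr model (Python with scipy assumed; hypothetical data). Maximizing the log-likelihood should reproduce the OLS b0 and b1 and give ^σ² = SSE/n:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

def neg_log_lik(theta):
    # theta = (beta0, beta1, log sigma); log-parameterization keeps sigma > 0
    b0, b1, log_s = theta
    s2 = np.exp(2 * log_s)
    resid = Y - b0 - b1 * X
    return 0.5 * len(Y) * np.log(2 * np.pi * s2) + np.sum(resid ** 2) / (2 * s2)

fit = minimize(neg_log_lik, x0=np.array([0.0, 1.0, 0.0]), method="Nelder-Mead")
b0_ml, b1_ml, log_s = fit.x
print(b0_ml, b1_ml, np.exp(2 * log_s))  # sigma^2_ML = SSE/n, not SSE/(n-2)
```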
1. INFERENCE FOR β1: Inference concerning β1 (i.e., testing hypotheses and constructing confidence intervals) is based on the sampling distribution of b1. The sampling distribution of b1 "refers to the different values of b1 that would be obtained with repeated sampling when the levels of the independent variable X are held constant from sample to sample." (NKNW p. 45)
Sampling Distribution of b1
STEP 1: Sampling Distribution of b1: For the normal error regr model, the sampling distribution of b1 has the properties
Mean: E{b1} = β1;
Variance: σ²{b1} = σ² / Σ(Xi - X.)²;
Functional form: normal, so that b1 ~ N(β1, σ²{b1}) (~ means "is distributed as")
where σ² denotes the variance of εi. These properties follow from the fact that b1 is a linear combination of the observations Yi. The sampling distribution of b1 is normal because the Yi are normally distributed, so that b1 is a linear combination of independent normal RVs, which is normally distributed by Theorem A.40 in Appendix A. (The quantity Σ(Xi - X.)² plays the same role as n in the sampling distribution of the sample mean. The greater the variance of X, the smaller the variance σ²{b1} of the sampling distribution, and the more precise the estimation of β1. In experimental research one may be able to space the X values optimally to achieve the smallest σ²{b1} possible.)
STEP 2: Sampling Distribution of (b1 - β1)/σ{b1}: (b1 - β1)/σ{b1} ~ N(0, 1). This follows from the sampling distribution of b1 by the equivalent of the z transformation z = (X - μ)/σ.
STEP 3: Sampling Distribution of (b1 - β1)/s{b1}: In practice, σ is not known, so one replaces it with the sample estimate s{b1}, the square root of
s²{b1} = MSE / Σ(Xi - X.)²
• When σ{b1} is replaced by the sample estimate s{b1}, the sampling distribution is no longer normal. It becomes a Student t distribution with (n - 2) df (because 2 parameters were estimated):
(2.10) (b1 - β1)/s{b1} ~ t(n - 2)
• (This is because s{b1} is a RV rather than a fixed constant. The "extra randomness" is manifested in the thicker tails of the t distribution as compared to the normal. This is the same phenomenon as for the sampling distribution of the sample mean.)
2. Hypothesis Testing for β1
Two-sided t test:
H0: β1 = 0 ("null hypothesis")
H1: β1 ≠ 0 ("research hypothesis")
t* = (b1 - 0)/s{b1} = b1/s{b1}
if |t*| <= t(1-α/2; n-2), conclude H0;
if |t*| > t(1-α/2; n-2), conclude H1
• EX: in FLE, choose α = .05, b1 = 0.377, s{b1} = 0.014, t* = 0.377/0.014 = 26.408. t(0.975; 129) = 1.960 (from Table B.2). |t*| = 26.408 > 1.96, therefore conclude H1 (β1 ≠ 0): "There is a sig linear association between female life expectancy and the literacy rate", or "the coeff of literacy is sig different from 0".
• In practice one uses the 2-tailed (aka "2-sided") P-value associated with b1. The 2-tailed P-value is P{|t(n - 2)| > t* = 26.408} ~= .000. One concludes H1 if the 2-tailed P-value shown on the regr printout is less than α, H0 otherwise. Here the P-value is 0.000 < .05, so conclude H1 (β1 ≠ 0).
One-sided test:
H0: β1 <= 0
H1: β1 > 0
t* = (b1 - 0)/s{b1} = b1/s{b1}
if t* <= t(1-α; n-2), conclude H0;
if t* > t(1-α; n-2), conclude H1
Hint: Always write down H1 (the "research hypothesis") first; then H0 is the complement of H1.
• EX: choose α = .05; t(1-α; n-2) = t(0.95; 129) = 1.645; t* = 26.408 > 1.645, therefore conclude H1 (β1 > 0): "The coeff of the literacy rate is sig greater than 0".
• In practice one calculates the 1-tailed (aka "1-sided") P-value associated with b1 by dividing the (2-tailed) P-value shown on the regr printout by 2. If the 1-tailed P-value is less than .05 one concludes H1 (β1 > 0). Therefore, a 1-tailed test is "easier" (i.e., more likely to turn up "significant") than a 2-sided test! It is legitimate to use a 1-tailed test whenever one has a genuine directional hypothesis concerning β1. However, some puritanical statisticians frown at the use of 1-sided tests.
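The t tests above, sketched in Python with scipy (using the b1, s{b1}, and df quoted in the FLE example):

```python
from scipy import stats

# Numbers from the FLE example in the text
b1, s_b1, df = 0.377, 0.014, 129
t_star = b1 / s_b1                        # t* = b1 / s{b1}

p_two = 2 * stats.t.sf(abs(t_star), df)   # 2-tailed P-value
p_one = stats.t.sf(t_star, df)            # 1-tailed P-value (H1: beta1 > 0)
crit = stats.t.ppf(0.975, df)             # t(0.975; 129) ~= 1.96
print(t_star, crit, p_two, p_one)
```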
3. Confidence Interval (CI) for β1: b1 ± t(1-α/2; n-2)s{b1}
• Confidence intervals for β1 are again based on the sampling distribution
(2.10) (b1 - β1)/s{b1} ~ t(n - 2)
It follows that P{t(α/2; n-2) <= (b1 - β1)/s{b1} <= t(1-α/2; n-2)} = 1 - α, where 1 - α is the confidence coeff. From this one can derive the (1-α) confidence interval b1 ± t(1-α/2; n-2)s{b1}
• EX: find the .95 CI for β1 in the female life expectancy example. t(1-α/2; n-2) = t(.975; 129) = 1.96. Therefore the CI is 0.377 ± (1.96)(0.014), or (0.350, 0.404).
• Relation of CI with Hypothesis Tests: verifying that the (1-α) CI for b1 does not include the value 0 is equivalent to finding b1 sig in a direct 2-sided t test of b1.
• INFERENCE FOR β0: This is done in exactly the same way as for β1.
• F TEST OF ENTIRE REGR MODEL: One can show that
E{MSE} = σ²    E{MSR} = σ² + β1²Σ(Xi - X.)²
Note that if β1 = 0, E{MSR} = E{MSE} = σ². Therefore, an alternative to the 2-sided hypothesis test for β1 (H0: β1 = 0; H1: β1 ≠ 0) is the F test F* = MSR/MSE. If β1 = 0, E{F*} ~= σ²/σ² = 1; if β1 ≠ 0, E{F*} > 1. Thus the larger F*, the more likely that β1 ≠ 0. If H0 holds it can be shown that F* ~ (χ²(1)/1)/(χ²(n-2)/(n-2)) = F(1; n-2). The decision rule is
if F* <= F(1-α; 1, n-2), conclude H0
if F* > F(1-α; 1, n-2), conclude H1
EX: F* = MSR/MSE = 19,674.177/28.212 = 697.363. From the Table, F(.95; 1, 129) = 3.84. F* = 697.363 > 3.84, so one concludes that b1 is (rather massively) significant.
• In this case the F test adds nothing to the t test of b1. But in a multiple regr model the F test tests the significance of all the regr coefficients simultaneously. The only difference is that in the multiple regr case F* ~ F(p-1; n-p), where p-1 is the number of independent variables other than the intercept.
• Equivalence of t Test and F Test: In the simple linear regr model, F* = (t*)². EX: in the FLE example, (26.408)² = 697.382 ~= 697.363. This is not the case in the multiple regr model.
• INTERVAL ESTIMATION (CI) FOR THE MEAN RESPONSE E{Yh}
• For a given level Xh of X (which does not necessarily correspond to a data point), E{Yh} is estimated as
^Yh = b0 + b1Xh
• The Sampling Distribution of ^Yh has properties (1) Mean: E{^Yh} = E{Yh} (unbiasedness) (2) Variance: σ²{^Yh} = σ²((1/n) + (Xh - X.)²/Σ(Xi - X.)²) (3) Functional form: ^Yh ~ N (because ^Yh is a linear combination of the Yi)
• The variance of ^Yh depends on the distance of Xh from the mean X. of X. This is because, given a change in b1, the change in ^Yh is larger further away from the mean. It follows that
(^Yh - E{Yh})/σ{^Yh} ~ N(0, 1)
• In practice, one estimates the unknown σ²{^Yh} as
s²{^Yh} = MSE((1/n) + (Xh - X.)²/Σ(Xi - X.)²), so that (^Yh - E{Yh})/s{^Yh} ~ t(n-2)
• CI for a Single ^Yh: the CI for ^Yh is
^Yh ± t(1-α/2; n-2)s{^Yh}
• EX: To calc. the CI one needs info not contained in the standard regr output. X. = 59.366; 1/Σ(Xi - X.)² can be estimated as 1/(s²{X}(n-1)) = 1/((1064.572)(130)) = 7.225728×10⁻⁶. Choose Xh = 70% literacy. Then ^Yh = 62.602 & s{^Yh} = 0.48831. The 95% CI is 62.602 ± (1.96)(0.48831) = (61.645, 63.559)
• CI for Several ^Yh and the Working-Hotelling Confidence Band: Calculating CIs for several ^Yh creates problems of multiple tests, necessitating the use of Bonferroni-type techniques. Alternatively, the W-H approach can be used to construct a confidence band for the entire regr line. The Working-Hotelling 1-α confidence band for the regr line at any level Xh has the upper and lower values
^Yh ± Ws{^Yh}, where W² = 2F(1-α; 2, n-2)
• It can be shown that W is slightly wider than the corresponding t(1-α/2; n-2) of the formula for a single Xh. The W-H band can be used for any number of Xh levels.
• PREDICTION INTERVAL (PI) FOR A NEW OBSERVATION Yh(new)
• PIs are esp important for quality control in an industrial context. Only the general principle is important for us. (EX: predict the life expectancy of a specific country not in the sample from the estimated regr on literacy)
• Variation in Yh(new) is affected by two sources of variation: (1) variation in the estimated mean of the distribution of Y given Xh and (2) variation within the probability distribution of Y around its mean given Xh. Therefore,
σ²{Yh(new)} = σ² + σ²{^Yh}, estimated as
s²{Yh(new)} = MSE + s²{^Yh}
• Thus the estimated variance of the predicted Yh(new) is
s²{Yh(new)} = MSE(1 + (1/n) + (Xh - X.)²/Σ(Xi - X.)²)
as compared with
s²{^Yh} = MSE((1/n) + (Xh - X.)²/Σ(Xi - X.)²)
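A Python sketch contrasting the CI for the mean response with the PI for a new observation at the same Xh (hypothetical data; the PI is wider because of the extra MSE term):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(Y)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)

Xh = 3.5                                   # level of X (need not be a data point)
Yh_hat = b0 + b1 * Xh
s2_mean = MSE * (1/n + (Xh - X.mean())**2 / np.sum((X - X.mean())**2))
s2_pred = MSE + s2_mean                    # extra MSE term for a new observation
t = stats.t.ppf(0.975, n - 2)
print("CI:", Yh_hat - t * np.sqrt(s2_mean), Yh_hat + t * np.sqrt(s2_mean))
print("PI:", Yh_hat - t * np.sqrt(s2_pred), Yh_hat + t * np.sqrt(s2_pred))
```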
MODULE 3 - DIAGNOSTICS & REMEDIES
PROBLEMS TO BE INVESTIGATED WITH RESIDUAL ANALYSIS
"Problems" are departures from the assumptions of the regr model. Such as: (1) Regr func is not linear
(2) Error terms do not have constant variance (3) Error terms are not independent (4) Model fits all but
one or a few outlier observations (5) Error terms are not normally distributed (6) One or several
important predictor variables have been omitted from model. These problems also affect multiple regr
models. Most of the diagnostic tools used for simple linear regr are also used with multiple regr.
PLOTS OF RESIDUALS USED FOR DIAGNOSTICS

Plots typically can reveal more than one problem, so there is no one-to-one correspondence between plot types and the problems listed above. (1) Residual e by predictor var X, (2) Residual e by fitted value (or "estimate") ^Y: RESIDUAL/ESTIMATE, STUDENT/ESTIMATE, STUDENT/RESIDUAL, (3) Absolute or squared residual e by predictor var X, (4) Residual e by time or other sequence, (5) Residual e by omitted predictor var Z, (6) Box plot, stem-and-leaf, & other plots of the distribution of residual e, (7) Normal probability plot of residual e.
• The normal probability plot (7) is used to examine whether the residuals are normally distributed. It is a plot of the residuals against their expected values assuming normality. The expected values of the residuals are obtained by first sorting them in increasing order of magnitude. Then
EXPECTEDk = z((k - 0.5)/n) × SQRT(MSE)    for k = 1, ..., n
where k is the rank of the residual & z((k - 0.5)/n) is the 100·((k - 0.5)/n)th percentile of the standard normal distribution N(0,1). When the residuals are normally distributed, the plot is approximately a straight line. (A) Especially for relatively small samples, experience is needed to assess visually the linearity of the plot. A formal correlation test (explained later) can be used. Alternatively, Wilkinson, Blank, & Gruber (1996) suggest creating probability plots for 10 random samples of size n from a normal population, & judging visually whether the empirical plot 'fits' w/in this collection. (B) Some computer implementations of the normal probability plot do not multiply by SQRT(MSE) in calculating the expected residuals, which does not affect the linearity of the plot; the expected residuals are then z scores.
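A sketch of the expected-residuals calculation in Python (scipy assumed; the residual vector and MSE are hypothetical). The correlation between the sorted residuals and their expected values is the basis of the test described next:

```python
import numpy as np
from scipy import stats

def expected_residuals(e, MSE):
    # Sort residuals; rank k gets the 100*(k - 0.5)/n percentile of N(0,1),
    # scaled by sqrt(MSE) (some packages omit the scaling; linearity is unaffected)
    e = np.sort(e)
    n = len(e)
    k = np.arange(1, n + 1)
    return e, stats.norm.ppf((k - 0.5) / n) * np.sqrt(MSE)

e = np.array([0.3, -0.5, 0.1, -0.2, 0.6, -0.3])    # residuals from some fit
e_sorted, expected = expected_residuals(e, MSE=0.2)
r = np.corrcoef(e_sorted, expected)[0, 1]           # basis of the correlation test
print(expected, r)
```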
DIAGNOSTIC TESTS FOR RESIDUALS
1. Correlation Test for Normality is based on the normal probability plot. It is the correlation between the residuals ei and their expected values under normality. The higher the correlation, the "straighter" the normal probability plot, and the more likely that the residuals are normally distributed. The value of the correlation coeff can be tested, given the sample size and the α-level chosen, by comparing it with the critical value in Table B.6 in NKNW. A coeff larger than the critical value supports the conclusion that the error terms are normally distributed. EX: For the Yule data, the correlation between RESPRED and RESIDUAL corresponding to the normal probability plot is .981 with n = 32. Table B.6 gives a critical value for n = 30 of .964 with α = .05. Since r = .981 > .964, we have no reason to reject the hypothesis that the residuals are normally distributed.
2. Tests for Constancy of σ² (Homoskedasticity): (1) homoskedasticity: σ² is constant over the entire range of X (2) heteroskedasticity: σ² is not constant over the entire range of X

Two formal tests of constancy of the error variance are commonly used: (1) the Modified Levene Test (2) the Breusch-Pagan Test
• I only present the Breusch-Pagan test. This is a large-sample test that assumes that the logarithm of the variance σ²i of the error term εi is a linear func of X. The B-P test statistic is the quantity
χ²BP = (SSR*/2) / (SSE/n)²
where SSR* is the regr sum of squares of the regr of e² on X and SSE is the error sum of squares of the regr of Y on X. When n is sufficiently large and σ² is constant, χ²BP is distributed as a chi-square distribution with 1 df. Large values of χ²BP lead to the conclusion that σ² is not constant.
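A sketch of the B-P statistic in Python, following the two-regression definition above (scipy assumed; simple one-predictor case):

```python
import numpy as np
from scipy import stats

def breusch_pagan(X, Y):
    n = len(Y)
    # Main regr of Y on X: residuals and SSE
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    e = Y - (Y.mean() - b1 * X.mean()) - b1 * X
    SSE = np.sum(e ** 2)
    # Auxiliary regr of e^2 on X: its regr sum of squares SSR*
    e2 = e ** 2
    g1 = np.sum((X - X.mean()) * (e2 - e2.mean())) / np.sum((X - X.mean()) ** 2)
    fitted = (e2.mean() - g1 * X.mean()) + g1 * X
    SSR_star = np.sum((fitted - e2.mean()) ** 2)
    chi2_bp = (SSR_star / 2) / (SSE / n) ** 2       # B-P statistic as defined above
    return chi2_bp, stats.chi2.sf(chi2_bp, df=1)    # large values -> non-constant variance
```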
3. Linearity (or "Lack of Fit") Test: This test, which checks for linearity of the regr func, requires repeat observations (called replications) at one or more X levels, so it cannot be performed with all data sets. In essence the lack-of-fit test tests whether the means of Y for the groups of replicates are sig different from the fitted values ^Y on the regr line, using a kind of F test.
REMEDIAL MEASURES:
• Data Transformations
1. Standardizing a var: Common var standardizations are the z-score and the range
standardization. These transformations do not affect the shape of the distribution of the var
(contrary to some popular beliefs). The standardized value Z of Y is obtained by the formula
Z = (Y - Y.) / sY
where Y. and sY denote the sample mean and
standard deviation of Y, respectively. The range-standardized value W of Y is obtained by
the formula
W = (Y - YMIN) / (YMAX - YMIN)
where YMAX and YMIN denote
the maximum and minimum values of Y
2. Transforming a var to look normally distributed: Transformations can be used to remedy non-normality of the error term. These transformations often also take care of unequal error variance, which tends to go together w/ non-normality of the error term. Transformations of Y are appropriate when the error variance does not appear constant, as in the following situations:
• Radical normalization with the rankit transformation: It transforms any distribution into a normal one. It is the same as "grading on the curve".
• Tukey's ladder of powers: Tukey (1977) has proposed a "ladder of powers" corresponding to a continuum of transformations to normalize the distribution of a sample:
(Y^p, ..., Y², Y¹, Y^1/2, LOG(Y), -Y^-1/2, -Y^-1, -Y^-2, ..., -Y^-p)
• Automatic choice of a ladder-of-powers transformation with the Box-Cox procedure: The family of power transformations in Tukey's "ladder" is of the form Y' = Y^λ, where λ is the power parameter. The Box-Cox procedure estimates the parameter λ by maximum likelihood together with the parameters of the regr model. The complete model is thus
Yi^λ = β0 + β1Xi + εi
The estimates of λ, β0, β1 and σ² can be found by maximizing the likelihood func. Alternatively, one can do a numerical search over a range of values for λ.
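A sketch of the numerical-search variant in Python (hypothetical setup; the profile log-likelihood used here, -n/2·log(SSE/n) + (λ-1)·Σlog Yi, is the standard Box-Cox form and requires Y > 0):

```python
import numpy as np

def boxcox_search(X, Y, lambdas=np.linspace(-2, 2, 81)):
    # Numerical search over lambda: for each candidate, transform Y, fit OLS,
    # and evaluate the Box-Cox profile log-likelihood; keep the best lambda
    n = len(Y)
    best = None
    for lam in lambdas:
        Yt = np.log(Y) if abs(lam) < 1e-12 else (Y ** lam - 1) / lam
        b1 = np.sum((X - X.mean()) * (Yt - Yt.mean())) / np.sum((X - X.mean()) ** 2)
        SSE = np.sum((Yt - (Yt.mean() - b1 * X.mean()) - b1 * X) ** 2)
        ll = -0.5 * n * np.log(SSE / n) + (lam - 1) * np.sum(np.log(Y))
        if best is None or ll > best[1]:
            best = (lam, ll)
    return best  # (lambda maximizing the likelihood, its log-likelihood)
```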
• Arcsine transformation for proportions and percents: Proportions and percentages often produce a truncated and skewed distribution when the mean is not near .5. The arcsine transformation corrects these problems: ynew = 2*asn(sqr(P)), where P is a proportion. (Transform %'s to proportions by dividing by 100 before applying the transformation.)
• Fisher's z transformation for correlation coeffs: Correlation coeffs often produce a truncated & skewed distribution when the mean is not near 0. Fisher's z transformation (aka the inverse hyperbolic tangent) normalizes such distributions: ynew = ath(R), where R is a correlation between -1 & +1.
3. Transforming vars to linearize a relationship: When the error variance is constant, only X needs to be transformed.
4. Mathematical and statistical functions available in SYSTAT
• Weighted Least Squares: In rare cases where data transformations are not sufficient to take care of unequal error variance, one can use weighted least squares. In weighted least squares observations are weighted in inverse proportion to the variance of the corresponding error term, so that observations with high variance are downweighted relative to observations with low variance. We will discuss these techniques in the context of multiple regr.
• Robust Non-Parametric Regr with LOWESS
Module 4 - MATRIX REPRESENTATION OF REGR MODEL
Using matrices we can represent the simple and multiple regr models in a compact form.
(1) Definition of Matrix: A matrix is a rectangular array of elements arranged in rows & columns, e.g. A = [aij] for i = 1, 2, 3; j = 1, 2. (In aij the 1st subscript always refers to the row index & the 2nd subscript to the column index.) (2) A square matrix is a matrix w/ the same number of rows & columns. (3) A (column) [row] vector is a matrix w/ one column [row]. "Vector" alone refers to a column vector. (4) The transpose of a matrix A = [aij] is the matrix A' = [aji] in which the row & column indexes have been exchanged. (Alternative notation: A' is also written AT.) (5) Diagonal Matrix: only the diagonal elements are non-0. (6) Identity Matrix I (e.g., I3x3): the diagonal is 1, everything else is 0. In general AI = IA = A, so that I can be dropped in simplifying exprs. (7) A scalar matrix is a diagonal matrix w/ all diagonal elements equal to the same scalar λ. Multiplying by λI is the same as multiplying by λ. (8) Zero Vector: a vector 0 composed entirely of zeroes.
• Vectors and Matrices with All Elements Unity: (1) 1(rx1) is a column vector with all elements 1 (2) J(rxr) is a square matrix with all elements 1. These matrices are very useful for representing sums & means in matrix notation. EX: 1'1 = n (where 1 is nx1); 11' = J (where 1 is nx1 and J is nxn)
LINEAR DEPENDENCE & RANK OF A MATRIX
• c column vectors Ci are linearly dependent when c scalars λ1, ..., λc, not all zero, can be found such that λ1C1 + λ2C2 + ... + λcCc = 0 (this 0 is a column vector). If this only holds when all the λ's are zero, then the column vectors are linearly independent. The columns of A are linearly dependent when one column can be expressed as a linear combination of the others.
• To be able to estimate the regr coeffs in multiple regr, the columns of the matrix X of IVs must be linearly independent. (The rows of X do not have to be linearly independent, so that some of the observations may have the same values of the IVs. But there must be more linearly independent rows than the number of IVs, including the constant.)
• Rank of a Matrix: the max number of linearly independent columns in the matrix.
INVERSE OF A MATRIX: (1) The inverse of A, if it exists, is defined as another matrix A^-1 such that A^-1 A = A A^-1 = I, where I is the identity matrix. (2) The inverse of a square rxr matrix A exists if the rank of A = r; A is then nonsingular. (3) Computation of the Inverse: Inverses are calculated by the computer using algorithms such as Gaussian elimination or LU decomposition. (4) Use of the Inverse: A system of linear equations such as AY = C is solved by premultiplying both sides by the inverse A^-1 (assuming A^-1 exists):
A^-1 A Y = A^-1 C, so that Y = A^-1 C is the solution (because A^-1 A simplifies to I)
THEOREMS
A+B = B+A
(A+B)+C = A+(B+C)
(AB)C = A(BC)
C(A+B) = CA+CB
λ(A+B) = λA+λB
(A')' = A
(A+B)' = A'+B'
(ABC)' = C'B'A'
(A^-1)^-1 = A
(ABC)^-1 = C^-1 B^-1 A^-1
(A')^-1 = (A^-1)'
RANDOM VECTORS & MATRICES: vectors (matrices) containing elements that are random vars
• The Expectation of a Random Vector or Matrix is the vector (matrix) whose elements are the expectations of the RVs that are the elements of the vector (matrix). In general, for an nxp matrix Y, E{Y} = [E{Yij}] for i = 1,...,n and j = 1,...,p.
• EX: (1) Y is a random vector, Y' = [Y1 Y2 Y3]. The expected value of Y is the vector E{Y}, such that E{Y}' = [E{Y1} E{Y2} E{Y3}] (2) EX: In the regr model ε is the vector of errors, so that ε' = [ε1 ε2 ... εn]. Then E{ε} = 0(nx1), so that E{ε}' = [E{ε1} E{ε2} ... E{εn}] = [0 0 ... 0].
• Variance-Covariance Matrix of a Random Vector: denoted σ²{Y}, it is the matrix containing the variances and covariances of the elements of the random vector Y. (Distinguish the matrix σ²{Y} from the scalar σ²{Y}.)
EX: for Y' = [Y1 Y2 Y3],
σ²{Y} = [ σ²{Y1}    σ{Y1,Y2}   σ{Y1,Y3} ]
        [ σ{Y2,Y1}  σ²{Y2}     σ{Y2,Y3} ]
        [ σ{Y3,Y1}  σ{Y3,Y2}   σ²{Y3}   ]
In general, for Y' = [Y1 Y2 ... Yn],
σ²{Y} = [ σ²{Y1}    σ{Y1,Y2}   ...   σ{Y1,Yn} ]
        [ σ{Y2,Y1}  σ²{Y2}     ...   σ{Y2,Yn} ]
        [ ...       ...        ...   ...      ]
        [ σ{Yn,Y1}  σ{Yn,Y2}   ...   σ²{Yn}   ]
The matrix is structured so that (1) the variances are on the main diagonal (2) the covariance of elements i and j of Y is in the ith row and jth column (3) the matrix σ²{Y} is symmetric because σ{Yi, Yj} = σ{Yj, Yi} for all i < j.
It can be shown that the variance-covariance matrix can be represented in matrix notation as
σ²{Y} = E{[Y - E{Y}] [Y - E{Y}]'}
This follows from the abstract definitions of the variance & covariance of RVs: σ²{Y} = E{(Y - E{Y})²} & σ{Y, Z} = E{(Y - E{Y})(Z - E{Z})}
EX: The regr model assumes that the error terms have constant variance, σ²{εi} = σ², and are uncorrelated, σ{εi, εj} = 0. Therefore
σ²{ε} = E{εε'} = σ²I = [ σ²   0    ...  0  ]
                       [ 0    σ²   ...  0  ]
                       [ ...  ...  ...  ...]
                       [ 0    0    ...  σ² ]
Convince yourself that E{εε'} really yields this matrix.
• Basic Theorems on Random Vectors & Matrices: Y is a random vector, A a constant matrix, & W = AY. Then (1) E{A} = A (2) E{W} = E{AY} = AE{Y} (3) σ²{W} = σ²{AY} = Aσ²{Y}A', where σ²{Y} is the variance-covariance matrix of Y. These formulas are the equivalents for random vectors of the rules for the expectation & variance of a linear func of a random var Y.
LINEAR REGR MODEL IN MATRIX TERMS
• The simple linear regr model is Yi = β0 + β1Xi + εi, i = 1,..., n, so that for the entire data set we have
Y1 = β0 + β1X1 + ε1;  Y2 = β0 + β1X2 + ε2;  ...;  Yn = β0 + β1Xn + εn
Defining vectors and matrices Y, X, β, and ε such that Y' = [Y1 Y2 ... Yn], ε' = [ε1 ε2 ... εn], β' = [β0 β1], and
X = [ 1   X1 ]
    [ 1   X2 ]
    [ ... ...]
    [ 1   Xn ]
the regr model for the entire data set can be written Y(nx1) = X(nx2)β(2x1) + ε(nx1)
• The GLRM can be written Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi,p-1 + εi, i = 1,..., n, where there are p-1 IVs Xi1 to Xi,p-1 plus the constant X0, for a total of p variables on the right hand side of the equation. Defining Y & ε as before, and β' = [β0 β1 ... βp-1] and
X = [ 1   X11   X12   ...   X1,p-1 ]
    [ 1   X21   X22   ...   X2,p-1 ]
    [ ... ...   ...   ...   ...    ]
    [ 1   Xn1   Xn2   ...   Xn,p-1 ]
the regr model for the entire data set can be written Y(nx1) = X(nxp)β(px1) + ε(nx1)
(1) Y is a vector of responses; (2) β is a vector of parameters; (3) X is a matrix of constants; (4) ε is a vector of independent normal random variables
The only difference between the SLRM & the GLRM is the dimensions of X & of β.
The assumptions concerning ε are that E{ε} = 0 & the variance-covariance matrix σ²{ε} = E{εε'} = σ²I.
• It then follows that the random vector Y has expectation E{Y} = Xβ and the variance-covariance matrix of Y is the same as that of ε, so that σ²{Y} = σ²I.
LEAST SQUARES ESTIMATION of the REGR COEFFICIENTS
(1) The OLS estimator of β consists of the values of the parameters that minimize
Q = Σi=1..n (Yi - β0 - β1Xi1 - β2Xi2 - ... - βp-1Xi,p-1)²
(2) It can be shown that the OLS estimator b of β is the vector b' = [b0 b1 ... bp-1] that is the solution of the normal equations
X'Xb = X'Y
(3) The normal equations are obtained by setting the partial derivatives of Q with respect to the regr coefficients equal to zero to minimize Q. Obtain the solution b that minimizes Q by premultiplying both sides by (X'X)^-1:
(X'X)^-1 X'Xb = (X'X)^-1 X'Y (since (X'X)^-1 X'X = I, and Ib = b)
The fundamental formula of OLS:
b(px1) = (X'(pxn) X(nxp))^-1 X'(pxn) Y(nx1)
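The fundamental formula, sketched in Python (hypothetical data; note that solving the normal equations directly is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# Hypothetical data: n = 6 observations, p - 1 = 2 predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([3.1, 3.9, 7.0, 7.8, 11.1, 11.9])

X = np.column_stack([np.ones_like(X1), X1, X2])  # nxp design matrix, constant first
# b = (X'X)^-1 X'Y, computed by solving the normal equations X'Xb = X'Y
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(b)  # [b0 b1 b2]
```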
FITTED VALUES & RESIDUALS
• Fitted Values (aka Predictors): The nx1 vector ^Y containing the fitted values, ^Y' = [^Y1 ^Y2 ... ^Yn], is calculated as
^Y = Xb = X(X'X)^-1 X'Y = HY
• The matrix H = X(X'X)^-1 X' is the hat matrix. H has the following properties:
(1) H is square of dimension nxn and involves only the values of X (assumed to be fixed constants). Therefore ^Y is seen to be a linear combination of the observations Y.
(2) H is very important in outlier diagnostics, as discussed later. (The diagonal elements hii of H measure the leverage of observation Yi on its own fitted value ^Yi.)
(3) H is idempotent, i.e., HH = H. (A glimpse at a deep mathematical truth: an idempotent matrix represents a projection of a point in space onto a subspace. Idempotency reflects the fact that once a point is projected onto the subspace, the point stays there if projected again.)
• Residuals: The vector e contains the residuals ei = Yi - ^Yi, so that
(1) e = Y - ^Y = Y - Xb, or (2) e = Y - ^Y = Y - HY, or (3) e = (I - H)Y
• Like H, the matrix (I - H) is nxn, symmetric and idempotent.
• Variance-covariance Matrix of Residuals: It can be shown that the variance-covariance matrix of e is σ²{e} = σ²(I - H), estimated as
s²{e} = MSE(I - H)
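A Python sketch of the hat matrix and its properties (hypothetical data):

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
X = np.column_stack([np.ones_like(X1), X1])

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix H = X(X'X)^-1 X'
Yhat = H @ Y                            # fitted values ^Y = HY
e = (np.eye(len(Y)) - H) @ Y            # residuals e = (I - H)Y
print(np.allclose(H @ H, H))            # idempotency: HH = H
print(np.diag(H))                       # leverages h_ii
```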
ANALYSIS OF VARIANCE: Sums of Squares: ANOVA Table in Matrix Form
Source of   Scalar Notation        Standard Matrix Form                        Quadratic Matrix Form    df     MS
Variation
Regr        SSR = Σ(^Yi - Y.)²     SSR = b'X'Y - (1/n)Y'JY                     SSR = Y'[H - (1/n)J]Y    p-1    MSR = SSR/(p-1)
Error       SSE = Σ(Yi - ^Yi)²     SSE = Y'Y - b'X'Y = e'e = (Y-Xb)'(Y-Xb)     SSE = Y'(I - H)Y         n-p    MSE = SSE/(n-p)
Total       SSTO = Σ(Yi - Y.)²     SSTO = Y'Y - (1/n)Y'JY                      SSTO = Y'[I - (1/n)J]Y   n-1    Var(Y) = SSTO/(n-1)
            = ΣYi² - (1/n)(ΣYi)²
Notes: (1) to derive the quadratic forms, note that b'X' = (Xb)' = ^Y' = (HY)' = Y'H; replace b'X' by Y'H; then Y' and Y can be factored out as premultiplying and postmultiplying the expression inside (2) the df corresponding to the sums of squares now involve p, the total number of independent variables in the model including the constant term (3) one can verify that Y'JY is equal to (ΣYi)² with a 2x2 matrix (4) the quadratic exprs are all of the form Y'AY where A is a symmetric matrix
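The quadratic-form versions of the sums of squares, sketched in Python (hypothetical data; verifies SSR + SSE = SSTO):

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n, p = len(Y), 2
X = np.column_stack([np.ones_like(X1), X1])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
J = np.ones((n, n))                      # matrix of 1's
I = np.eye(n)

SSR = Y @ (H - J / n) @ Y                # Y'[H - (1/n)J]Y
SSE = Y @ (I - H) @ Y                    # Y'(I - H)Y
SSTO = Y @ (I - J / n) @ Y               # Y'[I - (1/n)J]Y
print(SSR, SSE, SSTO, SSR + SSE)         # SSR + SSE = SSTO
print(SSR / (p - 1), SSE / (n - p))      # MSR, MSE
```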
• Sums of Squares as Quadratic Forms: A quadratic form is an expression of the form Y'AY where A is a symmetric matrix. Then Y'AY = Σi=1..n Σj=1..n aij Yi Yj, where aij = aji, and Y'AY is 1x1 (a scalar). One can verify that Y'AY is a second-degree polynomial involving the squares and cross products of the observations Yi.
EX: 5Y1² + 6Y1Y2 + 4Y2² can be represented as Y'AY where Y' = [Y1 Y2] and
A = [ 5  3 ]
    [ 3  4 ]
(A is called the matrix of the quadratic form.)
All sums of squares in ANOVA for linear statistical models can be expressed as quadratic forms. (nxn quadratic forms represent distances in an n-dimensional space)
• F Test for Regr Relation: The F test is used to test whether the p-1 regr coeffs are all equal to 0. The setup is
H0: β1 = β2 = ... = βp-1 = 0
H1: not all βk (k = 1,...,p-1) equal zero
• The test is the same as for simple regr except for the df.
• F* = MSR/MSE. The decision rule is
if F* <= F(1-α; p-1, n-p), conclude H0
if F* > F(1-α; p-1, n-p), conclude H1
• Coeff of Multiple Determination: R² = SSR/SSTO = 1 - SSE/SSTO (same as before).
The coeff of multiple correlation R is the square root of R² and is always positive. The adjusted coeff of multiple determination, or adjusted R², corrects for the number of predictor variables by dividing each sum of squares by its associated df:
Ra² = 1 - (SSE/(n - p))/(SSTO/(n - 1)) = 1 - ((n - 1)/(n - p))(SSE/SSTO)
(If it is adjusted it must be better!) Note that as n becomes large the difference between R² and Ra² vanishes.
INFERENCE ON REGR COEFFICIENTS
• The OLS & ML estimators of the coefficients in b are unbiased, so E{b} = β
• Note that b is a random vector, since it is estimated from a sample. Thus it has a variance-covariance matrix, the pxp matrix σ²{b} = σ²(X'X)^-1 (the var-covar matrix of b).
• The estimated variance-covariance matrix of b is given by
s²{b} = MSE(X'X)^-1
• To derive this expr one notes that b = AY with A = (X'X)^-1X'. Thus b is a linear func of Y.
• Thus the variance of the kth regr coeff, s²{bk}, is MSE times the kth diagonal element of (X'X)^-1. The standard error of estimate s{bk} is the square root of s²{bk}. With the normal error regr model, the sampling distribution of bk is such that
(bk - βk)/s{bk} ~ t(n - p) for k = 1, ..., p-1
• Hypothesis Tests for βk (one-tailed tests are done in a similar way). The two-tailed setup is H0: βk = 0; H1: βk ≠ 0. The test statistic is t* = bk/s{bk}. The decision rule is: if |t*| <= t(1-α/2; n-p) conclude H0; if |t*| > t(1-α/2; n-p) conclude H1. OR conclude H1 if the 2-tailed P-value P{|t(n - p)| > |t*|} < α, and conclude H0 otherwise.
• CI for βk: Confidence limits for βk w/ confidence coeff 1-α are bk ± t(1-α/2; n-p)s{bk}
• CI for Mean Response E{Yh}: Given the vector Xh of specific values of the IVs, such that Xh' = [1 Xh1 Xh2 ... Xh,p-1], the mean response is E{Yh} = Xh'β & is estimated as ^Yh = Xh'b
• The expectation and variance of ^Yh are
E{^Yh} = Xh'β = E{Yh} (unbiasedness)
σ²{^Yh} = σ²Xh'(X'X)^-1Xh = Xh'σ²{b}Xh
• The variance of ^Yh is estimated as
s²{^Yh} = MSE(Xh'(X'X)^-1Xh) = Xh's²{b}Xh
so that s²{^Yh} is a func of the variances & covariances among the estimated coefficients.
• The 1-α CI for E{Yh} is
^Yh ± t(1-α/2; n-p)s{^Yh}
• Working-Hotelling Confidence Region for Mean Response: To estimate a 1-α confidence region around the regr "surface" or to test multiple mean responses, one can use the W-H confidence region ^Yh ± Ws{^Yh}, where W² = pF(1-α; p, n-p)
• Prediction Interval for a New Observation Yh(new): the 1-α prediction interval for Yh(new) corresponding to Xh is ^Yh ± t(1-α/2; n-p)s{pred}, where s²{pred} = MSE + s²{^Yh} = MSE(1 + Xh'(X'X)^-1Xh)
• One can use the Bonferroni approach too for (1) a confidence region and for (2) inference in predicting the mean of m new observations or predicting g new observations.
• Problem of Hidden Extrapolation in Estimating the Mean Response E{Yh}
Module 5 - GENERAL LINEAR MODEL
NEED FOR MODELS WITH MORE THAN ONE INDEPENDENT VAR
• Motivations for Multiple Regr Analysis: The 2 principal motivations for models with more than one IV are: (1) to make the predictions of the model more precise by adding other factors believed to affect the dependent var; in technical terms, the goal is to reduce SSE (or to increase the adjusted R-square). EX: in a model explaining prestige of current occupation as a func of years of education, add SES of family of origin and IQ (2) to support a causal theory by eliminating potential sources of spuriousness. Also called the elaboration model. EX: in a study of productivity of work crews, include both size of crew and level of bonus pay
• Supporting a Causal Theory by Eliminating Alternative Hypotheses
• In Constructing Social Theories, Arthur Stinchcombe writes "A causal law is a statement or proposition in a theory which says that there exist environments ... in which a change in the value of one var is associated with a change in the value of another var and can produce this change without any change in other variables in the environment" (p. 31). Such a causal theory can be represented schematically as
X (independent) --> Y (dependent)
• Stinchcombe argues further that to support or refute a causal theory we must: (1) observe diff values of X (i.e. causal analysis is inherently comparative, in that one must have data on several cases with different values of the IV; this emphasizes the inherent limitation of case studies in establishing general causal patterns) (2) observe covariation (i.e., an association between X and Y) (3) ascertain the direction of causality (4) ascertain nonspuriousness (i.e., whether one or more other variables affect both X and Y and thereby produce an apparent association between X and Y that is spuriously attributed to a causal influence of X on Y)
• Multiple regr analysis can be used to help achieve points 2 (measuring association) & 4 (ascertaining nonspuriousness). More advanced techniques (involving nonrecursive models) can help with point 3 (ascertaining the direction of causality).
• Ascertaining nonspuriousness is equivalent to eliminating alternative hypotheses on the source of the relationship between X & Y. In the context of regr analysis, spuriousness (= alternative explanations of the observed association between X & Y) is called specification bias. Specification bias is a more general and continuous notion than spuriousness. The idea is that if a regr model of Y on X excludes a var that is both associated w/ X & a cause of Y (the model is then called misspecified), the estimated association of Y w/ X will be inflated (or, conversely, deflated) relative to its true value. The regr estimator, in a sense, falsely "attributes" to X a causal influence that is in reality due to the omitted var(s). An example illustrating how multiple regr can be used to test for nonspuriousness: D-score by sex & scatter plot of D-score by age & sex.
• The Elaboration Model & the Standard Tabular Presentation of Regr Results
The table below shows the standard representation of the elaboration model using regr analysis.

"Unstandardized Regr Coeffs of Cognitive Development (D-score) on Sex & Age for 12 Children Aged 3 to 10 (t Ratios in Parentheses)"
Independent var          Model 1       Model 2
Constant                 10.305***     6.927***
                         (15.109)      (13.697)
Boy (boy=1, girl=0)      2.288*        -.126
                         (2.372)       (-.262)
Age (years)              --            .753***
                                       (7.775)
R-square                 .360          .917
Adjusted R-square        .296          .899
Note: *p < .05  **p < .01  ***p < .001 (2-tailed tests)

Note the following points:
• the title of the table contains information on the type of regr coeffs shown (here, unstandardized coeffs), the DV, the IVs (when there are too many to list in the title, say "on selected IVs"), the nature of the units of observation (children aged 3 to 10), & the sample size (12); when n is not the same in all the models (e.g. because of missing data), state the maximum n in the title & specify the actual sample sizes in a row labeled "N" below "Adjusted R-square"
• the IVs are introduced one at a time in successive models shown in the different columns of the table; variants of this strategy often introduce together sets of related vars, like (1) a set of indicators representing a categorical var (2) different powers of X representing a polynomial func (3) variables related substantively (EX: father's education, mother's education, and family income as indicators of family SES)

significance levels of the coeff's are indicated with asterisks. ASR convention is shown. A legend
at the bottom of the table indicates the meaning of the symbols & specifies the type of test (1- or
2-tailed). Both 2- & 1-tailed tests can be used in the same table by using a different symbol for 1tailed tests. EX: add a line at the bottom with: + p< .05 ++ p< .01 +++ p < .001 (1-tailed tests)
 both R-square & Adjusted R-square are shown; reviewers will often insist that you show the
adjusted R-square, even though N may be so large it makes no difference. (Never omit the
regular ("unadjusted") R-square, as this can be used to reconstruct F-tests from the table!)
 t-ratios(t-r) are shown in parentheses below the regr coeffs; it is better to present the t-r than the
standard error of estimate, as the t-r's are in the same metric (a Student t variate w/ n-p df) while
the standard errs are in the metric of the respective coeffs; therefore, in general, standard errs of
estimate cannot be compared & suffer different degrees of rounding err depending on the metric,
so that the t-r for some coeffs cannot be calculated from the table w/ acceptable precision.
 a place holder (--) is used to show that a var is not included in a model;
 the IVs are labeled with human readable text, not the computer symbol; use a name for the var
that is consistent with the numerical scale (e.g., SES must have values that are large for high SES
and small for low SES; values of "Democracy" must be high for democratic countries and low for
non-democratic ones); define indicator variables explicitly when any doubt is possible (e.g., an
indicator called SEX must specify whether MALE = 1 or FEMALE = 1; calling the indicator
MALE (with 1 for male and 0 for female) resolves this problem
GENERAL LINEAR MODEL: FIRST ORDER MODELS
• First Order Models: W/ p-1 IVs the general linear regr model can be written
Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi,p-1 + εi    i = 1,..., n
where there are p-1 independent variables Xi1 to Xi,p-1 plus the constant X0, for a total of p variables on the right hand side of the equation. Defining Y and ε as before, and β' = [β0 β1 ... βp-1] and
X = [ 1   X11   X12   ...   X1,p-1 ]
    [ 1   X21   X22   ...   X2,p-1 ]
    [ ... ...   ...   ...   ...    ]
    [ 1   Xn1   Xn2   ...   Xn,p-1 ]
• the regr model for the entire data set can be written Y(nx1) = X(nxp)β(px1) + ε(nx1)
• In the model (1) Y is a vector of responses (2) β is a vector of parameters (3) X is a matrix of constants (4) ε is a vector of independent normal random variables such that E{ε} = 0 and the variance-covariance matrix σ²{ε} = E{εε'} = σ²I.
• It follows that the random vector Y has expectation E{Y} = Xβ and the variance-covariance matrix of Y is the same as that of ε, so that σ²{Y} = σ²I
• The response func is E{Y} = Xβ, i.e. E{Y} = β0 + β1X1 + β2X2 + ... + βp-1Xp-1
• When the X's represent all different predictors the model is called the first order model w/ p-1 vars (the indexing of the vars starts at 0, w/ 0 indexing the intercept & p-1 indexing the last IV)
• The regr parameters are interpreted as follows: (1) β0 is the mean response when X1 = ... = Xp-1 = 0 (which is only interpretable when X1 = ... = Xp-1 = 0 is included in the scope of the model) (2) βk is the change in mean response E{Y} per unit increase in Xk when all the other predictors are held constant
• The βk's are sometimes called partial regr coeffs, but more often just regr coeffs, or unstandardized regr coeffs (to distinguish them from standardized coeffs). (Mathematically, βk corresponds to the partial derivative of the response func w/ respect to Xk: ∂E{Y}/∂Xk = βk.)
• Geometry of the First Order Multiple Regr Model
The response func (also called regr func or response surface) defines a hyperplane in p-dimensional space. When there are only 2 predictor vars (besides the constant) the response surface is a plane.
EX: first order model with 2 predictors with response func E{Y} = 10 + 2X1 + 5X2, where Y is test market sales in 10K dollars, X1 is point-of-sales expenditures in K dollars, & X2 is TV expenditures in K dollars. β1 = 2 means that, irrespective of the value of X2, increasing point-of-sales expenditures by 1K dollars increases sales by 20K dollars (2 × 10K). The parameter β2 is interpreted similarly. In the first order model the effect of a var does not depend on the values of the other vars. The effects are thus called additive or non-interactive. The response func is a plane. For example, if X2 = 2 it follows that E{Y} = 10 + 2X1 + 5(2) = 20 + 2X1, a straight line.
When there are more than 2 IVs (in addition to the constant) the regr func is a hyperplane & can no longer be visualized in 3-dimensional space. (There is an alternative geometry for multiple regr that represents the problem in n-dimensional space, where n is the number of observations. Then the vector Y of observations on the DV & each vector Xk of observations on an IV correspond to points in that n-dimensional space. In that representation, OLS estimates the perpendicular projection of the vector Y on the subspace "spanned" by the vectors Xk.)
• Standard Regr Output for Multiple Regr EX: The DV is change in pauperism (PAUP). The predictor of interest is the out-relief ratio (OUTRATIO). The control vars change in proportion of the old (PROPOLD) & change in population (POP) are introduced into the model in turn. Most elements of the multiple regr output are the same as in the simple regr; there are 2 additional elements:
1. Standardized Regr Coefficients: bk* is calculated as
bk* = bk(s(Xk)/s(Y))
Conversely: bk = bk*(s(Y)/s(Xk))
where s(Xk) & s(Y) denote the sample standard deviations of Xk & Y, respectively.
(1) The standardized coeff bk* measures the change in standard deviations of Y associated with an increase of one standard deviation of Xk. (2) Standardized coefficients permit comparisons of the relative effects of different IVs measured in different metrics (= units). (3) EX: In the full model with Yule's data, the standardized coeffs are OUTRATIO .584, PROPOLD .031, POP -.570. The coeff of OUTRATIO means that a change of one standard deviation unit in OUTRATIO is associated with a change of .584 standard deviations of PAUP. The other coefficients are interpreted similarly. The coefficients show that the effects of OUTRATIO & POP are strong & of comparable magnitude, although they are in opposite directions (.584 & -.570), & that the effect of PROPOLD is negligible (.031).
2. Tolerance or Variance Inflation Factor: The standard multiple regr output often provides a diagnostic measure of the collinearity of a predictor with the other predictors in the model, either the tolerance or the variance inflation factor.
• tolerance TOL = 1 - Rk², where Rk² is the R-square of the regr of Xk on the other p-2 predictors in the regr & a constant. A TOL value close to 1 means that Xk is not highly correlated w/ the other predictors in the model. A value close to 0 means that Xk is highly correlated w/ the other predictors; one then says that Xk is collinear w/ the other predictors.
• VIF = (TOL)^-1 = (1 - Rk²)^-1. The variance inflation factor (VIF) is the inverse of the TOL. Large values of VIF therefore indicate a high level of collinearity.
• A common rule of thumb is to take TOL < .1 or VIF > 10 as an indication that collinearity may unduly influence the results. EX: In the full model with Yule's data, the TOLs are OUTRATIO .985, PROPOLD .711, POP .719. The corresponding VIFs are 1.015, 1.406, and 1.391, respectively. These values do not indicate any problem with collinearity.
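A sketch of TOL and VIF computed by brute force in Python, following the definition above (each predictor is regressed on the other predictors and the constant; the design matrix passed in is hypothetical input):

```python
import numpy as np

def tol_and_vif(X):
    # X: nxp design matrix with the constant in column 0.
    # For each predictor k, regress X_k on the remaining columns and
    # compute R_k^2; TOL = 1 - R_k^2, VIF = 1/TOL.
    n, p = X.shape
    out = {}
    for k in range(1, p):
        others = np.delete(X, k, axis=1)
        xk = X[:, k]
        bhat = np.linalg.lstsq(others, xk, rcond=None)[0]
        resid = xk - others @ bhat
        r2_k = 1 - np.sum(resid ** 2) / np.sum((xk - xk.mean()) ** 2)
        tol = 1 - r2_k
        out[k] = (tol, 1 / tol)
    return out  # {column index: (TOL, VIF)}
```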
• Options of the General Linear Model: The general linear model need not contain only first powers of different predictors. The X's can also represent (1) different powers of a single var (polynomial regr) (2) interaction terms represented as the product of two or more vars (3) qualitative (categorical) vars represented by one or more indicators (vars w/ values 1 or 0, aka "dummy variables") (4) mathematical transformations of vars. (EX: use of polynomial expressions, categorical vars, & mathematical transformations of a var w/in the general linear model.)
Module 6 - POLYNOMIAL REGR & INTERACTIONS
POLYNOMIAL REGR WITH ONE PREDICTOR VAR
• A nonlinear relationship between Y & X can often be approximately represented w/in the general linear model as a polynomial func of X.
• EX: Yi = β0 + β1Xi + β2Xi² + εi may be represented as Yi = β0 + β1Xi1 + β2Xi2 + εi, where Xi1 = Xi and Xi2 = Xi²
• To estimate a polynomial func, X is often first deviated from its mean (or median) to reduce collinearity between X and higher powers of X. A var deviated from its mean is called centered. The transformation is x = X - X., where x (lower case) represents the centered var.
• A polynomial func can be used when (1) the true response func is polynomial (2) the true response func is unknown but a polynomial is a good approx of its shape
• Note the response func E{Y} for any polynomial model w/ one predictor var can be represented on a 2-D plot of Y against X. A 2nd degree polynomial implies a parabolic relationship. The signs of the coeffs determine the shape of the response func. When β2 is positive, Y increases as the values of X increase. When β2 is negative, Y eventually decreases as the values of X increase.
• EX: the Kuznets curve postulates an inverted U-shaped relationship between income inequality & economic development. This curvilinear relationship is often approximated w/ a 2nd degree polynomial (aka quadratic func). There, development of a country is measured as logged GDP per capita.
• Higher degree polynomials produce curves with more inflection points.
• When estimating a polynomial func, it is often useful to test for the joint significance of the coefficients of X, X², and higher powers of X, in addition to testing for the significance of each coeff separately. In a joint test of significance one tests H0: β1 = β2 = 0 against the alternative that at least one of the coeffs is not zero. We will see later how to do joint significance tests.
• NOTES: (1) when evaluating the shape of a polynomial response func, it is necessary to keep within the range of X in the data, as extrapolating beyond this range may lead to misleading predictions (2) it is possible to convert from the coefficients of the centered model (involving x) to the non-centered model involving the original X; however, the conversion is rarely needed for substantive purposes (3) fitting a polynomial regr with powers higher than three is rarely done, as the interpretation of the coefficients becomes difficult and interpolation tends to become erratic. (A polynomial of order n-1 can always be fitted exactly to n points.) (4) polynomial regr models are often fitted with the hierarchical approach, in which higher powers are introduced one at a time and tested for significance, and if a term of a high order is included (say, x³) then all terms of lower order (x and x²) are also included.
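A minimal Python sketch of fitting a centered quadratic (hypothetical, roughly parabolic data; b2 < 0 corresponds to an inverted-U shape as in the Kuznets example):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.0, 3.4, 4.1, 4.5, 4.4, 3.9, 3.0, 1.6])  # roughly parabolic

x = X - X.mean()                        # centering reduces collinearity of x and x^2
M = np.column_stack([np.ones_like(x), x, x ** 2])
b = np.linalg.solve(M.T @ M, M.T @ Y)   # [b0 b1 b2]; b2 < 0 -> inverted-U shape
print(b)
```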
POLYNOMIAL REGR WITH MORE THAN ONE PREDICTOR

A second-order model with two predictors has the general response func
E{Y} = 0 + 1x1 + 2x2 + 11x12 + 22x22 + 12x1x2
where x1 = X1 - X.1 x2 = X2 - X.2

The x vars are centered. The indexing of the coeffs reflects the composition of the corresponding
term. The response func is a quadratic func of x1 & x2. The product term x1x2 represents the
interaction of x1 & x2. The coeff β12 therefore represents the effect of the interaction of x1 & x2
on Y. The response func of a polynomial regr model of any order involving two predictors must
be represented in 3-dimensional space with axes Y, x1, and x2. The response func can then be
represented in perspective as a surface in 3-dimensional space or by contour curves (similar to
level curves in topographical maps). Each contour curve represents the combinations of x1 and x2
that yield the same value of the response Y.
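A minimal Python sketch (simulated data; names & coefficient values are assumptions) of
constructing the second-order design matrix w/ two centered predictors, their squares, & the
cross-product term:

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(10, 2, size=150)
X2 = rng.normal(5, 1, size=150)
x1, x2 = X1 - X1.mean(), X2 - X2.mean()   # centered predictors

# columns: 1, x1, x2, x1^2, x2^2, x1*x2  ->  coeffs b0, b1, b2, b11, b22, b12
D = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
Y = (2 + 1.0 * x1 - 0.5 * x2 + 0.3 * x1**2 + 0.2 * x2**2 + 0.4 * x1 * x2
     + rng.normal(0, 1, size=150))
b, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(dict(zip(["b0", "b1", "b2", "b11", "b22", "b12"], b.round(2))))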

Polynomial models involving more than 2 predictors are possible but the response func can no
longer be represented in 3-dimensional space.
INTERACTION REGR MODELS

A regr model with p-1 predictors is called additive if the response func can be written in the form
E{Y} = f1(X1) + f2(X2) + ... + fp-1(Xp-1) where f1, f2, ..., fp-1 can be any funcs. Models that are not
additive contain interaction effects. Interactions are commonly represented as cross-product terms
called interaction terms. The simplest interaction model has response func E{Y} = β0 + β1X1 + β2X2 + β3X1X2

One may view this model as a special case of the quadratic (second-order) model, without the square
terms. The meaning of the regr coeffs β1 & β2 is not the same as it is in a model without interaction.
In the interaction model, the change in E{Y} with a unit increase in X1 when X2 is held constant is
β1 + β3X2, and the change in E{Y} with a unit increase in X2 when X1 is held constant is β2 + β3X1.
Therefore, the effect of each of X1 & X2 depends on the level of the other var (so that the regr model
is no longer additive). NOTE: the effects of X1 & X2 are obtained by differentiating E{Y} with
respect to X1 and X2, respectively.
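A small Python sketch of the point just made (the coefficient values are hypothetical): the marginal
effect of X1 in the interaction model is β1 + β3X2, so it changes w/ the level of X2:

b1, b3 = 2.0, -0.5                  # hypothetical estimates of beta1 & beta3

def effect_of_X1(x2_level):
    # marginal effect of a unit increase in X1, holding X2 at x2_level
    return b1 + b3 * x2_level

for level in (0.0, 2.0, 4.0):
    print("X2 =", level, "-> effect of X1 =", effect_of_X1(level))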

QUALITATIVE PREDICTORS & INDICATOR VARIABLES  A qualitative var is represented in
the regr model by 0/1 indicator vars, one for each class except an omitted reference class. EX: Y
depends on Number of Households in the area (a continuous var) & on Location, a qualitative var
w/ 3 classes: Highway, Shopping Mall, & Street. Location is represented by 2 indicators: Xi2 = 1
if Shopping Mall (0 otherwise) & Xi3 = 1 if Street (0 otherwise). The 2 indicators are included in
the regr model

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi

where Xi1 is Number of Households in the area (a regular continuous var), and Xi2 and Xi3 are
the two location indicators. To understand the meaning of the coefficients one must examine the
regr func E{Y} = β0 + β1X1 + β2X2 + β3X3
E{Y} = 0 + 1X1 + 2X2 + 3X3 Regr func
There areLocation
E{Y} = 0 + 1X1 +  (0)+ 3(0) E{Y} = 0 + 1X1
three cases, Highway
depending Shopping Mall E{Y} = 0 + 1X1 + 2(1) + 3(0) E{Y} = ( 0 + 2) + 1X1
on location Street
E{Y} = 0 + 1X1 + 2(0) + 3(1) E{Y} = (0 + 3) + 1X1
The table shows that β2 & β3 represent the differences in intercept for the Shopping Mall and
Street locations, respectively, relative to Highway (the omitted category). This can be seen by
plotting the regr func for the 3 locations.
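A minimal pandas sketch (made-up data) of representing a 3-class qualitative var by k-1 = 2
indicators, w/ Highway as the omitted (reference) category as in the table above:

import pandas as pd

df = pd.DataFrame({
    "location": ["Highway", "Shopping Mall", "Street", "Highway", "Street"],
    "households": [120, 300, 210, 90, 180],
})
dummies = pd.get_dummies(df["location"])              # one 0/1 column per class
X = df[["households"]].join(dummies[["Shopping Mall", "Street"]])  # omit Highway
print(X)

Including all 3 location dummies together w/ the constant term would make the columns of the
design matrix linearly dependent, which is the singularity problem discussed next.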
A categorical var w/ k classes (categories) is represented by k-1 indicators, w/ 1 category omitted.
Using k indicators (together w/ a constant term) to represent a categorical var w/ k classes would
make the X'X matrix singular, so β cannot be estimated. In the following example, units are
insurance companies of 2 types, Stock or Mutual; w/ X1: size of firm, X2: Stock Firm, & X3:
Mutual Firm, only one of the two indicators X2 & X3 can be included in the model.
The following exhibits show another example of the use of indicator variables, using
the insurance innovation data. The DV Y is elapsed time for an innovation to be
adopted by insurance companies. There are 10 mutual firms and 10 stock firms. The
continuous IV X1 is size of firm (in millions of dollars).
The estimated regr func is (t-ratios in parentheses)

Ŷ = 33.87407 - .10174X1 + 8.05547X2
     (18.68)    (-11.44)    (5.52)

with n = 20; R² = 1,504.41/1,680.80 = .895
The separate regr funcs for Mutual & Stock firms can be plotted w/ the actual data points, w/ the
type of firm identified by a symbol. This shows that the indicator model assumes (or constrains)
the effect of firm size to be the same for both types of firm, so the regr lines are parallel.
INDICATOR MODELS WITH INTERACTIONS

The coeff of a continuous IV (e.g. size of firm) may be allowed to vary as a func of an indicator var
(e.g. firm type) by using an interaction term. In the insurance innovation example the response func
for the interaction model becomes E{Y} = β0 + β1X1 + β2X2 + β3X1X2 where, as before, X1 is Size
of Firm and X2 is Stock Firm (1 if stock company, 0 otherwise). Again, to understand the meaning
of the coeffs one must examine the regr (response) func for each category of the qualitative var.
Type of Firm         Regr func E{Y} = β0 + β1X1 + β2X2 + β3X1X2
Mutual Firm (X2=0)   E{Y} = β0 + β1X1 + β2(0) + β3X1(0) = β0 + β1X1
Stock Firm (X2=1)    E{Y} = β0 + β1X1 + β2(1) + β3X1(1) = (β0 + β2) + (β1 + β3)X1

Thus in the interaction model both the intercept & the slope of the continuous var X1 differ across
types of firm. This can be seen by plotting the regr funcs for the two firm types. W/ a significant
interaction term, the two regr lines might look as follows

For these data, however, it is found that the estimate of β3 is non-sig (t* = -.02). We might have
expected the absence of a sig interaction from the appearance of the symbolic plot in NKNW
Figure 11.2, p. 460, as the trends for the two firm types seem roughly parallel. More complex
indicator models involving qualitative vars w/ more than 2 classes, more than 1 qualitative var,
w/ or w/out interaction terms are possible. They are interpreted in essentially the same way.
COMPARISON OF TWO OR MORE REGR FUNCTIONS  A very common research strategy is
to study the similarities & differences between regr models for 2 or more populations.

EX: a comparison of two production lines for making soap bars. For each production line, the
relation of interest is that between the amount of scrap for the day (the dependent var) and the
speed of the production line. A symbolic scatter plot of Amount of Scrap against Line Speed
suggests that the regr relation is not the same for the two production lines (next exhibit).

When it is reasonable to assume that the error term variances in the regr models for the different
populations are equal, one can use indicator vars to test the equality of the diff regr funcs. (When
the variances are not equal, a suitable transformation of Y can equalize them approximately.
NKNW p. 472) This is done by considering the diff populations as classes of a predictor var,
defining indicator vars for the diff populations, & estimating a single regr model containing
appropriate interaction terms, similar to the insurance innovation interaction model above.

In the soap production data define an interaction model w/ regr func E{Y} = β0 + β1X1 + β2X2 + β3X1X2
where Y = Amount of Scrap, X1 = Line Speed, & X2 is Line 1 (= 1 if production line 1, 0 if
production line 2). The estimated regr model is (t-ratios in parentheses)


Ŷ = 7.57 + 1.322X1 + 90.39X2 - .1767X1X2
    (.36)   (14.27)    (3.19)     (-1.37)
so one concludes that the slope of the relationship between Amount of Scrap & Speed does not
differ across production lines (b3 is non-sig w/ t* = -1.37) but that the intercepts are significantly
different, so that Amount of Scrap is overall higher in Line 1 than in Line 2 (b2 is sig w/ t* = 3.19).
(One can formally test the equality of the regr lines with the joint test of H0: β2 = β3 = 0, as
sketched below.)
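A sketch of this joint test in Python w/ statsmodels (simulated data standing in for the soap
production data; the variable names & coefficient values are assumptions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 30
speed = rng.uniform(100, 300, size=n)
line1 = (np.arange(n) < n // 2).astype(float)   # X2: 1 for line 1, 0 for line 2
scrap = 10 + 1.3 * speed + 90 * line1 + rng.normal(0, 20, size=n)

df = pd.DataFrame({"scrap": scrap, "speed": speed, "line1": line1,
                   "speed_line1": speed * line1})
fit = smf.ols("scrap ~ speed + line1 + speed_line1", data=df).fit()
# joint F-test that intercept & slope are the same for the two lines
print(fit.f_test("line1 = 0, speed_line1 = 0"))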
PIECEWISE LINEAR REGR & DISCONTINUITIES  Indicator vars can be used to model
situations in which the slope of the regr of Y on X differs for two ranges of X

Assuming that Xp (the point where the slopes change) is known, the piecewise linear relation
is represented by a regr model with response func E{Y} = β0 + β1X1 + β2(X1 - 500)X2 where
Y = Unit Cost, X1 = Lot Size, X2 = 1 if X1 > Xp = 500 (0 otherwise), & Xp = 500. One can
convince oneself that this func represents the piecewise linear relation by examining the response
func separately for the range X1 <= 500 & the range X1 > 500, as shown in the next exhibit

The model can be estimated from the data. Estimated regr func: Ŷ = 5.89545 - .00395X1 - .00389(X1 - 500)X2

Piecewise linear modelling can be easily extended to more than 2 pieces. See NKNW p. 477.
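A minimal numpy sketch (simulated data; the true coefficient values are assumptions) of estimating
the piecewise linear model w/ known change point Xp = 500:

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.uniform(100, 900, size=120)       # e.g., lot size
X2 = (X1 > 500).astype(float)              # indicator: 1 if X1 > Xp
hinge = (X1 - 500) * X2                    # 0 below Xp, X1 - 500 above
Y = 6 - 0.004 * X1 - 0.004 * hinge + rng.normal(0, 0.1, size=120)

D = np.column_stack([np.ones_like(X1), X1, hinge])
b, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(b)   # slope below Xp is b[1]; slope above Xp is b[1] + b[2]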

When Xp is not known, one may (1) attempt to guess its position from the scatterplot (2) use nonlinear regr techniques to estimate Xp together with the other parameters of the model iteratively.

One can also use indicators to model discontinuous piecewise linear regr relations (NKNW p. 477)
INDICATORS VERSUS QUANTITATIVE VARIABLES
Indicators Versus Allocated Codes

Qualitative variables with ordinal categories can often be represented either by allocated codes or
by indicators. For example, persons of Hispanic origin might be asked the question "How often
do you use Spanish at home?" with response categories Frequently, Occasionally, Never. The
var may be coded with allocated codes in var X1, or with two indicators X2 and X3:
Alternative codings of frequency of Spanish use at home
(X1: allocated codes; X2 (FrequentUser) & X3 (OccasionalUser): indicators):

Class          X1   X2   X3
Never           1    0    0
Occasionally    2    0    1
Frequently      3    1    0

Using allocated codes X1 in the model with response func E{Y} = β0 + β1X1 constrains
differences in response func among classes to be the same:

Never: E{Y}=0 + 1. 1 Occasionally: E{Y} = 0 + 2. 1 Frequently: E{Y} = 0 + 3. 1

Therefore , E{Y|Frequently}-E{Y|Occasionally}=E{Y|Occasionally}-E{Y|Never} = 1

assumption of constant diff in effect among contiguous classes may not be substantively plausible.

Using indicators X2 & X3 in the model w/ response func E{Y} = β0 + β2X2 + β3X3 allows differences
in response funcs among classes to be different: (1) E{Y|Frequently} - E{Y|Never} = β2
(2) E{Y|Occasionally} - E{Y|Never} = β3 (3) E{Y|Frequently} - E{Y|Occasionally} = (β2 - β3)

Thus in the indicators model the effects of the classes are not arbitrarily restricted.
Indicators Versus Quantitative Vars: For the same reasons, it may be useful to use a set of indicators
rather than continuous values even when a var is inherently quantitative. EX: to study the
relationship of earnings with age, age in years is represented by a set of indicators corresponding
to 5-year categories 20-24, 25-29, 30-34, etc. Using indicators may allow for a better "tracking"
of the non-linear relationship between Y (earnings) and X (age). The disadvantage of indicators
is that more degrees of freedom are consumed (k-1 indicators versus 1 continuous var), but this is
not a problem with large data sets.
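A short pandas sketch (made-up ages) of converting a quantitative var into 5-year category
indicators:

import pandas as pd

df = pd.DataFrame({"age": [22, 27, 31, 24, 33, 29]})
bins = [20, 25, 30, 35]                    # bins 20-24, 25-29, 30-34
df["age_cat"] = pd.cut(df["age"], bins=bins, right=False,
                       labels=["20-24", "25-29", "30-34"])
indicators = pd.get_dummies(df["age_cat"], drop_first=True)   # k-1 dummies
print(indicators)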
Alternative Coding for Indicators: Alternatives to the coding of a qualitative var with k-1 (0,1)
indicators and a constant term are (1) k-1 indicators coded (0,1,-1) and a constant term. This is
the same as (0,1) coding except that observations corresponding to the reference class are coded
-1 for all the indicators. One can show that the intercept then represents an "average" intercept for
the classes. This type of coding is important in ANOVA. (2) k (0,1) indicators with no constant
term. The coefficients of the indicators are then interpreted as class-specific intercepts.
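The three codings can be written out explicitly; a small Python sketch (class labels follow the
Spanish-use example above, w/ Never as the reference class and columns ordered [X2, X3]):

# (a) k-1 (0,1) indicators + constant: reference class = all zeros
dummy     = {"Never": [0, 0], "Occasionally": [0, 1], "Frequently": [1, 0]}
# (b) k-1 (0,1,-1) indicators + constant: reference class = all -1
effect    = {"Never": [-1, -1], "Occasionally": [0, 1], "Frequently": [1, 0]}
# (c) k (0,1) indicators, no constant: class-specific intercepts
cellmeans = {"Never": [1, 0, 0], "Occasionally": [0, 1, 0], "Frequently": [0, 0, 1]}

for c in ("Never", "Occasionally", "Frequently"):
    print(c, dummy[c], effect[c], cellmeans[c])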

NKNW pp. 481-2 & Sec. 16.11 pp. 696-701 discuss the relationship between regr & ANOVA.