(1.1) Yi=0+1Xi + i i=1,2,...n
(1) Yi is the value of the DV for the ith observation (2) 0 & 1
are parameters (or coeffs) (3) Xi is the value of the IV (predictor V) for the ith observation, and can be
viewed as a known constant (4) i is a random error term with mean E{i}=0 & variance 2{i}=2
; i & j are uncorrelated so that their covariance is 0 (i.e., {i, j}=0 for all i,j; i<j)
The normal error regr model is:
(1.24) Yi = 0 + 1Xi + i i=1,2,...n
 Yi is the value of the dependent var for the ith observation

0 and 1 are parameters (or coefficients)
 Xi is the value of the independent (predictor) var for the ith observation, and can be
viewed as a known constant
 i are independent N(0, 2) [normally distributed with mean 0 and variance 2]
 This is the same model as (1.1) except that it assumes i are normally distributed. As a
consequence, assumption that i are uncorrelated becomes assumption of independence.
 The assumption of normality of the errors is justified for 2 principal reasons: (1) if the
error terms i are the sum of many independent unmeasured factors, the central limit
theorem implies that the distribution of i approaches normality as the number of
factors becomes large. (2) Normality of i implies tests based on the Student t
distribution. These tests are not sensitive to small departures from normality. Tests
based on the normality assumption are therefore robust.

MAXIMUM LIKELIHOOD ESTIMATION:
MODULE 1 - SIMPLE LINEAR REGR MODEL (SLRM)
(1) Regr analysis: "a statistical methodology that utilizes the relation between 2 or more quantitative
vars. so that one var. [the dependent or response var.] can be predicted from the other, or others [the
independent or predictor vars.]."(2) A functional relation is exact, - a single value of the DV Y
corresponds to a value of X. (3) A statistical relation is not exact, so that Y is not uniquely determined
as a func of X. There is random scatter around a trend representing the relationship of Y w/ X. (4) The
regr func of Y on X represents the relationship of the mean of the probability distribution of Y as a
func of X. (The regr func need not be linear.) (5) In the most general form, the regr model formalizes
the two ingredients of a statistical relationship: (A) there is a probability distribution of Y for each
level of X (representing the scatter of points around the main trend) (B) the means of these probability
distributions vary in some systematic fashion as a func of X
SLRM - DISTRIBUTION OF ERRORS UNSPECIFIED
(1.1) Yi = β0 + β1Xi + εi    i = 1, 2, ..., n
(1) Yi is the value of the DV for the ith observation (2) β0 & β1 are parameters (or coeffs) (3) Xi is the value of the IV (predictor var) for the ith observation, and can be viewed as a known constant (4) εi is a random error term with mean E{εi} = 0 & variance σ²{εi} = σ²; εi & εj are uncorrelated so that their covariance is 0 (i.e., σ{εi, εj} = 0 for all i, j; i < j)
This model is (1) simple [only 1 independent var] (2) linear in the parameters (3) linear in the predictor var [X appears only as first power].
Properties of model 1.1: (1) Yi is a random var (RV), since it is a func of the random term εi (2) E{Yi} = β0 + β1Xi. Therefore the regr func is E{Y} = β0 + β1X (3) σ²{Yi} = σ²{β0 + β1Xi + εi} = σ²{εi} = σ², since the variance of the error εi is assumed the same regardless of the value of X (4) Yi & Yj are uncorrelated, since εi & εj are uncorrelated
The parameters β0 and β1 are the regr coefficients: (1) the slope β1 indicates the change in the mean of the probability distribution of Y (i.e., E{Y}) per unit increase in X (2) the intercept β0 corresponds to the mean of the probability distribution of Y (E{Y}) at X = 0 (only meaningful when the scope of the model includes X = 0)
ESTIMATION OF REGR FUNC BY THE METHOD OF OLS (Ordinary Least Squares)

The OLS method consists in finding estimates b0 and b1 of β0 and β1 which minimize the quantity Q, the sum of the squared deviations ei = Yi - (β0 + β1Xi) of Yi from its expected value:
(1.8) Q = Σi=1..n (Yi - β0 - β1Xi)²
• Q can be minimized by: (1) a numerical search using a "grid" of values for b0 and b1 (2) deriving an analytical solution, which is possible in this case. It can be shown that the b0 and b1 that minimize Q are the solutions of the simultaneous equations called the normal equations:
(1.9a) ΣYi = nb0 + b1ΣXi
(1.9b) ΣXiYi = b0ΣXi + b1ΣXi²
• One can calculate ΣYi, ΣXi, ΣXiYi, and ΣXi² from sample obs., and then solve the normal equations for b0 and b1 using the formulas below. Calculate b1 first, then b0:
b1 = Σ(Xi - X.)(Yi - Y.) / Σ(Xi - X.)²
b0 = Y. - b1X.
b0 & b1 are called "point estimates" of β0 & β1 (in contrast to "interval estimates" in connection w/ statistical inference)
• Properties of LS estimators: The Gauss-Markov theorem states that under the assumptions of the regr model 1.1, b0 and b1 are the BLUE. The G-M theorem holds even though the shape of the distribution of the errors is not specified (e.g., it need not be normal). (1) Best: the sampling distributions of b0 & b1 have minimum variance among all unbiased linear estimators (2) Linear: it can be shown that b0 & b1 are linear functions of the observations Yi (3) Unbiased: i.e., E{b0} = β0 & E{b1} = β1 (4) Estimator
ESTIMATED MEAN RESPONSE: The regr func E{Y} = β0 + β1X is estimated as
^Y = b0 + b1X
where ^Y (Y hat) is the estimated regr func at level X of the IV. (^Y is also called the "predictor" or "estimate" of Y.) One can show as an extension of the G-M theorem that ^Y is also the BLUE of E{Y}. ^Yi is called the "fitted value" of Y for observation i.
RESIDUALS: The ith residual ei is the difference between Yi & ^Yi: ei = Yi - ^Yi = Yi - b0 - b1Xi
ei is not the same as εi:
ei = Yi - ^Yi (where ^Yi is a known fitted value)
εi = Yi - E{Yi} (where E{Yi} is the unknown true expected value of Y)
Properties of fitted values & residuals:
• Σei = 0
• Σei² is minimum (by LS)
• ΣXiei = 0
• Σ^Yiei = 0
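To make the properties concrete, a short Python check (same hypothetical data as above; each sum should be 0 up to floating-point error, except Σei², which is the minimized criterion Q):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Yhat = b0 + b1 * X   # fitted values
e = Y - Yhat         # residuals
# sum(e), sum(X*e), sum(Yhat*e) should all be ~0; sum(e**2) is the minimized Q
print(np.sum(e), np.sum(e**2), np.sum(X * e), np.sum(Yhat * e))
```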
ANALYSIS OF VARIANCE (ANOVA) INTERPRETATION

Decomposition of the total deviation: the total deviation of Yi from the mean (the total variation in Yi) can be decomposed into two components:
Yi - Y.  =  (^Yi - Y.)  +  (Yi - ^Yi)
total deviation of Yi from mean = deviation of fitted value from mean + deviation of Yi from fitted value
One defines sums of squares corresponding to each deviation:
Symbol:   SSTO                 =   SSR                  +   SSE
Formula:  Σ(Yi - Y.)²          =   Σ(^Yi - Y.)²         +   Σ(Yi - ^Yi)²
Name:     total sum of squares     regr sum of squares      error (or residual) sum of squares
Meaning:  total variation in Y     variation in Y           variation in Y around regr line
                                   "accounted for" by
                                   regr line
This is actually a remarkable and non-obvious property that must be proven!
Alternative computational formulas are:
SSTO = ΣYi² - nY.²
SSR = b1²Σ(Xi - X.)²

Degrees of Freedom (df): To each sum of squares correspond df. Df are additive:
(n - 1)       =   1            +   (n - 2)
df for SSTO       df for SSR       df for SSE
For SSTO, 1 df is lost estimating Y. (cf. the sample variance s²); SSR has 1 df, corresponding to b1; for SSE, 2 df are lost estimating b0 & b1.

Mean squares are the sums of squares divided by their respective df (mean squares are not additive):
s²(Y) = SSTO/(n - 1)   sample variance of Y
MSR = SSR/1            regr mean square
MSE = SSE/(n - 2)      error mean square
MSE is an estimator of σ², the variance of ε. (This makes sense since MSE is the sum of the squared residuals divided by the df of this sum, n - 2.) It can be shown that E{MSE} = σ².
SQRT(MSE), the standard error of estimate, is an estimator of σ, the standard deviation of ε.
ANOVA Table for Simple Linear Regr:
Source of variation   Sum of Squares        df     Mean Square        E{MS}
Regr                  SSR = Σ(^Yi - Y.)²    1      MSR = SSR/1        σ² + β1²Σ(Xi - X.)²
Error                 SSE = Σ(Yi - ^Yi)²    n-2    MSE = SSE/(n-2)    σ²
Total                 SSTO = Σ(Yi - Y.)²    n-1
DESCRIPTIVE MEASURES OF ASSOCIATION

Coeff of (Simple or Multiple) Determination or "R-square":
r² = (SSTO - SSE)/SSTO = SSR/SSTO = 1 - SSE/SSTO, where 0 <= r² <= 1
Limiting cases: (1) if all observations lie on the regr line, then SSE = 0 & r² = 1 (2) if the slope b1 = 0 then SSR = 0 & r² = 0 (3) r² can be interpreted as the proportion of the variation in Y "explained" by the regr model. (But "variation" refers to the sum of squared deviations, so variation is not measured in linear units; the interpretation of r² is not as intuitive as it may seem.)
• Coeff of (Simple or Multiple) Correlation:
r = ± SQRT(r²) = Σ(Xi - X.)(Yi - Y.) / SQRT(Σ(Xi - X.)² Σ(Yi - Y.)²)
The sign of r is that of b1. When r² is not 0 or 1, |r| > r², so that the psychological impact (or propaganda value) of r is stronger than that of r².
• The Standardized Regr Coeff b1* is calculated as b1* = b1(sX/sY) (= r, in simple linear regr only). Conversely: b1 = b1*(sY/sX) (= r(sY/sX), in simple linear regr only), where sX = SQRT(s²{X}) & sY = SQRT(s²{Y}) are the sample standard deviations of X & Y respectively. b1* indicates the change in the mean of the probability distribution of Y (i.e., E{Y}), measured in standard deviations of Y, associated w/ an increase in X of one standard deviation of X. (In multiple regr models, the standardized coeffs permit comparisons of the relative effects of different IVs.)
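A short Python illustration of r², r, and the standardized coeff b1* (hypothetical data; in simple linear regr b1* should equal r):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
Yhat = (Y.mean() - b1 * X.mean()) + b1 * X

SSTO = np.sum((Y - Y.mean()) ** 2)
SSE = np.sum((Y - Yhat) ** 2)
r2 = 1 - SSE / SSTO                            # coeff of determination
r = np.sign(b1) * np.sqrt(r2)                  # sign of r is the sign of b1
b1_star = b1 * X.std(ddof=1) / Y.std(ddof=1)   # standardized coeff (= r here)
print(r2, r, b1_star)
```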
[M2] NORMAL ERROR REGR MODEL & STAT. INFERENCE
NORMAL ERROR REGR MODEL
• The normal error regr model is:
(1.24) Yi = β0 + β1Xi + εi    i = 1, 2, ..., n
• Yi is the value of the dependent var for the ith observation
• β0 and β1 are parameters (or coefficients)
• Xi is the value of the independent (predictor) var for the ith observation, and can be viewed as a known constant
• the εi are independent N(0, σ²) [normally distributed with mean 0 and variance σ²]
• This is the same model as (1.1) except that it assumes that the εi are normally distributed. As a consequence, the assumption that the εi are uncorrelated becomes an assumption of independence.
• The assumption of normality of the errors is justified for 2 principal reasons: (1) if the error terms εi are the sum of many independent unmeasured factors, the central limit theorem implies that the distribution of εi approaches normality as the number of factors becomes large. (2) Normality of εi implies tests based on the Student t distribution. These tests are not sensitive to small departures from normality; tests based on the normality assumption are therefore robust.
MAXIMUM LIKELIHOOD ESTIMATION
• When the functional form of the probability distribution of εi is specified (in this case it is assumed normal), one can estimate the parameters β0, β1 and σ² by the maximum likelihood method. In essence, the ML method "chooses as estimates those values of the parameters that are most consistent with the sample data".
• ML Estimation of the Normal Regr Model: The ML estimates of β0, β1 are the values that maximize the likelihood func L. (NOTE: maximum likelihood derivations often use the logarithm (base e) of L rather than L itself.)
Parameter   ML Estimator     Remark
β0          ^β0 = b0         same as OLS
β1          ^β1 = b1         same as OLS
σ²          ^σ² = SSE/n      smaller than the OLS-based estimator (MSE = SSE/(n-2))
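A numerical sketch of ML estimation for the normal regr model (Python with scipy assumed; hypothetical data). Maximizing the log-likelihood should reproduce the OLS b0 and b1 and give ^σ² = SSE/n:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

def neg_log_lik(theta):
    # theta = (beta0, beta1, log sigma); log-parameterization keeps sigma > 0
    b0, b1, log_s = theta
    s2 = np.exp(2 * log_s)
    resid = Y - b0 - b1 * X
    return 0.5 * len(Y) * np.log(2 * np.pi * s2) + np.sum(resid ** 2) / (2 * s2)

fit = minimize(neg_log_lik, x0=np.array([0.0, 1.0, 0.0]), method="Nelder-Mead")
b0_ml, b1_ml, log_s = fit.x
print(b0_ml, b1_ml, np.exp(2 * log_s))  # sigma^2_ML = SSE/n, not SSE/(n-2)
```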
1. INFERENCE FOR β1: Inference concerning β1 (i.e., testing hypotheses and constructing confidence intervals) is based on the sampling distribution of b1. The sampling distribution of b1 "refers to the different values of b1 that would be obtained with repeated sampling when the levels of the independent variable X are held constant from sample to sample." (NKNW p. 45)
Sampling Distribution of b1
STEP 1: Sampling Distribution of b1: For the normal error regr model, the sampling distribution of b1 has the properties
Mean: E{b1} = β1;
Variance: σ²{b1} = σ² / Σ(Xi - X.)²;
Functional form: normal, so that b1 ~ N(β1, σ²{b1}) (~ means "is distributed as")
where σ² denotes the variance of εi. These properties follow from the fact that b1 is a linear combination of the observations Yi. The sampling distribution of b1 is normal because the Yi are normally distributed, so that b1 is a linear combination of independent normal RVs, which is normally distributed by Theorem A.40 in Appendix A. (The quantity Σ(Xi - X.)² plays the same role as n in the sampling distribution of the sample mean. The greater the variance of X, the smaller the variance σ²{b1} of the sampling distribution, and the more precise the estimation of β1. In experimental research one may be able to space the X values optimally to achieve the smallest σ²{b1} possible.)
STEP 2: Sampling Distribution of (b1 - β1)/σ{b1}: (b1 - β1)/σ{b1} ~ N(0, 1). This follows from the sampling distribution of b1 by the equivalent of the z transformation z = (X - μ)/σ.
STEP 3: Sampling Distribution of (b1 - β1)/s{b1}: In practice, σ is not known, so one replaces it with the sample estimate s{b1}, the square root of
s²{b1} = MSE / Σ(Xi - X.)²
• When σ{b1} is replaced by the sample estimate s{b1}, the sampling distribution is no longer normal. It becomes a Student t distribution with (n - 2) df (because 2 parameters were estimated):
(2.10) (b1 - β1)/s{b1} ~ t(n - 2)
• (This is because s{b1} is a RV rather than a fixed constant. The "extra randomness" is manifested in the thicker tails of the t distribution as compared to the normal. This is the same phenomenon as for the sampling distribution of the sample mean.)
2. Hypothesis Testing for β1
Two-sided t test:
H0: β1 = 0 ("null hypothesis")
H1: β1 ≠ 0 ("research hypothesis")
t* = (b1 - 0)/s{b1} = b1/s{b1}
if |t*| <= t(1-α/2; n-2), conclude H0;
if |t*| > t(1-α/2; n-2), conclude H1
• EX: in FLE, choose α = .05, b1 = 0.377, s{b1} = 0.014, t* = 0.377/0.014 = 26.408. t(0.975; 129) = 1.960 (from Table B.2). |t*| = 26.408 > 1.96, therefore conclude H1 (β1 ≠ 0): "There is a sig linear association between female life expectancy and the literacy rate", or "the coeff of literacy is sig different from 0".
• In practice one uses the 2-tailed (aka "2-sided") P-value associated with b1. The 2-tailed P-value is P{|t(n - 2)| > t* = 26.408} ~= .000. One concludes H1 if the 2-tailed P-value shown on the regr printout is less than α, H0 otherwise. Here the P-value is 0.000 < .05, so conclude H1 (β1 ≠ 0).
One-sided test:
H0: β1 <= 0
H1: β1 > 0
t* = (b1 - 0)/s{b1} = b1/s{b1}
if t* <= t(1-α; n-2), conclude H0;
if t* > t(1-α; n-2), conclude H1
Hint: Always write down H1 (the "research hypothesis") first; then H0 is the complement of H1.
• EX: choose α = .05; t(1-α; n-2) = t(0.95; 129) = 1.645; t* = 26.408 > 1.645, therefore conclude H1 (β1 > 0): "The coeff of the literacy rate is sig greater than 0".
• In practice one calculates the 1-tailed (aka "1-sided") P-value associated with b1 by dividing the (2-tailed) P-value shown on the regr printout by 2. If the 1-tailed P-value is less than .05 one concludes H1 (β1 > 0). Therefore, a 1-tailed test is "easier" (i.e., more likely to turn up "significant") than a 2-sided test! It is legitimate to use a 1-tailed test whenever one has a genuine directional hypothesis concerning β1. However, some puritanical statisticians frown at the use of 1-sided tests.
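The t tests above, sketched in Python with scipy (using the b1, s{b1}, and df quoted in the FLE example):

```python
from scipy import stats

# Numbers from the FLE example in the text
b1, s_b1, df = 0.377, 0.014, 129
t_star = b1 / s_b1                        # t* = b1 / s{b1}

p_two = 2 * stats.t.sf(abs(t_star), df)   # 2-tailed P-value
p_one = stats.t.sf(t_star, df)            # 1-tailed P-value (H1: beta1 > 0)
crit = stats.t.ppf(0.975, df)             # t(0.975; 129) ~= 1.96
print(t_star, crit, p_two, p_one)
```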
3. Confidence Interval (CI) for β1: b1 ± t(1-α/2; n-2)s{b1}
• Confidence intervals for β1 are again based on the sampling distribution
(2.10) (b1 - β1)/s{b1} ~ t(n - 2)
It follows that P{t(α/2; n-2) <= (b1 - β1)/s{b1} <= t(1-α/2; n-2)} = 1 - α, where 1 - α is the confidence coeff. From this one can derive the (1-α) confidence interval b1 ± t(1-α/2; n-2)s{b1}
• EX: find the .95 CI for β1 in the female life expectancy example. t(1-α/2; n-2) = t(.975; 129) = 1.96. Therefore the CI is 0.377 ± (1.96)(0.014), or (0.350, 0.404).
• Relation of CI with Hypothesis Tests: verifying that the (1-α) CI for b1 does not include the value 0 is equivalent to finding b1 sig in a direct 2-sided t test of b1.
• INFERENCE FOR β0: This is done in exactly the same way as for β1.
• F TEST OF ENTIRE REGR MODEL: One can show that
E{MSE} = σ²    E{MSR} = σ² + β1²Σ(Xi - X.)²
Note that if β1 = 0, E{MSR} = E{MSE} = σ². Therefore, an alternative to the 2-sided hypothesis test for β1 (H0: β1 = 0; H1: β1 ≠ 0) is the F test F* = MSR/MSE. If β1 = 0, E{F*} ~= σ²/σ² = 1; if β1 ≠ 0, E{F*} > 1. Thus the larger F*, the more likely that β1 ≠ 0. If H0 holds it can be shown that F* ~ (χ²(1)/1)/(χ²(n-2)/(n-2)) = F(1; n-2). The decision rule is
if F* <= F(1-α; 1, n-2), conclude H0
if F* > F(1-α; 1, n-2), conclude H1
EX: F* = MSR/MSE = 19,674.177/28.212 = 697.363. From the Table, F(.95; 1, 129) = 3.84. F* = 697.363 > 3.84, so one concludes that b1 is (rather massively) significant.
• In this case the F test adds nothing to the t test of b1. But in a multiple regr model the F test tests the significance of all the regr coefficients simultaneously. The only difference is that in the multiple regr case F* ~ F(p-1; n-p), where p-1 is the number of independent variables other than the intercept.
• Equivalence of t Test and F Test: In the simple linear regr model, F* = (t*)². EX: in the FLE example, (26.408)² = 697.382 ~= 697.363. This is not the case in the multiple regr model.
• INTERVAL ESTIMATION (CI) FOR THE MEAN RESPONSE E{Yh}
• For a given level Xh of X (which does not necessarily correspond to a data point), E{Yh} is estimated as
^Yh = b0 + b1Xh
• The Sampling Distribution of ^Yh has properties (1) Mean: E{^Yh} = E{Yh} (unbiasedness) (2) Variance: σ²{^Yh} = σ²((1/n) + (Xh - X.)²/Σ(Xi - X.)²) (3) Functional form: ^Yh ~ N (because ^Yh is a linear combination of the Yi)
• The variance of ^Yh depends on the distance of Xh from the mean X. of X. This is because, given a change in b1, the change in ^Yh is larger further away from the mean. It follows that
(^Yh - E{Yh})/σ{^Yh} ~ N(0, 1)
• In practice, one estimates the unknown σ²{^Yh} as
s²{^Yh} = MSE((1/n) + (Xh - X.)²/Σ(Xi - X.)²), so that (^Yh - E{Yh})/s{^Yh} ~ t(n-2)
• CI for a Single ^Yh: the CI for ^Yh is
^Yh ± t(1-α/2; n-2)s{^Yh}
• EX: To calc. the CI one needs info not contained in the standard regr output. X. = 59.366; 1/Σ(Xi - X.)² can be estimated as 1/(s²{X}(n-1)) = 1/((1064.572)(130)) = 7.225728×10⁻⁶. Choose Xh = 70% literacy. Then ^Yh = 62.602 & s{^Yh} = 0.48831. The 95% CI is 62.602 ± (1.96)(0.48831) = (61.645, 63.559)
• CI for Several ^Yh and the Working-Hotelling Confidence Band: Calculating CIs for several ^Yh creates problems of multiple tests, necessitating the use of Bonferroni-type techniques. Alternatively, the W-H approach can be used to construct a confidence band for the entire regr line. The Working-Hotelling 1-α confidence band for the regr line at any level Xh has the upper and lower values
^Yh ± Ws{^Yh}, where W² = 2F(1-α; 2, n-2)
• It can be shown that W is slightly wider than the corresponding t(1-α/2; n-2) of the formula for a single Xh. The W-H band can be used for any number of Xh levels.
• PREDICTION INTERVAL (PI) FOR A NEW OBSERVATION Yh(new)
• PIs are esp important for quality control in an industrial context. Only the general principle is important for us. (EX: predict the life expectancy of a specific country not in the sample from the estimated regr on literacy)
• Variation in Yh(new) is affected by two sources of variation: (1) variation in the estimated mean of the distribution of Y given Xh and (2) variation within the probability distribution of Y around its mean given Xh. Therefore,
σ²{Yh(new)} = σ² + σ²{^Yh}, estimated as
s²{Yh(new)} = MSE + s²{^Yh}
• Thus the estimated variance of the predicted Yh(new) is
s²{Yh(new)} = MSE(1 + (1/n) + (Xh - X.)²/Σ(Xi - X.)²)
as compared with
s²{^Yh} = MSE((1/n) + (Xh - X.)²/Σ(Xi - X.)²)
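A Python sketch contrasting the CI for the mean response with the PI for a new observation at the same Xh (hypothetical data; the PI is wider because of the extra MSE term):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(Y)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)

Xh = 3.5                                   # level of X (need not be a data point)
Yh_hat = b0 + b1 * Xh
s2_mean = MSE * (1/n + (Xh - X.mean())**2 / np.sum((X - X.mean())**2))
s2_pred = MSE + s2_mean                    # extra MSE term for a new observation
t = stats.t.ppf(0.975, n - 2)
print("CI:", Yh_hat - t * np.sqrt(s2_mean), Yh_hat + t * np.sqrt(s2_mean))
print("PI:", Yh_hat - t * np.sqrt(s2_pred), Yh_hat + t * np.sqrt(s2_pred))
```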
MODULE 3 - DIAGNOSTICS & REMEDIES
PROBLEMS TO BE INVESTIGATED WITH RESIDUAL ANALYSIS
"Problems" are departures from the assumptions of the regr model. Such as: (1) Regr func is not linear
(2) Error terms do not have constant variance (3) Error terms are not independent (4) Model fits all but
one or a few outlier observations (5) Error terms are not normally distributed (6) One or several
important predictor variables have been omitted from model. These problems also affect multiple regr
models. Most of the diagnostic tools used for simple linear regr are also used with multiple regr.
PLOTS OF RESIDUALS USED FOR DIAGNOSTICS

Plots typically can reveal more than one problem, so there is no one-to-one correspondence between plot types and the problems listed above. (1) Residual e by predictor var X, (2) Residual e by fitted value (or "estimate") ^Y: RESIDUAL/ESTIMATE, STUDENT/ESTIMATE, STUDENT/RESIDUAL, (3) Absolute or squared residual e by predictor var X, (4) Residual e by time or other sequence, (5) Residual e by omitted predictor var Z, (6) Box plot, stem-and-leaf, & other plots of the distribution of residual e, (7) Normal probability plot of residual e.
• The normal probability plot (7) is used to examine whether the residuals are normally distributed. It is a plot of the residuals against their expected values assuming normality. The expected values of the residuals are obtained by first sorting them in increasing order of magnitude. Then
EXPECTEDk = z((k - 0.5)/n) × SQRT(MSE)    for k = 1, ..., n
where k is the rank of the residual & z((k - 0.5)/n) is the 100·((k - 0.5)/n)th percentile of the standard normal distribution N(0,1). When the residuals are normally distributed, the plot is approximately a straight line. (A) Especially for relatively small samples, experience is needed to assess visually the linearity of the plot. A formal correlation test (explained later) can be used. Alternatively, Wilkinson, Blank, & Gruber (1996) suggest creating probability plots for 10 random samples of size n from a normal population, & judging visually whether the empirical plot 'fits' w/in this collection. (B) Some computer implementations of the normal probability plot do not multiply by SQRT(MSE) in calculating the expected residuals, which does not affect the linearity of the plot; the expected residuals are then z scores.
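A sketch of the expected-residuals calculation in Python (scipy assumed; the residual vector and MSE are hypothetical). The correlation between the sorted residuals and their expected values is the basis of the test described next:

```python
import numpy as np
from scipy import stats

def expected_residuals(e, MSE):
    # Sort residuals; rank k gets the 100*(k - 0.5)/n percentile of N(0,1),
    # scaled by sqrt(MSE) (some packages omit the scaling; linearity is unaffected)
    e = np.sort(e)
    n = len(e)
    k = np.arange(1, n + 1)
    return e, stats.norm.ppf((k - 0.5) / n) * np.sqrt(MSE)

e = np.array([0.3, -0.5, 0.1, -0.2, 0.6, -0.3])    # residuals from some fit
e_sorted, expected = expected_residuals(e, MSE=0.2)
r = np.corrcoef(e_sorted, expected)[0, 1]           # basis of the correlation test
print(expected, r)
```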
DIAGNOSTIC TESTS FOR RESIDUALS
1. Correlation Test for Normality is based on the normal probability plot. It is the correlation between the residuals ei and their expected values under normality. The higher the correlation, the "straighter" the normal probability plot, and the more likely that the residuals are normally distributed. The value of the correlation coeff can be tested, given the sample size and the α-level chosen, by comparing it with the critical value in Table B.6 in NKNW. A coeff larger than the critical value supports the conclusion that the error terms are normally distributed. EX: For the Yule data, the correlation between RESPRED and RESIDUAL corresponding to the normal probability plot is .981 with n = 32. Table B.6 gives a critical value for n = 30 of .964 with α = .05. Since r = .981 > .964, we have no reason to reject the hypothesis that the residuals are normally distributed.
2. Tests for Constancy of σ² (Homoskedasticity): (1) homoskedasticity: σ² is constant over the entire range of X (2) heteroskedasticity: σ² is not constant over the entire range of X

Two formal tests of constancy of the error variance are commonly used: (1) the Modified Levene Test (2) the Breusch-Pagan Test
• I only present the Breusch-Pagan test. This is a large-sample test that assumes that the logarithm of the variance σ²i of the error term εi is a linear func of X. The B-P test statistic is the quantity
χ²BP = (SSR*/2) / (SSE/n)²
where SSR* is the regr sum of squares of the regr of e² on X and SSE is the error sum of squares of the regr of Y on X. When n is sufficiently large and σ² is constant, χ²BP is distributed as a chi-square distribution with 1 df. Large values of χ²BP lead to the conclusion that σ² is not constant.
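A sketch of the B-P statistic in Python, following the two-regression definition above (scipy assumed; simple one-predictor case):

```python
import numpy as np
from scipy import stats

def breusch_pagan(X, Y):
    n = len(Y)
    # Main regr of Y on X: residuals and SSE
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    e = Y - (Y.mean() - b1 * X.mean()) - b1 * X
    SSE = np.sum(e ** 2)
    # Auxiliary regr of e^2 on X: its regr sum of squares SSR*
    e2 = e ** 2
    g1 = np.sum((X - X.mean()) * (e2 - e2.mean())) / np.sum((X - X.mean()) ** 2)
    fitted = (e2.mean() - g1 * X.mean()) + g1 * X
    SSR_star = np.sum((fitted - e2.mean()) ** 2)
    chi2_bp = (SSR_star / 2) / (SSE / n) ** 2       # B-P statistic as defined above
    return chi2_bp, stats.chi2.sf(chi2_bp, df=1)    # large values -> non-constant variance
```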
3. Linearity (or "Lack of Fit") Test: This test, which checks for linearity of the regr func, requires repeat observations (called replications) at one or more X levels, so it cannot be performed with all data sets. In essence the lack-of-fit test tests whether the means of Y for the groups of replicates are sig different from the fitted values ^Y on the regr line, using a kind of F test.
REMEDIAL MEASURES:
• Data Transformations
1. Standardizing a var: Common var standardizations are the z-score and the range
standardization. These transformations do not affect the shape of the distribution of the var
(contrary to some popular beliefs). The standardized value Z of Y is obtained by the formula
Z = (Y - Y.) / sY
where Y. and sY denote the sample mean and
standard deviation of Y, respectively. The range-standardized value W of Y is obtained by
the formula
W = (Y - YMIN) / (YMAX - YMIN)
where YMAX and YMIN denote
the maximum and minimum values of Y
2. Transforming a var to look normally distributed: Transformations can be used to remedy non-normality of the error term. These transformations often also take care of unequal error variance, which tends to go together w/ non-normality of the error term. Transformations of Y are appropriate when the error variance does not appear constant, as in the following situations:
• Radical normalization with the rankit transformation: It transforms any distribution into a normal one. It is the same as "grading on the curve".
• Tukey's ladder of powers: Tukey (1977) has proposed a "ladder of powers" corresponding to a continuum of transformations to normalize the distribution of a sample:
(Y^p, ..., Y², Y¹, Y^1/2, LOG(Y), -Y^-1/2, -Y^-1, -Y^-2, ..., -Y^-p)
• Automatic choice of a ladder-of-powers transformation with the Box-Cox procedure: The family of power transformations in Tukey's "ladder" is of the form Y' = Y^λ, where λ is the power parameter. The Box-Cox procedure estimates the parameter λ by maximum likelihood together with the parameters of the regr model. The complete model is thus
Yi^λ = β0 + β1Xi + εi
The estimates of λ, β0, β1 and σ² can be found by maximizing the likelihood func. Alternatively, one can do a numerical search over a range of values for λ.
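A sketch of the numerical-search variant in Python (hypothetical setup; the profile log-likelihood used here, -n/2·log(SSE/n) + (λ-1)·Σlog Yi, is the standard Box-Cox form and requires Y > 0):

```python
import numpy as np

def boxcox_search(X, Y, lambdas=np.linspace(-2, 2, 81)):
    # Numerical search over lambda: for each candidate, transform Y, fit OLS,
    # and evaluate the Box-Cox profile log-likelihood; keep the best lambda
    n = len(Y)
    best = None
    for lam in lambdas:
        Yt = np.log(Y) if abs(lam) < 1e-12 else (Y ** lam - 1) / lam
        b1 = np.sum((X - X.mean()) * (Yt - Yt.mean())) / np.sum((X - X.mean()) ** 2)
        SSE = np.sum((Yt - (Yt.mean() - b1 * X.mean()) - b1 * X) ** 2)
        ll = -0.5 * n * np.log(SSE / n) + (lam - 1) * np.sum(np.log(Y))
        if best is None or ll > best[1]:
            best = (lam, ll)
    return best  # (lambda maximizing the likelihood, its log-likelihood)
```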
• Arcsine transformation for proportions and percents: Proportions and percentages often produce a truncated and skewed distribution when the mean is not near .5. The arcsine transformation corrects these problems: ynew = 2*asn(sqr(P)), where P is a proportion. (Transform %'s to proportions by dividing by 100 before applying the transformation.)
• Fisher's z transformation for correlation coeffs: Correlation coeffs often produce a truncated & skewed distribution when the mean is not near 0. Fisher's z transformation (aka the inverse hyperbolic tangent) normalizes such distributions: ynew = ath(R), where R is a correlation between -1 & +1.
3. Transforming vars to linearize a relationship: When the error variance is constant, only X needs to be transformed.
4. Mathematical and statistical functions available in SYSTAT
• Weighted Least Squares: In rare cases where data transformations are not sufficient to take care of unequal error variance, one can use weighted least squares. In weighted least squares observations are weighted in inverse proportion to the variance of the corresponding error term, so that observations with high variance are downweighted relative to observations with low variance. We will discuss these techniques in the context of multiple regr.
• Robust Non-Parametric Regr with LOWESS
Module 4 - MATRIX REPRESENTATION OF REGR MODEL
Using matrices we can represent the simple and multiple regr models in a compact form.
(1) Definition of Matrix: A matrix is a rectangular array of elements arranged in rows & columns, e.g. A = [aij] for i = 1, 2, 3; j = 1, 2. (In aij the 1st subscript always refers to the row index & the 2nd subscript to the column index.) (2) A square matrix is a matrix w/ the same number of rows & columns. (3) A (column) [row] vector is a matrix w/ one column [row]. "Vector" alone refers to a column vector. (4) The transpose of a matrix A = [aij] is the matrix A' = [aji] in which the row & column indexes have been exchanged. (Alternative notation: A' is also written AT.) (5) Diagonal Matrix: only the diagonal elements are non-0. (6) Identity Matrix I (e.g., I3x3): the diagonal is 1, everything else is 0. In general AI = IA = A, so that I can be dropped in simplifying exprs. (7) A scalar matrix is a diagonal matrix w/ all diagonal elements equal to the same scalar λ. Multiplying by λI is the same as multiplying by λ. (8) Zero Vector: a vector 0 composed entirely of zeroes.
• Vectors and Matrices with All Elements Unity: (1) 1(rx1) is a column vector with all elements 1 (2) J(rxr) is a square matrix with all elements 1. These matrices are very useful for representing sums & means in matrix notation. EX: 1'1 = n (where 1 is nx1); 11' = J (where 1 is nx1 and J is nxn)
LINEAR DEPENDENCE & RANK OF A MATRIX
• c column vectors Ci are linearly dependent when c scalars λ1, ..., λc, not all zero, can be found such that λ1C1 + λ2C2 + ... + λcCc = 0 (this 0 is a column vector). If this only holds when all the λ's are zero, then the column vectors are linearly independent. The columns of A are linearly dependent when one column can be expressed as a linear combination of the others.
• To be able to estimate the regr coeffs in multiple regr, the columns of the matrix X of IVs must be linearly independent. (The rows of X do not have to be linearly independent, so that some of the observations may have the same values of the IVs. But there must be more linearly independent rows than the number of IVs, including the constant.)
• Rank of a Matrix: the max number of linearly independent columns in the matrix.
INVERSE OF A MATRIX: (1) The inverse of A, if it exists, is defined as another matrix A^-1 such that A^-1 A = A A^-1 = I, where I is the identity matrix. (2) The inverse of a square rxr matrix A exists if the rank of A = r; A is then nonsingular. (3) Computation of the Inverse: Inverses are calculated by the computer using algorithms such as Gaussian elimination or LU decomposition. (4) Use of the Inverse: A system of linear equations such as AY = C is solved by premultiplying both sides by the inverse A^-1 (assuming A^-1 exists):
A^-1 A Y = A^-1 C, so that Y = A^-1 C is the solution (because A^-1 A simplifies to I)
THEOREMS
A+B = B+A
(A+B)+C = A+(B+C)
(AB)C = A(BC)
C(A+B) = CA+CB
λ(A+B) = λA+λB
(A')' = A
(A+B)' = A'+B'
(ABC)' = C'B'A'
(A^-1)^-1 = A
(ABC)^-1 = C^-1 B^-1 A^-1
(A')^-1 = (A^-1)'
RANDOM VECTORS & MATRICES: vectors (matrices) containing elements that are random vars
• The Expectation of a Random Vector or Matrix is the vector (matrix) whose elements are the expectations of the RVs that are the elements of the vector (matrix). In general, for an nxp matrix Y, E{Y} = [E{Yij}] for i = 1,...,n and j = 1,...,p.
• EX: (1) Y is a random vector, Y' = [Y1 Y2 Y3]. The expected value of Y is the vector E{Y}, such that E{Y}' = [E{Y1} E{Y2} E{Y3}] (2) EX: In the regr model ε is the vector of errors, so that ε' = [ε1 ε2 ... εn]. Then E{ε} = 0(nx1), so that E{ε}' = [E{ε1} E{ε2} ... E{εn}] = [0 0 ... 0].
• Variance-Covariance Matrix of a Random Vector: denoted σ²{Y}, it is the matrix containing the variances and covariances of the elements of the random vector Y. (Distinguish the matrix σ²{Y} from the scalar σ²{Y}.)
EX: for Y' = [Y1 Y2 Y3],
σ²{Y} = [ σ²{Y1}    σ{Y1,Y2}   σ{Y1,Y3} ]
        [ σ{Y2,Y1}  σ²{Y2}     σ{Y2,Y3} ]
        [ σ{Y3,Y1}  σ{Y3,Y2}   σ²{Y3}   ]
In general, for Y' = [Y1 Y2 ... Yn],
σ²{Y} = [ σ²{Y1}    σ{Y1,Y2}   ...   σ{Y1,Yn} ]
        [ σ{Y2,Y1}  σ²{Y2}     ...   σ{Y2,Yn} ]
        [ ...       ...        ...   ...      ]
        [ σ{Yn,Y1}  σ{Yn,Y2}   ...   σ²{Yn}   ]
The matrix is structured so that (1) the variances are on the main diagonal (2) the covariance of elements i and j of Y is in the ith row and jth column (3) the matrix σ²{Y} is symmetric because σ{Yi, Yj} = σ{Yj, Yi} for all i < j.
It can be shown that the variance-covariance matrix can be represented in matrix notation as
σ²{Y} = E{[Y - E{Y}] [Y - E{Y}]'}
This follows from the abstract definitions of the variance & covariance of RVs: σ²{Y} = E{(Y - E{Y})²} & σ{Y, Z} = E{(Y - E{Y})(Z - E{Z})}
EX: The regr model assumes that the error terms have constant variance, σ²{εi} = σ², and are uncorrelated, σ{εi, εj} = 0. Therefore
σ²{ε} = E{εε'} = σ²I = [ σ²   0    ...  0  ]
                       [ 0    σ²   ...  0  ]
                       [ ...  ...  ...  ...]
                       [ 0    0    ...  σ² ]
Convince yourself that E{εε'} really yields this matrix.
• Basic Theorems on Random Vectors & Matrices: Y is a random vector, A a constant matrix, & W = AY. Then (1) E{A} = A (2) E{W} = E{AY} = AE{Y} (3) σ²{W} = σ²{AY} = Aσ²{Y}A', where σ²{Y} is the variance-covariance matrix of Y. These formulas are the equivalents for random vectors of the rules for the expectation & variance of a linear func of a random var Y.
LINEAR REGR MODEL IN MATRIX TERMS
• The simple linear regr model is Yi = β0 + β1Xi + εi, i = 1,..., n, so that for the entire data set we have
Y1 = β0 + β1X1 + ε1;  Y2 = β0 + β1X2 + ε2;  ...;  Yn = β0 + β1Xn + εn
Defining vectors and matrices Y, X, β, and ε such that Y' = [Y1 Y2 ... Yn], ε' = [ε1 ε2 ... εn], β' = [β0 β1], and
X = [ 1   X1 ]
    [ 1   X2 ]
    [ ... ...]
    [ 1   Xn ]
the regr model for the entire data set can be written Y(nx1) = X(nx2)β(2x1) + ε(nx1)
• The GLRM can be written Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi,p-1 + εi, i = 1,..., n, where there are p-1 IVs Xi1 to Xi,p-1 plus the constant X0, for a total of p variables on the right hand side of the equation. Defining Y & ε as before, and β' = [β0 β1 ... βp-1] and
X = [ 1   X11   X12   ...   X1,p-1 ]
    [ 1   X21   X22   ...   X2,p-1 ]
    [ ... ...   ...   ...   ...    ]
    [ 1   Xn1   Xn2   ...   Xn,p-1 ]
the regr model for the entire data set can be written Y(nx1) = X(nxp)β(px1) + ε(nx1)
(1) Y is a vector of responses; (2) β is a vector of parameters; (3) X is a matrix of constants; (4) ε is a vector of independent normal random variables
The only difference between the SLRM & the GLRM is the dimensions of X & of β.
The assumptions concerning ε are that E{ε} = 0 & the variance-covariance matrix σ²{ε} = E{εε'} = σ²I.
• It then follows that the random vector Y has expectation E{Y} = Xβ and the variance-covariance matrix of Y is the same as that of ε, so that σ²{Y} = σ²I.
LEAST SQUARES ESTIMATION of the REGR COEFFICIENTS
(1) The OLS estimator of β consists of the values of the parameters that minimize
Q = Σi=1..n (Yi - β0 - β1Xi1 - β2Xi2 - ... - βp-1Xi,p-1)²
(2) It can be shown that the OLS estimator b of β is the vector b' = [b0 b1 ... bp-1] that is the solution of the normal equations
X'Xb = X'Y
(3) The normal equations are obtained by setting the partial derivatives of Q with respect to the regr coefficients equal to zero to minimize Q. Obtain the solution b that minimizes Q by premultiplying both sides by (X'X)^-1:
(X'X)^-1 X'Xb = (X'X)^-1 X'Y (since (X'X)^-1 X'X = I, and Ib = b)
The fundamental formula of OLS:
b(px1) = (X'(pxn) X(nxp))^-1 X'(pxn) Y(nx1)
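The fundamental formula, sketched in Python (hypothetical data; note that solving the normal equations directly is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# Hypothetical data: n = 6 observations, p - 1 = 2 predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([3.1, 3.9, 7.0, 7.8, 11.1, 11.9])

X = np.column_stack([np.ones_like(X1), X1, X2])  # nxp design matrix, constant first
# b = (X'X)^-1 X'Y, computed by solving the normal equations X'Xb = X'Y
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(b)  # [b0 b1 b2]
```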
FITTED VALUES & RESIDUALS
• Fitted Values (aka Predictors): The nx1 vector ^Y containing the fitted values, ^Y' = [^Y1 ^Y2 ... ^Yn], is calculated as
^Y = Xb = X(X'X)^-1 X'Y = HY
• The matrix H = X(X'X)^-1 X' is the hat matrix. H has the following properties:
(1) H is square of dimension nxn and involves only the values of X (assumed to be fixed constants). Therefore ^Y is seen to be a linear combination of the observations Y.
(2) H is very important in outlier diagnostics, as discussed later. (The diagonal elements hii of H measure the leverage of observation Yi on its own fitted value ^Yi.)
(3) H is idempotent, i.e., HH = H. (A glimpse at a deep mathematical truth: an idempotent matrix represents a projection of a point in space onto a subspace. Idempotency reflects the fact that once a point is projected onto the subspace, the point stays there if projected again.)
• Residuals: The vector e contains the residuals ei = Yi - ^Yi, so that
(1) e = Y - ^Y = Y - Xb, or (2) e = Y - ^Y = Y - HY, or (3) e = (I - H)Y
• Like H, the matrix (I - H) is nxn, symmetric and idempotent.
• Variance-covariance Matrix of Residuals: It can be shown that the variance-covariance matrix of e is σ²{e} = σ²(I - H), estimated as
s²{e} = MSE(I - H)
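A Python sketch of the hat matrix and its properties (hypothetical data):

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
X = np.column_stack([np.ones_like(X1), X1])

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix H = X(X'X)^-1 X'
Yhat = H @ Y                            # fitted values ^Y = HY
e = (np.eye(len(Y)) - H) @ Y            # residuals e = (I - H)Y
print(np.allclose(H @ H, H))            # idempotency: HH = H
print(np.diag(H))                       # leverages h_ii
```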
ANALYSIS OF VARIANCE: Sums of Squares: ANOVA Table in Matrix Form
Source of   Scalar Notation        Standard Matrix Form                        Quadratic Matrix Form    df     MS
Variation
Regr        SSR = Σ(^Yi - Y.)²     SSR = b'X'Y - (1/n)Y'JY                     SSR = Y'[H - (1/n)J]Y    p-1    MSR = SSR/(p-1)
Error       SSE = Σ(Yi - ^Yi)²     SSE = Y'Y - b'X'Y = e'e = (Y-Xb)'(Y-Xb)     SSE = Y'(I - H)Y         n-p    MSE = SSE/(n-p)
Total       SSTO = Σ(Yi - Y.)²     SSTO = Y'Y - (1/n)Y'JY                      SSTO = Y'[I - (1/n)J]Y   n-1    Var(Y) = SSTO/(n-1)
            = ΣYi² - (1/n)(ΣYi)²
Notes: (1) to derive the quadratic forms, note that b'X' = (Xb)' = ^Y' = (HY)' = Y'H; replace b'X' by Y'H; then Y' and Y can be factored out as premultiplying and postmultiplying the expression inside (2) the df corresponding to the sums of squares now involve p, the total number of independent variables in the model including the constant term (3) one can verify that Y'JY is equal to (ΣYi)² with a 2x2 matrix (4) the quadratic exprs are all of the form Y'AY where A is a symmetric matrix
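The quadratic-form versions of the sums of squares, sketched in Python (hypothetical data; verifies SSR + SSE = SSTO):

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n, p = len(Y), 2
X = np.column_stack([np.ones_like(X1), X1])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
J = np.ones((n, n))                      # matrix of 1's
I = np.eye(n)

SSR = Y @ (H - J / n) @ Y                # Y'[H - (1/n)J]Y
SSE = Y @ (I - H) @ Y                    # Y'(I - H)Y
SSTO = Y @ (I - J / n) @ Y               # Y'[I - (1/n)J]Y
print(SSR, SSE, SSTO, SSR + SSE)         # SSR + SSE = SSTO
print(SSR / (p - 1), SSE / (n - p))      # MSR, MSE
```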
• Sums of Squares as Quadratic Forms: A quadratic form is an expression of the form Y'AY where A is a symmetric matrix. Then Y'AY = Σi=1..n Σj=1..n aij Yi Yj, where aij = aji, and Y'AY is 1x1 (a scalar). One can verify that Y'AY is a second-degree polynomial involving the squares and cross products of the observations Yi.
EX: 5Y1² + 6Y1Y2 + 4Y2² can be represented as Y'AY where Y' = [Y1 Y2] and
A = [ 5  3 ]
    [ 3  4 ]
(A is called the matrix of the quadratic form.)
All sums of squares in ANOVA for linear statistical models can be expressed as quadratic forms. (nxn quadratic forms represent distances in an n-dimensional space)
• F Test for Regr Relation: The F test is used to test whether the p-1 regr coeffs are all equal to 0. The setup is
H0: β1 = β2 = ... = βp-1 = 0
H1: not all βk (k = 1,...,p-1) equal zero
• The test is the same as for simple regr except for the df.
• F* = MSR/MSE. The decision rule is
if F* <= F(1-α; p-1, n-p), conclude H0
if F* > F(1-α; p-1, n-p), conclude H1
• Coeff of Multiple Determination: R² = SSR/SSTO = 1 - SSE/SSTO (same as before).
The coeff of multiple correlation R is the square root of R² and is always positive. The adjusted coeff of multiple determination, or adjusted R², corrects for the number of predictor variables by dividing each sum of squares by its associated df:
Ra² = 1 - (SSE/(n - p))/(SSTO/(n - 1)) = 1 - ((n - 1)/(n - p))(SSE/SSTO)
(If it is adjusted it must be better!) Note that as n becomes large the difference between R² and Ra² vanishes.
INFERENCE ON REGR COEFFICIENTS
• The OLS & ML estimators of the coefficients in b are unbiased, so E{b} = β
• Note that b is a random vector, since it is estimated from a sample. Thus it has a variance-covariance matrix, the pxp matrix σ²{b} = σ²(X'X)^-1 (the var-covar matrix of b).
• The estimated variance-covariance matrix of b is given by
s²{b} = MSE(X'X)^-1
• To derive this expr one notes that b = AY with A = (X'X)^-1X'. Thus b is a linear func of Y.
• Thus the variance of the kth regr coeff, s²{bk}, is MSE times the kth diagonal element of (X'X)^-1. The standard error of estimate s{bk} is the square root of s²{bk}. With the normal error regr model, the sampling distribution of bk is such that
(bk - βk)/s{bk} ~ t(n - p) for k = 1, ..., p-1
• Hypothesis Tests for βk (one-tailed tests are done in a similar way). The two-tailed setup is H0: βk = 0; H1: βk ≠ 0. The test statistic is t* = bk/s{bk}. The decision rule is: if |t*| <= t(1-α/2; n-p) conclude H0; if |t*| > t(1-α/2; n-p) conclude H1. OR conclude H1 if the 2-tailed P-value P{|t(n - p)| > |t*|} < α, and conclude H0 otherwise.
• CI for βk: Confidence limits for βk w/ confidence coeff 1-α are bk ± t(1-α/2; n-p)s{bk}
• CI for Mean Response E{Yh}: Given the vector Xh of specific values of the IVs, such that Xh' = [1 Xh1 Xh2 ... Xh,p-1], the mean response is E{Yh} = Xh'β & is estimated as ^Yh = Xh'b
• The expectation and variance of ^Yh are
E{^Yh} = Xh'β = E{Yh} (unbiasedness)
σ²{^Yh} = σ²Xh'(X'X)^-1Xh = Xh'σ²{b}Xh
• The variance of ^Yh is estimated as
s²{^Yh} = MSE(Xh'(X'X)^-1Xh) = Xh's²{b}Xh
so that s²{^Yh} is a func of the variances & covariances among the estimated coefficients.
• The 1-α CI for E{Yh} is
^Yh ± t(1-α/2; n-p)s{^Yh}
• Working-Hotelling Confidence Region for Mean Response: To estimate a 1-α confidence region around the regr "surface" or to test multiple mean responses, one can use the W-H confidence region ^Yh ± Ws{^Yh}, where W² = pF(1-α; p, n-p)
• Prediction Interval for a New Observation Yh(new): the 1-α prediction interval for Yh(new) corresponding to Xh is ^Yh ± t(1-α/2; n-p)s{pred}, where s²{pred} = MSE + s²{^Yh} = MSE(1 + Xh'(X'X)^-1Xh)
• One can use the Bonferroni approach too for (1) a confidence region and for (2) inference in predicting the mean of m new observations or predicting g new observations.
• Problem of Hidden Extrapolation in Estimating the Mean Response E{Yh}
Module 5 - GENERAL LINEAR MODEL
NEED FOR MODELS WITH MORE THAN ONE INDEPENDENT VAR
• Motivations for Multiple Regr Analysis: The 2 principal motivations for models with more than one IV are: (1) to make the predictions of the model more precise by adding other factors believed to affect the dependent var; in technical terms, the goal is to reduce SSE (or to increase the adjusted R-square). EX: in a model explaining prestige of current occupation as a func of years of education, add SES of family of origin and IQ (2) to support a causal theory by eliminating potential sources of spuriousness. Also called the elaboration model. EX: in a study of productivity of work crews, include both size of crew and level of bonus pay
• Supporting a Causal Theory by Eliminating Alternative Hypotheses
• In Constructing Social Theories, Arthur Stinchcombe writes "A causal law is a statement or proposition in a theory which says that there exist environments ... in which a change in the value of one var is associated with a change in the value of another var and can produce this change without any change in other variables in the environment" (p. 31). Such a causal theory can be represented schematically as
X (independent) --> Y (dependent)
• Stinchcombe argues further that to support or refute a causal theory we must: (1) observe diff values of X (i.e. causal analysis is inherently comparative, in that one must have data on several cases with different values of the IV; this emphasizes the inherent limitation of case studies in establishing general causal patterns) (2) observe covariation (i.e., an association between X and Y) (3) ascertain the direction of causality (4) ascertain nonspuriousness (i.e., whether one or more other variables affect both X and Y and thereby produce an apparent association between X and Y that is spuriously attributed to a causal influence of X on Y)
• Multiple regr analysis can be used to help achieve points 2 (measuring association) & 4 (ascertaining nonspuriousness). More advanced techniques (involving nonrecursive models) can help with point 3 (ascertaining the direction of causality).
• Ascertaining nonspuriousness is equivalent to eliminating alternative hypotheses on the source of the relationship between X & Y. In the context of regr analysis, spuriousness (= alternative explanations of the observed association between X & Y) is called specification bias. Specification bias is a more general and continuous notion than spuriousness. The idea is that if a regr model of Y on X excludes a var that is both associated w/ X & a cause of Y (the model is then called misspecified), the estimated association of Y w/ X will be inflated (or, conversely, deflated) relative to its true value. The regr estimator, in a sense, falsely "attributes" to X a causal influence that is in reality due to the omitted var(s). An example illustrating how multiple regr can be used to test for nonspuriousness: D-score by sex & scatter plot of D-score by age & sex.
• The Elaboration Model & the Standard Tabular Presentation of Regr Results
The table below shows the standard representation of the elaboration model using regr analysis.

"Unstandardized Regr Coeffs of Cognitive Development (D-score) on Sex & Age for 12 Children Aged 3 to 10 (t Ratios in Parentheses)"
Independent var          Model 1       Model 2
Constant                 10.305***     6.927***
                         (15.109)      (13.697)
Boy (boy=1, girl=0)      2.288*        -.126
                         (2.372)       (-.262)
Age (years)              --            .753***
                                       (7.775)
R-square                 .360          .917
Adjusted R-square        .296          .899
Note: *p < .05  **p < .01  ***p < .001 (2-tailed tests)

Note the following points:
• the title of the table contains information on the type of regr coeffs shown (here, unstandardized coeffs), the DV, the IVs (when there are too many to list in the title, say "on selected IVs"), the nature of the units of observation (children aged 3 to 10), & the sample size (12); when n is not the same in all the models (e.g. because of missing data), state the maximum n in the title & specify the actual sample sizes in a row labeled "N" below "Adjusted R-square"
• the IVs are introduced one at a time in successive models shown in the different columns of the table; variants of this strategy often introduce together sets of related vars, like (1) a set of indicators representing a categorical var (2) different powers of X representing a polynomial func (3) variables related substantively (EX: father's education, mother's education, and family income as indicators of family SES)

significance levels of the coeff's are indicated with asterisks. ASR convention is shown. A legend
at the bottom of the table indicates the meaning of the symbols & specifies the type of test (1- or
2-tailed). Both 2- & 1-tailed tests can be used in the same table by using a different symbol for 1tailed tests. EX: add a line at the bottom with: + p< .05 ++ p< .01 +++ p < .001 (1-tailed tests)
 both R-square & Adjusted R-square are shown; reviewers will often insist that you show the
adjusted R-square, even though N may be so large it makes no difference. (Never omit the
regular ("unadjusted") R-square, as this can be used to reconstruct F-tests from the table!)
 t-ratios(t-r) are shown in parentheses below the regr coeffs; it is better to present the t-r than the
standard error of estimate, as the t-r's are in the same metric (a Student t variate w/ n-p df) while
the standard errs are in the metric of the respective coeffs; therefore, in general, standard errs of
estimate cannot be compared & suffer different degrees of rounding err depending on the metric,
so that the t-r for some coeffs cannot be calculated from the table w/ acceptable precision.
 a place holder (--) is used to show that a var is not included in a model;
 the IVs are labeled with human readable text, not the computer symbol; use a name for the var
that is consistent with the numerical scale (e.g., SES must have values that are large for high SES
and small for low SES; values of "Democracy" must be high for democratic countries and low for
non-democratic ones); define indicator variables explicitly when any doubt is possible (e.g., an
indicator called SEX must specify whether MALE = 1 or FEMALE = 1; calling the indicator
MALE (with 1 for male and 0 for female) resolves this problem
GENERAL LINEAR MODEL: FIRST ORDER MODELS
• First Order Models: W/ p-1 IVs the general linear regr model can be written
Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi,p-1 + εi    i = 1,..., n
where there are p-1 independent variables Xi1 to Xi,p-1 plus the constant X0, for a total of p variables on the right hand side of the equation. Defining Y and ε as before, and β' = [β0 β1 ... βp-1] and
X = [ 1   X11   X12   ...   X1,p-1 ]
    [ 1   X21   X22   ...   X2,p-1 ]
    [ ... ...   ...   ...   ...    ]
    [ 1   Xn1   Xn2   ...   Xn,p-1 ]
• the regr model for the entire data set can be written Y(nx1) = X(nxp)β(px1) + ε(nx1)
• In the model (1) Y is a vector of responses (2) β is a vector of parameters (3) X is a matrix of constants (4) ε is a vector of independent normal random variables such that E{ε} = 0 and the variance-covariance matrix σ²{ε} = E{εε'} = σ²I.
• It follows that the random vector Y has expectation E{Y} = Xβ and the variance-covariance matrix of Y is the same as that of ε, so that σ²{Y} = σ²I
• The response func is E{Y} = Xβ, i.e. E{Y} = β0 + β1X1 + β2X2 + ... + βp-1Xp-1
• When the X's represent all different predictors the model is called the first order model w/ p-1 vars (the indexing of the vars starts at 0, w/ 0 indexing the intercept & p-1 indexing the last IV)
• The regr parameters are interpreted as follows: (1) β0 is the mean response when X1 = ... = Xp-1 = 0 (which is only interpretable when X1 = ... = Xp-1 = 0 is included in the scope of the model) (2) βk is the change in mean response E{Y} per unit increase in Xk when all the other predictors are held constant
• The βk's are sometimes called partial regr coeffs, but more often just regr coeffs, or unstandardized regr coeffs (to distinguish them from standardized coeffs). (Mathematically, βk corresponds to the partial derivative of the response func w/ respect to Xk: ∂E{Y}/∂Xk = βk.)
• Geometry of the First Order Multiple Regr Model
The response func (also called regr func or response surface) defines a hyperplane in p-dimensional space. When there are only 2 predictor vars (besides the constant) the response surface is a plane.
EX: first order model with 2 predictors with response func E{Y} = 10 + 2X1 + 5X2, where Y is test market sales in 10K dollars, X1 is point-of-sales expenditures in K dollars, & X2 is TV expenditures in K dollars. β1 = 2 means that, irrespective of the value of X2, increasing point-of-sales expenditures by 1K dollars increases sales by 20K dollars (2 × 10K). The parameter β2 is interpreted similarly. In the first order model the effect of a var does not depend on the values of the other vars. The effects are thus called additive or non-interactive. The response func is a plane. For example, if X2 = 2 it follows that E{Y} = 10 + 2X1 + 5(2) = 20 + 2X1, a straight line.
When there are more than 2 IVs (in addition to the constant) the regr func is a hyperplane & can no longer be visualized in 3-dimensional space. (There is an alternative geometry for multiple regr that represents the problem in n-dimensional space, where n is the number of observations. Then the vector Y of observations on the DV & each vector Xk of observations on an IV correspond to points in that n-dimensional space. In that representation, OLS estimates the perpendicular projection of the vector Y on the subspace "spanned" by the vectors Xk.)
• Standard Regr Output for Multiple Regr EX: The DV is change in pauperism (PAUP). The predictor of interest is the out-relief ratio (OUTRATIO). The control vars change in proportion of the old (PROPOLD) & change in population (POP) are introduced into the model in turn. Most elements of the multiple regr output are the same as in the simple regr; there are 2 additional elements:
1. Standardized Regr Coefficients: bk* is calculated as
bk* = bk(s(Xk)/s(Y))
Conversely: bk = bk*(s(Y)/s(Xk))
where s(Xk) & s(Y) denote the sample standard deviations of Xk & Y, respectively.
(1) The standardized coeff bk* measures the change in standard deviations of Y associated with an increase of one standard deviation of Xk. (2) Standardized coefficients permit comparisons of the relative effects of different IVs measured in different metrics (= units). (3) EX: In the full model with Yule's data, the standardized coeffs are OUTRATIO .584, PROPOLD .031, POP -.570. The coeff of OUTRATIO means that a change of one standard deviation unit in OUTRATIO is associated with a change of .584 standard deviations of PAUP. The other coefficients are interpreted similarly. The coefficients show that the effects of OUTRATIO & POP are strong & of comparable magnitude, although they are in opposite directions (.584 & -.570), & that the effect of PROPOLD is negligible (.031).
2. Tolerance or Variance Inflation Factor: The standard multiple regr output often provides a diagnostic measure of the collinearity of a predictor with the other predictors in the model, either the tolerance or the variance inflation factor.
• tolerance TOL = 1 - Rk², where Rk² is the R-square of the regr of Xk on the other p-2 predictors in the regr & a constant. A TOL value close to 1 means that Xk is not highly correlated w/ the other predictors in the model. A value close to 0 means that Xk is highly correlated w/ the other predictors; one then says that Xk is collinear w/ the other predictors.
• VIF = (TOL)^-1 = (1 - Rk²)^-1. The variance inflation factor (VIF) is the inverse of the TOL. Large values of VIF therefore indicate a high level of collinearity.
• A common rule of thumb is to take TOL < .1 or VIF > 10 as an indication that collinearity may unduly influence the results. EX: In the full model with Yule's data, the TOLs are OUTRATIO .985, PROPOLD .711, POP .719. The corresponding VIFs are 1.015, 1.406, and 1.391, respectively. These values do not indicate any problem with collinearity.
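A sketch of TOL and VIF computed by brute force in Python, following the definition above (each predictor is regressed on the other predictors and the constant; the design matrix passed in is hypothetical input):

```python
import numpy as np

def tol_and_vif(X):
    # X: nxp design matrix with the constant in column 0.
    # For each predictor k, regress X_k on the remaining columns and
    # compute R_k^2; TOL = 1 - R_k^2, VIF = 1/TOL.
    n, p = X.shape
    out = {}
    for k in range(1, p):
        others = np.delete(X, k, axis=1)
        xk = X[:, k]
        bhat = np.linalg.lstsq(others, xk, rcond=None)[0]
        resid = xk - others @ bhat
        r2_k = 1 - np.sum(resid ** 2) / np.sum((xk - xk.mean()) ** 2)
        tol = 1 - r2_k
        out[k] = (tol, 1 / tol)
    return out  # {column index: (TOL, VIF)}
```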
• Options of the General Linear Model: The general linear model need not contain only first powers of different predictors. The X's can also represent (1) different powers of a single var (polynomial regr) (2) interaction terms represented as the product of two or more vars (3) qualitative (categorical) vars represented by one or more indicators (vars w/ values 1 or 0, aka "dummy variables") (4) mathematical transformations of vars. (EX: use of polynomial expressions, categorical vars, & mathematical transformations of a var w/in the general linear model.)
Module 6 - POLYNOMIAL REGR & INTERACTIONS
POLYNOMIAL REGR WITH ONE PREDICTOR VAR
• A nonlinear relationship between Y & X can often be approximately represented w/in the general linear model as a polynomial func of X.
• EX: Yi = β0 + β1Xi + β2Xi² + εi may be represented as Yi = β0 + β1Xi1 + β2Xi2 + εi, where Xi1 = Xi and Xi2 = Xi²
• To estimate a polynomial func, X is often first deviated from its mean (or median) to reduce collinearity between X and higher powers of X. A var deviated from its mean is called centered. The transformation is x = X - X., where x (lower case) represents the centered var.
• A polynomial func can be used when (1) the true response func is polynomial (2) the true response func is unknown but a polynomial is a good approx of its shape
• Note the response func E{Y} for any polynomial model w/ one predictor var can be represented on a 2-D plot of Y against X. A 2nd degree polynomial implies a parabolic relationship. The signs of the coeffs determine the shape of the response func. When β2 is positive, Y increases as the values of X increase. When β2 is negative, Y eventually decreases as the values of X increase.
• EX: the Kuznets curve postulates an inverted U-shaped relationship between income inequality & economic development. This curvilinear relationship is often approximated w/ a 2nd degree polynomial (aka quadratic func). There, development of a country is measured as logged GDP per capita.
• Higher degree polynomials produce curves with more inflection points.
• When estimating a polynomial func, it is often useful to test for the joint significance of the coefficients of X, X², and higher powers of X, in addition to testing for the significance of each coeff separately. In a joint test of significance one tests H0: β1 = β2 = 0 against the alternative that at least one of the coeffs is not zero. We will see later how to do joint significance tests.
• NOTES: (1) when evaluating the shape of a polynomial response func, it is necessary to keep within the range of X in the data, as extrapolating beyond this range may lead to misleading predictions (2) it is possible to convert from the coefficients of the centered model (involving x) to the non-centered model involving the original X; however, the conversion is rarely needed for substantive purposes (3) fitting a polynomial regr with powers higher than three is rarely done, as the interpretation of the coefficients becomes difficult and interpolation tends to become erratic. (A polynomial of order n-1 can always be fitted exactly to n points.) (4) polynomial regr models are often fitted with the hierarchical approach, in which higher powers are introduced one at a time and tested for significance, and if a term of a high order is included (say, x³) then all terms of lower order (x and x²) are also included.
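A minimal Python sketch of fitting a centered quadratic (hypothetical, roughly parabolic data; b2 < 0 corresponds to an inverted-U shape as in the Kuznets example):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.0, 3.4, 4.1, 4.5, 4.4, 3.9, 3.0, 1.6])  # roughly parabolic

x = X - X.mean()                        # centering reduces collinearity of x and x^2
M = np.column_stack([np.ones_like(x), x, x ** 2])
b = np.linalg.solve(M.T @ M, M.T @ Y)   # [b0 b1 b2]; b2 < 0 -> inverted-U shape
print(b)
```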
POLYNOMIAL REGR WITH MORE THAN ONE PREDICTOR

A second-order model with two predictors has the general response func
E{Y} = 0 + 1x1 + 2x2 + 11x12 + 22x22 + 12x1x2
where x1 = X1 - X.1 x2 = X2 - X.2

The x vars are centered. The indexing of the coeffs reflects the composition of the corresponding
term. The response func is a quadratic func of x1 & x2. The product term x1x2 represents the
interaction of x1 & x2. The coeff β12 therefore represents the effect of the interaction of x1 & x2
on Y. The response func of a polynomial regr model of any order involving two predictors must
be represented in 3-dimensional space with axes Y, x1, and x2. The response func can then be
represented in perspective as a surface in 3-dimensional space or by contour curves (similar to
level curves in topographical maps). Each contour curve represents the combinations of x1 and x2
that yield the same value of the response Y.
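A minimal Python sketch (simulated data; names & coefficient values are assumptions) of
constructing the second-order design matrix w/ two centered predictors, their squares, & the
cross-product term:

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(10, 2, size=150)
X2 = rng.normal(5, 1, size=150)
x1, x2 = X1 - X1.mean(), X2 - X2.mean()   # centered predictors

# columns: 1, x1, x2, x1^2, x2^2, x1*x2  ->  coeffs b0, b1, b2, b11, b22, b12
D = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
Y = (2 + 1.0 * x1 - 0.5 * x2 + 0.3 * x1**2 + 0.2 * x2**2 + 0.4 * x1 * x2
     + rng.normal(0, 1, size=150))
b, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(dict(zip(["b0", "b1", "b2", "b11", "b22", "b12"], b.round(2))))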

Polynomial models involving more than 2 predictors are possible but the response func can no
longer be represented in 3-dimensional space.
INTERACTION REGR MODELS

A regr model with p-1 predictors is called additive if the response func can be written in the form
E{Y} = f1(X1) + f2(X2) + ... + fp-1(Xp-1) where f1, f2, ..., fp-1 can be any funcs. Models that are not
additive contain interaction effects. Interactions are commonly represented as cross-product terms
called interaction terms. The simplest interaction model has response func E{Y} = β0 + β1X1 + β2X2 + β3X1X2

One may view this model as a special case of the quadratic (second-order) model, without the square
terms. The meaning of the regr coeffs β1 & β2 is not the same as it is in a model without interaction.
In the interaction model, the change in E{Y} with a unit increase in X1 when X2 is held constant is
β1 + β3X2, and the change in E{Y} with a unit increase in X2 when X1 is held constant is β2 + β3X1.
Therefore, the effect of each of X1 & X2 depends on the level of the other var (so that the regr model
is no longer additive). NOTE: the effects of X1 & X2 are obtained by differentiating E{Y} with
respect to X1 and X2, respectively.
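A small Python sketch of the point just made (the coefficient values are hypothetical): the marginal
effect of X1 in the interaction model is β1 + β3X2, so it changes w/ the level of X2:

b1, b3 = 2.0, -0.5                  # hypothetical estimates of beta1 & beta3

def effect_of_X1(x2_level):
    # marginal effect of a unit increase in X1, holding X2 at x2_level
    return b1 + b3 * x2_level

for level in (0.0, 2.0, 4.0):
    print("X2 =", level, "-> effect of X1 =", effect_of_X1(level))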

QUALITATIVE PREDICTORS & INDICATOR VARIABLES  A qualitative var is represented in
the regr model by 0/1 indicator vars, one for each class except an omitted reference class. EX: Y
depends on Number of Households in the area (a continuous var) & on Location, a qualitative var
w/ 3 classes: Highway, Shopping Mall, & Street. Location is represented by 2 indicators: Xi2 = 1
if Shopping Mall (0 otherwise) & Xi3 = 1 if Street (0 otherwise). The 2 indicators are included in
the regr model

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi

where Xi1 is Number of Households in the area (a regular continuous var), and Xi2 and Xi3 are
the two location indicators. To understand the meaning of the coefficients one must examine the
regr func E{Y} = β0 + β1X1 + β2X2 + β3X3
E{Y} = 0 + 1X1 + 2X2 + 3X3 Regr func
There areLocation
E{Y} = 0 + 1X1 +  (0)+ 3(0) E{Y} = 0 + 1X1
three cases, Highway
depending Shopping Mall E{Y} = 0 + 1X1 + 2(1) + 3(0) E{Y} = ( 0 + 2) + 1X1
on location Street
E{Y} = 0 + 1X1 + 2(0) + 3(1) E{Y} = (0 + 3) + 1X1
The table shows that β2 & β3 represent the differences in intercept for the Shopping Mall and
Street locations, respectively, relative to Highway (the omitted category). This can be seen by
plotting the regr func for the 3 locations.
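A minimal pandas sketch (made-up data) of representing a 3-class qualitative var by k-1 = 2
indicators, w/ Highway as the omitted (reference) category as in the table above:

import pandas as pd

df = pd.DataFrame({
    "location": ["Highway", "Shopping Mall", "Street", "Highway", "Street"],
    "households": [120, 300, 210, 90, 180],
})
dummies = pd.get_dummies(df["location"])              # one 0/1 column per class
X = df[["households"]].join(dummies[["Shopping Mall", "Street"]])  # omit Highway
print(X)

Including all 3 location dummies together w/ the constant term would make the columns of the
design matrix linearly dependent, which is the singularity problem discussed next.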
A categorical var w/ k classes (categories) is represented by k-1 indicators, w/ 1 category omitted.
Using k indicators (together w/ a constant term) to represent a categorical var w/ k classes would
make the X'X matrix singular, so β cannot be estimated. In the following example, units are
insurance companies of 2 types, Stock or Mutual; w/ X1: size of firm, X2: Stock Firm, & X3:
Mutual Firm, only one of the two indicators X2 & X3 can be included in the model.
The following exhibits show another example of the use of indicator variables, using
the insurance innovation data. The DV Y is elapsed time for an innovation to be
adopted by insurance companies. There are 10 mutual firms and 10 stock firms. The
continuous IV X1 is size of firm (in millions of dollars).
The estimated regr func is (t-ratios in parentheses)

Ŷ = 33.87407 - .10174X1 + 8.05547X2
     (18.68)    (-11.44)    (5.52)

with n = 20; R² = 1,504.41/1,680.80 = .895
The separate regr funcs for Mutual & Stock firms can be plotted w/ the actual data points, w/ the
type of firm identified by a symbol. This shows that the indicator model assumes (or constrains)
the effect of firm size to be the same for both types of firm, so the regr lines are parallel.
INDICATOR MODELS WITH INTERACTIONS

The coeff of a continuous IV (e.g. size of firm) may be allowed to vary as a func of an indicator var
(e.g. firm type) by using an interaction term. In the insurance innovation example the response func
for the interaction model becomes E{Y} = β0 + β1X1 + β2X2 + β3X1X2 where, as before, X1 is Size
of Firm and X2 is Stock Firm (1 if stock company, 0 otherwise). Again, to understand the meaning
of the coeffs one must examine the regr (response) func for each category of the qualitative var.
Type of Firm         Regr func E{Y} = β0 + β1X1 + β2X2 + β3X1X2
Mutual Firm (X2=0)   E{Y} = β0 + β1X1 + β2(0) + β3X1(0) = β0 + β1X1
Stock Firm (X2=1)    E{Y} = β0 + β1X1 + β2(1) + β3X1(1) = (β0 + β2) + (β1 + β3)X1

Thus in the interaction model both the intercept & the slope of the continuous var X1 differ across
types of firm. This can be seen by plotting the regr funcs for the two firm types. W/ a significant
interaction term, the two regr lines might look as follows

For these data, however, it is found that the estimate of β3 is non-sig (t* = -.02). We might have
expected the absence of a sig interaction from the appearance of the symbolic plot in NKNW
Figure 11.2, p. 460, as the trends for the two firm types seem roughly parallel. More complex
indicator models involving qualitative vars w/ more than 2 classes, more than 1 qualitative var,
w/ or w/out interaction terms are possible. They are interpreted in essentially the same way.
COMPARISON OF TWO OR MORE REGR FUNCTIONS  A very common research strategy is
to study the similarities & differences between regr models for 2 or more populations.

EX: a comparison of two production lines for making soap bars. For each production line, the
relation of interest is that between the amount of scrap for the day (the dependent var) and the
speed of the production line. A symbolic scatter plot of Amount of Scrap against Line Speed
suggests that the regr relation is not the same for the two production lines (next exhibit).

When it is reasonable to assume that the error term variances in the regr models for the different
populations are equal, one can use indicator vars to test the equality of the diff regr funcs. (When
the variances are not equal, a suitable transformation of Y can equalize them approximately.
NKNW p. 472) This is done by considering the diff populations as classes of a predictor var,
defining indicator vars for the diff populations, & estimating a single regr model containing
appropriate interaction terms, similar to the insurance innovation interaction model above.

In the soap production data define an interaction model w/ regr func E{Y} = β0 + β1X1 + β2X2 + β3X1X2
where Y = Amount of Scrap, X1 = Line Speed, & X2 is Line 1 (= 1 if production line 1, 0 if
production line 2). The estimated regr model is (t-ratios in parentheses)


Ŷ = 7.57 + 1.322X1 + 90.39X2 - .1767X1X2
    (.36)   (14.27)    (3.19)     (-1.37)
so one concludes that the slope of the relationship between Amount of Scrap & Speed does not
differ across production lines (b3 is non-sig w/ t* = -1.37) but that the intercepts are significantly
different, so that Amount of Scrap is overall higher in Line 1 than in Line 2 (b2 is sig w/ t* = 3.19).
(One can formally test the equality of the regr lines with the joint test of H0: β2 = β3 = 0, as
sketched below.)
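A sketch of this joint test in Python w/ statsmodels (simulated data standing in for the soap
production data; the variable names & coefficient values are assumptions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 30
speed = rng.uniform(100, 300, size=n)
line1 = (np.arange(n) < n // 2).astype(float)   # X2: 1 for line 1, 0 for line 2
scrap = 10 + 1.3 * speed + 90 * line1 + rng.normal(0, 20, size=n)

df = pd.DataFrame({"scrap": scrap, "speed": speed, "line1": line1,
                   "speed_line1": speed * line1})
fit = smf.ols("scrap ~ speed + line1 + speed_line1", data=df).fit()
# joint F-test that intercept & slope are the same for the two lines
print(fit.f_test("line1 = 0, speed_line1 = 0"))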
PIECEWISE LINEAR REGR & DISCONTINUITIES  Indicator vars can be used to model
situations in which the slope of the regr of Y on X differs for two ranges of X

Assuming that Xp (the point where the slopes change) is known, the piecewise linear relation
is represented by a regr model with response func E{Y} = β0 + β1X1 + β2(X1 - 500)X2 where
Y = Unit Cost, X1 = Lot Size, X2 = 1 if X1 > Xp = 500 (0 otherwise), & Xp = 500. One can
convince oneself that this func represents the piecewise linear relation by examining the response
func separately for the range X1 <= 500 & the range X1 > 500, as shown in the next exhibit

The model can be estimated from the data. Estimated regr func: Ŷ = 5.89545 - .00395X1 - .00389(X1 - 500)X2

Piecewise linear modelling can be easily extended to more than 2 pieces. See NKNW p. 477.
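A minimal numpy sketch (simulated data; the true coefficient values are assumptions) of estimating
the piecewise linear model w/ known change point Xp = 500:

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.uniform(100, 900, size=120)       # e.g., lot size
X2 = (X1 > 500).astype(float)              # indicator: 1 if X1 > Xp
hinge = (X1 - 500) * X2                    # 0 below Xp, X1 - 500 above
Y = 6 - 0.004 * X1 - 0.004 * hinge + rng.normal(0, 0.1, size=120)

D = np.column_stack([np.ones_like(X1), X1, hinge])
b, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(b)   # slope below Xp is b[1]; slope above Xp is b[1] + b[2]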

When Xp is not known, one may (1) attempt to guess its position from the scatterplot (2) use nonlinear regr techniques to estimate Xp together with the other parameters of the model iteratively.

One can also use indicators to model discontinuous piecewise linear regr relations (NKNW p. 477)
INDICATORS VERSUS QUANTITATIVE VARIABLES
Indicators Versus Allocated Codes

Qualitative variables with ordinal categories can often be represented either by allocated codes or
by indicators. For example, persons of Hispanic origin might be asked the question "How often
do you use Spanish at home?" with response categories Frequently, Occasionally, Never. The
var may be coded with allocated codes in var X1, or with two indicators X2 and X3:
Alternative codings of frequency of Spanish use at home
(X1: allocated codes; X2 (FrequentUser) & X3 (OccasionalUser): indicators):

Class          X1   X2   X3
Never           1    0    0
Occasionally    2    0    1
Frequently      3    1    0

Using allocated codes X1 in the model with response func E{Y} = β0 + β1X1 constrains
differences in response func among classes to be the same:

Never: E{Y}=0 + 1. 1 Occasionally: E{Y} = 0 + 2. 1 Frequently: E{Y} = 0 + 3. 1

Therefore , E{Y|Frequently}-E{Y|Occasionally}=E{Y|Occasionally}-E{Y|Never} = 1

assumption of constant diff in effect among contiguous classes may not be substantively plausible.

Using indicators X2 & X3 in the model w/ response func E{Y} = β0 + β2X2 + β3X3 allows differences
in response funcs among classes to be different: (1) E{Y|Frequently} - E{Y|Never} = β2
(2) E{Y|Occasionally} - E{Y|Never} = β3 (3) E{Y|Frequently} - E{Y|Occasionally} = (β2 - β3)

Thus in the indicators model the effects of the classes are not arbitrarily restricted.
Indicators Versus Quantitative Vars: For the same reasons, it may be useful to use a set of indicators
rather than continuous values even when a var is inherently quantitative. EX: to study the
relationship of earnings with age, age in years is represented by a set of indicators corresponding
to 5-year categories 20-24, 25-29, 30-34, etc. Using indicators may allow for a better "tracking"
of the non-linear relationship between Y (earnings) and X (age). The disadvantage of indicators
is that more degrees of freedom are consumed (k-1 indicators versus 1 continuous var), but this is
not a problem with large data sets.
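A short pandas sketch (made-up ages) of converting a quantitative var into 5-year category
indicators:

import pandas as pd

df = pd.DataFrame({"age": [22, 27, 31, 24, 33, 29]})
bins = [20, 25, 30, 35]                    # bins 20-24, 25-29, 30-34
df["age_cat"] = pd.cut(df["age"], bins=bins, right=False,
                       labels=["20-24", "25-29", "30-34"])
indicators = pd.get_dummies(df["age_cat"], drop_first=True)   # k-1 dummies
print(indicators)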
Alternative Coding for Indicators: Alternatives to the coding of a qualitative var with k-1 (0,1)
indicators and a constant term are (1) k-1 indicators coded (0,1,-1) and a constant term. This is
the same as (0,1) coding except that observations corresponding to the reference class are coded
-1 for all the indicators. One can show that the intercept then represents an "average" intercept for
the classes. This type of coding is important in ANOVA. (2) k (0,1) indicators with no constant
term. The coefficients of the indicators are then interpreted as class-specific intercepts.
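The three codings can be written out explicitly; a small Python sketch (class labels follow the
Spanish-use example above, w/ Never as the reference class and columns ordered [X2, X3]):

# (a) k-1 (0,1) indicators + constant: reference class = all zeros
dummy     = {"Never": [0, 0], "Occasionally": [0, 1], "Frequently": [1, 0]}
# (b) k-1 (0,1,-1) indicators + constant: reference class = all -1
effect    = {"Never": [-1, -1], "Occasionally": [0, 1], "Frequently": [1, 0]}
# (c) k (0,1) indicators, no constant: class-specific intercepts
cellmeans = {"Never": [1, 0, 0], "Occasionally": [0, 1, 0], "Frequently": [0, 0, 1]}

for c in ("Never", "Occasionally", "Frequently"):
    print(c, dummy[c], effect[c], cellmeans[c])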

NKNW pp. 481-2 & Sec. 16.11 pp. 696-701 discuss the relationship between regr & ANOVA.