General Hints for Exam 2
36-402, Advanced Data Analysis
14 April 2013
Relationships Between Variables
As we have seen all semester, we use regression or conditional distribution
models to describe relationships between variables. The conditional
distribution p(Y|X), which tells us about the relationship, can stay the same
even when the marginal distributions p(Y) and p(X) change a lot. The same goes
for regression models, which capture the mean of the conditional distribution.
When the target variable Y is binary, there is really little difference between
regressions and conditional distributions, and either can be used.
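As a minimal illustration (using hypothetical names y, x, and df, not exam variables), a logistic regression fit with glm gives fitted probabilities that serve as both the regression function and the estimated conditional distribution of Y:
## Sketch: for binary Y, the fitted regression is the conditional distribution.
## "df", "y", and "x" are hypothetical names.
fit <- glm(y ~ x, data = df, family = "binomial")
p.hat <- predict(fit, type = "response")  # estimates of p(Y = 1 | X = x),
                                          # which is also E[Y | X = x]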
Conditional Independence
Conditional independence of groups of variables Remember the definition
of conditional independence: X is independent of Z given Y, written X ⊥ Z | Y,
when, for all particular values x, y, z,
p(X = x, Y = y, Z = z) = p(Y = y) p(X = x | Y = y) p(Z = z | Y = y)
Equivalently, either
p(X = x | Y = y, Z = z) = p(X = x | Y = y)
or
p(Z = z | Y = y, X = x) = p(Z = z | Y = y)
These equations are why one sometimes says that Y “screens off” X from Z.
These same definitions carry over unchanged for groups of variables. To be
very explicit, suppose that X is really (X1, X2, X3), that Z is really (Z1, Z2), and
that Y is (Y1, Y2, Y3, Y4). Then (X1, X2, X3) ⊥ (Z1, Z2) | (Y1, Y2, Y3, Y4) means
that
p(X1 = x1, X2 = x2, X3 = x3, Z1 = z1, Z2 = z2, Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  = p(Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  × p(X1 = x1, X2 = x2, X3 = x3 | Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  × p(Z1 = z1, Z2 = z2 | Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
Checking for conditional independence: predictions Because X ⊥ Z | Y
exactly when p(Z = z | Y = y, X = x) = p(Z = z | Y = y), one way to see whether X and
Z are independent given Y is to check whether there is any difference between
predicting Z from Y alone and predicting Z from X and Y. Under conditional
independence, the two predictions should match exactly with unlimited data.
With limited data, one needs to take account of estimation errors.
If a parametric model is appropriate, one could check by estimating the
model to predict Z from Y alone, and then estimating again with both Y and
X, and seeing that the coefficient on Y hadn’t changed while that on X was
zero. Of course, you then need to have a good explanation of why this particular
model is appropriate. (Z might still be dependent on X given Y , but in ways
the model is blind to.)
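As a concrete sketch of this check (hypothetical names: a binary response z, predictors y and x, in a data frame df; the logistic form is an assumed convenience, not something the problem dictates):
## Sketch of the parametric check: does adding X change the prediction of Z?
## "df", "z", "y", and "x" are hypothetical names.
small.fit <- glm(z ~ y, data = df, family = "binomial")     # Z from Y alone
big.fit <- glm(z ~ y + x, data = df, family = "binomial")   # Z from Y and X
## Under conditional independence of Z and X given Y (and a correct model),
## the coefficient on y should barely change between the two fits, and the
## coefficient on x should be close to zero relative to its standard error.
summary(small.fit)
summary(big.fit)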
Whether we use parametric or nonparametric predictors for Z, there is no
special difficulty here if Y or X are multivariate. If Z is multivariate, one needs
either to fit separate models for each coordinate of Z, to do some sort of dimension
reduction on Z, or to do a multivariate prediction with multiple dependent variables.
Checking for conditional independence: bandwidths As described in
the notes (§15.5), if one uses kernels to estimate the conditional distribution
p(Z|Y, X) and Z ⊥ X | Y, then the bandwidth on X will tend towards the largest
possible value, since X is irrelevant and should be smoothed away totally. That
maximum is ∞ for continuous X, and 1 for discrete X (see §15.4.3 for what
kernels look like for discrete data).
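A minimal sketch of this bandwidth check with the np package (hypothetical discrete variables z, y, x in a data frame df; npcdensbw is the function that selects the bandwidths for a conditional density):
library(np)
## Sketch: bandwidth check for whether Z is independent of X given Y.
## "df", "z", "y", and "x" are hypothetical; wrap discrete variables in factor().
bw <- npcdensbw(factor(z) ~ factor(y) + factor(x), data = df)
## If Z is conditionally independent of X given Y, the explanatory-variable
## bandwidth reported for factor(x) should be pushed up toward its maximum,
## which is 1 for a discrete variable.
summary(bw)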
Checking for conditional independence: factorization Since X ⊥ Z | Y
when p(X, Y, Z) = p(Y)p(X|Y)p(Z|Y), one way to test independence is to estimate p(Y), p(X|Y) and p(Z|Y), and check that their product is close to the
empirical joint probability p(X, Y, Z). You can in fact turn this into a χ2 test, for
comparing expected counts or probabilities to observed counts or probabilities.
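For all-discrete variables, a minimal sketch of the factorization check (x, y, z and df are hypothetical column and data-frame names; the chi-squared statistic is computed by hand rather than with a canned test):
## Sketch: compare observed counts with counts expected under
## p(x, y, z) = p(y) p(x|y) p(z|y).  "df" and its columns are hypothetical.
obs <- table(df$x, df$y, df$z)                      # observed counts
n <- sum(obs)
p.y <- prop.table(table(df$y))                      # estimate of p(y)
p.xy <- prop.table(table(df$x, df$y), margin = 2)   # estimate of p(x|y)
p.zy <- prop.table(table(df$z, df$y), margin = 2)   # estimate of p(z|y)
expected <- array(0, dim = dim(obs))
for (i in seq_len(dim(obs)[1])) {
  for (j in seq_len(dim(obs)[2])) {
    for (k in seq_len(dim(obs)[3])) {
      expected[i, j, k] <- n * p.y[j] * p.xy[i, j] * p.zy[k, j]
    }
  }
}
## Chi-squared statistic comparing observed to expected counts
chisq <- sum((obs - expected)^2 / expected)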
Dimension Reduction/Summarization
Summary Variables as Input and Output Variables in Models Summary variables from different sorts of dimension reduction (PCA, factor analysis,
mixtures, others we haven’t covered in this class...) can always be used as input
variables in regressions (or conditional densities, etc.) to predict something else.
They may not always be useful, but they can certainly always be used.
Summary variables can also be made the targets of predictions, i.e., be put on
the left-hand-side of a model formula. This is like estimating a factor score from
the observables. There can be some subtle issues in measuring how accurately
the original variables could be reconstructed from predictions of summaries,[1]
but those should not arise here.
[1] Because errors in recovering the original variables from the summary might be correlated
with errors in predicting the summary itself.
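To make this concrete (a minimal sketch with a hypothetical data frame df containing numeric columns x1 through x5 and a binary column y; none of these are exam variables), a principal-component score can sit on either side of a model formula:
## Sketch: a summary variable (first PC score) used as input and as target.
## "df" and its columns are hypothetical.
pc <- prcomp(df[, c("x1", "x2", "x3", "x4", "x5")], scale. = TRUE)
df$pc1 <- pc$x[, 1]                             # scores on the first component
in.model <- glm(y ~ pc1, data = df, family = "binomial")   # summary as input
out.model <- lm(pc1 ~ x1 + x2, data = df)                  # summary as target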
Using multmixEM The multmixEM function in the mixtools package fits mixture models of categorical data. It is somewhat finicky in that it requires its
first argument to be a matrix, not a data frame:
history.mix <- multmixEM(as.matrix(paristan[,2:6]),k=2)
will learn a two-component mixture model for the historical variables. The
share of each data point attributed to each mixture component is stored in the
posterior attribute of the output:
> head(history.mix$posterior)
        comp.1     comp.2
[1,] 0.9620296 0.03797044
[2,] 0.9658014 0.03419860
[3,] 0.9658014 0.03419860
[4,] 0.9339390 0.06606105
[5,] 0.9339390 0.06606105
[6,] 0.9339390 0.06606105
This posterior attribute is also part of the output of normalmixEM, and was
used in the notes (§19.4.5).
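If you want a single “best guess” component for each observation, one option (a sketch, not something the exam requires) is to take, in each row, the column with the largest posterior weight:
## Hard assignment of each observation to its most probable mixture component
hard.assignments <- apply(history.mix$posterior, 1, which.max)
table(hard.assignments)   # how many observations land in each component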
PCA and factor analysis for discrete variables Factor analysis (in the
form we’ve learned it) is somewhat dubious when the observed variables are
categorical. The factor model presumes that what we observe is a linear combination of the factors plus uncorrelated noise. Even if we agree to always label
one categorical response “0” and the other “1”, it’s hard to reconcile always
getting either 0 or 1 with having uncorrelated noise. There are models where
the probability of a 1 for each observable variable is a function of a continuous latent variable (“item response models” in psychometrics), and you’d be
welcome to use them here, but they go beyond what you’re expected to know.
Because PCA does not make any probabilistic assumptions, but just tries
to find the best linear reduction of the variables, it isn’t subject to all the
same issues as factor models when used with categorical observations. However,
PCA interprets “best” to mean “minimum mean-squared error”, which may or
may not be appropriate for categorical variables. In particular, even for binary
categories, the exact rotation and scores one gets back from PCA will depend
on the numerical codes assigned to the categories.
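A tiny illustration of this coding dependence, using simulated binary data purely for demonstration (nothing here is exam-specific): recoding one variable changes its variance, and hence the principal components.
## Sketch: PCA results depend on the numeric codes given to the categories.
## "a" and "b" are simulated binary variables, used only for illustration.
set.seed(1)
a <- runif(200) < 0.5
b <- xor(a, runif(200) < 0.3)
coding1 <- cbind(a1 = as.numeric(a), b1 = as.numeric(b))      # 0/1 coding
coding2 <- cbind(a2 = 2 * as.numeric(a), b2 = as.numeric(b))  # recode a as 0/2
prcomp(coding1)$rotation
prcomp(coding2)$rotation   # typically a different rotation, not just a sign change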
Comparing Samples with Different Numbers of
Observations
The two “waves” of the survey, in 1998 and 2003, did not survey the same
people. In fact, they didn’t even survey the same number of people. It thus
makes no sense to compare individuals’ answers to any given question. What
does make sense is to compare the proportion of people giving various answers
in the two waves. In other words, one can compare the distribution of answers
across waves.
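One minimal way to do this in R (a sketch; it assumes both waves are stacked in one data frame with a wave indicator, and the column names wave and question are hypothetical stand-ins for whatever the survey actually records):
## Sketch: compare the distribution of answers to one question across waves.
## "wave" and "question" are hypothetical column names in paristan.
counts <- table(paristan$wave, paristan$question)
prop.table(counts, margin = 1)   # proportion giving each answer, within each wave
chisq.test(counts)               # test whether the two distributions differ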
Predicting Multiple Variables from the Same Inputs
Joint Conditional Distributions The npcdens function in the np package
is perfectly happy to take multiple variables on the left-hand side of its formula:
> npcdens(factor(postsocialist)+factor(socialist)~factor(feudal),
+ data=paristan,tol=0.1,ftol=0.1)
Conditional Density Data: 1283 training points, in 3 variable(s)
(2 dependent variable(s), and 1 explanatory variable(s))
                        factor(postsocialist) factor(socialist)
Dep. Var. Bandwidth(s):           0.007093843       0.005218663
                        factor(feudal)
Exp. Var. Bandwidth(s):   0.0004363277
(plus more output I have suppressed). This is one straightforward way of having
multiple variables predicted from the same input, if perhaps computationally
time-consuming.
Automatic formula generation An alternative is to do separate models
for each response variable, but build the formulas automatically, repeating the
common parts, rather than writing everything out by hand over and over. An
example:
targets <- c("postsocialist", "socialist")
formulas <- paste("factor(", targets, ") ~ factor(feudal)", sep = "")
models <- list()
for (form in formulas) {
  fit <- glm(formula = as.formula(form), data = paristan, family = "binomial")
  models <- c(models, list(fit))   # To adjoin to a list, see help(c)
}
names(models) <- targets
Now the individual models can be accessed either by numerical order (e.g.,
models[[1]]) or by the name of their target variables (e.g., models$postsocialist).
(Exercise: replace the for loop here with a call to lapply.)