General Hints for Exam 2
36-402, Advanced Data Analysis
14 April 2013
Relationships Between Variables
As we have seen all semester, we use regression or conditional distribution
models to describe relationships between variables. The conditional
distribution p(Y|X), which tells us about the relationship, can stay the same
even when the marginal distributions p(Y) and p(X) change a lot. The same goes
for regression models, which capture the mean of the conditional distribution.
When the target variable Y is binary, there is really little difference between
regressions and conditional distributions, and either can be used.
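As a minimal illustration (using hypothetical names y, x, and df, not exam variables), a logistic regression fit with glm gives fitted probabilities that serve as both the regression function and the estimated conditional distribution of Y:
## Sketch: for binary Y, the fitted regression is the conditional distribution.
## "df", "y", and "x" are hypothetical names.
fit <- glm(y ~ x, data = df, family = "binomial")
p.hat <- predict(fit, type = "response")  # estimates of p(Y = 1 | X = x),
                                          # which is also E[Y | X = x]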
Conditional Independence
Conditional independence of groups of variables Remember the definition
of conditional independence: X is independent of Z given Y, written X ⊥ Z | Y,
when, for all particular values x, y, z,
p(X = x, Y = y, Z = z) = p(Y = y) p(X = x | Y = y) p(Z = z | Y = y)
Equivalently, either
p(X = x | Y = y, Z = z) = p(X = x | Y = y)
or
p(Z = z | Y = y, X = x) = p(Z = z | Y = y)
These equations are why one sometimes says that Y “screens off” X from Z.
These same definitions carry over unchanged for groups of variables. To be
very explicit, suppose that X is really (X1, X2, X3), that Z is really (Z1, Z2), and
that Y is (Y1, Y2, Y3, Y4). Then (X1, X2, X3) ⊥ (Z1, Z2) | (Y1, Y2, Y3, Y4) means
that
p(X1 = x1, X2 = x2, X3 = x3, Z1 = z1, Z2 = z2, Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  = p(Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  × p(X1 = x1, X2 = x2, X3 = x3 | Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
  × p(Z1 = z1, Z2 = z2 | Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4)
Checking for conditional independence: predictions Because X ⊥ Z | Y
exactly when p(Z = z | Y = y, X = x) = p(Z = z | Y = y), one way to see whether X and
Z are independent given Y is to check whether there is any difference between
predicting Z from Y alone and predicting Z from X and Y. Under conditional
independence, the two predictions should match exactly with unlimited data.
With limited data, one needs to take account of estimation errors.
If a parametric model is appropriate, one could check by estimating the
model to predict Z from Y alone, and then estimating again with both Y and
X, and seeing that the coefficient on Y hadn’t changed while that on X was
zero. Of course, you then need to have a good explanation of why this particular
model is appropriate. (Z might still be dependent on X given Y , but in ways
the model is blind to.)
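As a concrete sketch of this check (hypothetical names: a binary response z, predictors y and x, in a data frame df; the logistic form is an assumed convenience, not something the problem dictates):
## Sketch of the parametric check: does adding X change the prediction of Z?
## "df", "z", "y", and "x" are hypothetical names.
small.fit <- glm(z ~ y, data = df, family = "binomial")     # Z from Y alone
big.fit <- glm(z ~ y + x, data = df, family = "binomial")   # Z from Y and X
## Under conditional independence of Z and X given Y (and a correct model),
## the coefficient on y should barely change between the two fits, and the
## coefficient on x should be close to zero relative to its standard error.
summary(small.fit)
summary(big.fit)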
Whether we use parametric or nonparametric predictors for Z, there is no
special difficulty here if Y or X are multivariate. If Z is multivariate, one needs
either to fit separate models for each coordinate of Z, to do some sort of dimension
reduction on Z, or to do a multivariate prediction with multiple dependent variables.
Checking for conditional independence: bandwidths As described in
the notes (§15.5), if one uses kernels to estimate the conditional distribution
p(Z|Y, X) and Z ⊥ X | Y, then the bandwidth on X will tend towards the largest
possible value, since X is irrelevant and should be smoothed away totally. That
maximum is ∞ for continuous X, and 1 for discrete X (see §15.4.3 for what
kernels look like for discrete data).
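A minimal sketch of this bandwidth check with the np package (hypothetical discrete variables z, y, x in a data frame df; npcdensbw is the function that selects the bandwidths for a conditional density):
library(np)
## Sketch: bandwidth check for whether Z is independent of X given Y.
## "df", "z", "y", and "x" are hypothetical; wrap discrete variables in factor().
bw <- npcdensbw(factor(z) ~ factor(y) + factor(x), data = df)
## If Z is conditionally independent of X given Y, the explanatory-variable
## bandwidth reported for factor(x) should be pushed up toward its maximum,
## which is 1 for a discrete variable.
summary(bw)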
Checking for conditional independence: factorization Since X ⊥ Z | Y
when p(X, Y, Z) = p(Y)p(X|Y)p(Z|Y), one way to test independence is to estimate p(Y), p(X|Y) and p(Z|Y), and check that their product is close to the
empirical joint probability p(X, Y, Z). You can in fact turn this into a χ2 test, for
comparing expected counts or probabilities to observed counts or probabilities.
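For all-discrete variables, a minimal sketch of the factorization check (x, y, z and df are hypothetical column and data-frame names; the chi-squared statistic is computed by hand rather than with a canned test):
## Sketch: compare observed counts with counts expected under
## p(x, y, z) = p(y) p(x|y) p(z|y).  "df" and its columns are hypothetical.
obs <- table(df$x, df$y, df$z)                      # observed counts
n <- sum(obs)
p.y <- prop.table(table(df$y))                      # estimate of p(y)
p.xy <- prop.table(table(df$x, df$y), margin = 2)   # estimate of p(x|y)
p.zy <- prop.table(table(df$z, df$y), margin = 2)   # estimate of p(z|y)
expected <- array(0, dim = dim(obs))
for (i in seq_len(dim(obs)[1])) {
  for (j in seq_len(dim(obs)[2])) {
    for (k in seq_len(dim(obs)[3])) {
      expected[i, j, k] <- n * p.y[j] * p.xy[i, j] * p.zy[k, j]
    }
  }
}
## Chi-squared statistic comparing observed to expected counts
chisq <- sum((obs - expected)^2 / expected)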
Dimension Reduction/Summarization
Summary Variables as Input and Output Variables in Models Summary variables from different sorts of dimension reduction (PCA, factor analysis,
mixtures, others we haven’t covered in this class...) can always be used as input
variables in regressions (or conditional densities, etc.) to predict something else.
They may not always be useful, but they can certainly always be used.
Summary variables can also be made the targets of predictions, i.e., be put on
the left-hand-side of a model formula. This is like estimating a factor score from
the observables. There can be some subtle issues in measuring how accurately
the original variables could be reconstructed from predictions of summaries,[1]
but those should not arise here.
[1] Because errors in recovering the original variables from the summary might be correlated
with errors in predicting the summary itself.
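To make this concrete (a minimal sketch with a hypothetical data frame df containing numeric columns x1 through x5 and a binary column y; none of these are exam variables), a principal-component score can sit on either side of a model formula:
## Sketch: a summary variable (first PC score) used as input and as target.
## "df" and its columns are hypothetical.
pc <- prcomp(df[, c("x1", "x2", "x3", "x4", "x5")], scale. = TRUE)
df$pc1 <- pc$x[, 1]                             # scores on the first component
in.model <- glm(y ~ pc1, data = df, family = "binomial")   # summary as input
out.model <- lm(pc1 ~ x1 + x2, data = df)                  # summary as target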
Using multmixEM The multmixEM function in the mixtools package fits mixture models of categorical data. It is somewhat finicky in that it requires its
first argument to be a matrix, not a data frame:
history.mix <- multmixEM(as.matrix(paristan[,2:6]),k=2)
will learn a two-component mixture model for the historical variables. The
share of each data point attributed to each mixture component is stored in the
posterior attribute of the output:
> head(history.mix$posterior)
        comp.1     comp.2
[1,] 0.9620296 0.03797044
[2,] 0.9658014 0.03419860
[3,] 0.9658014 0.03419860
[4,] 0.9339390 0.06606105
[5,] 0.9339390 0.06606105
[6,] 0.9339390 0.06606105
This posterior attribute is also part of the output of normalmixEM, and was
used in the notes (§19.4.5).
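If you want a single “best guess” component for each observation, one option (a sketch, not something the exam requires) is to take, in each row, the column with the largest posterior weight:
## Hard assignment of each observation to its most probable mixture component
hard.assignments <- apply(history.mix$posterior, 1, which.max)
table(hard.assignments)   # how many observations land in each component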
PCA and factor analysis for discrete variables Factor analysis (in the
form we’ve learned it) is somewhat dubious when the observed variables are
categorical. The factor model presumes that what we observe is a linear combination of the factors plus uncorrelated noise. Even if we agree to always label
one categorical response “0” and the other “1”, it’s hard to reconcile always
getting either 0 or 1 with having uncorrelated noise. There are models where
the probability of a 1 for each observable variable is a function of a continuous latent variable (“item response models” in psychometrics), and you’d be
welcome to use them here, but they go beyond what you’re expected to know.
Because PCA does not make any probabilistic assumptions, but just tries
to find the best linear reduction of the variables, it isn’t subject to all the
same issues as factor models when used with categorical observations. However,
PCA interprets “best” to mean “minimum mean-squared error”, which may or
may not be appropriate for categorical variables. In particular, even for binary
categories, the exact rotation and scores one gets back from PCA will depend
on the numerical codes assigned to the categories.
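A tiny illustration of this coding dependence, using simulated binary data purely for demonstration (nothing here is exam-specific): recoding one variable changes its variance, and hence the principal components.
## Sketch: PCA results depend on the numeric codes given to the categories.
## "a" and "b" are simulated binary variables, used only for illustration.
set.seed(1)
a <- runif(200) < 0.5
b <- xor(a, runif(200) < 0.3)
coding1 <- cbind(a1 = as.numeric(a), b1 = as.numeric(b))      # 0/1 coding
coding2 <- cbind(a2 = 2 * as.numeric(a), b2 = as.numeric(b))  # recode a as 0/2
prcomp(coding1)$rotation
prcomp(coding2)$rotation   # typically a different rotation, not just a sign change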
Comparing Samples with Different Numbers of
Observations
The two “waves” of the survey, in 1998 and 2003, did not survey the same
people. In fact, they didn’t even survey the same number of people. It thus
makes no sense to compare individuals’ answers to any given question. What
does make sense is to compare the proportion of people giving various answers
in the two waves. In other words, one can compare the distribution of answers
across waves.
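One minimal way to do this in R (a sketch; it assumes both waves are stacked in one data frame with a wave indicator, and the column names wave and question are hypothetical stand-ins for whatever the survey actually records):
## Sketch: compare the distribution of answers to one question across waves.
## "wave" and "question" are hypothetical column names in paristan.
counts <- table(paristan$wave, paristan$question)
prop.table(counts, margin = 1)   # proportion giving each answer, within each wave
chisq.test(counts)               # test whether the two distributions differ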
Predicting Multiple Variables from the Same Inputs
Joint Conditional Distributions The npcdens function in the np package
is perfectly happy to take multiple variables on the left-hand side of its formula:
> npcdens(factor(postsocialist)+factor(socialist)~factor(feudal),
+ data=paristan,tol=0.1,ftol=0.1)
Conditional Density Data: 1283 training points, in 3 variable(s)
(2 dependent variable(s), and 1 explanatory variable(s))
                        factor(postsocialist) factor(socialist)
Dep. Var. Bandwidth(s):           0.007093843       0.005218663
                        factor(feudal)
Exp. Var. Bandwidth(s):   0.0004363277
(plus more output I have suppressed). This is one straightforward way of having
multiple variables predicted from the same input, if perhaps computationally
time-consuming.
Automatic formula generation An alternative is to do separate models
for each response variable, but build the formulas automatically, repeating the
common parts, rather than writing everything out by hand over and over. An
example:
targets <- c("postsocialist", "socialist")
formulas <- paste("factor(", targets, ") ~ factor(feudal)", sep = "")
models <- list()
for (form in formulas) {
  fit <- glm(formula = as.formula(form), data = paristan, family = "binomial")
  models <- c(models, list(fit))   # To adjoin to a list, see help(c)
}
names(models) <- targets
Now the individual models can be accessed either by numerical order (e.g.,
models[[1]]) or by the name of their target variables (e.g., models$postsocialist).
(Exercise: replace the for loop here with a call to lapply.)