STANFORD UNIVERSITY
CS 229, Spring 2020
Midterm Examination
Question                                       Points
1  Multiple Choice                               /12
2  MLE for the Laplace Distribution              /22
3  Moment Matching in Exponential Families       /26
4  EM for Mixture of Poissons                    /18
5  Batch Normalization                           /22
Total                                           /100
Name of Student:
SUNetID:
@stanford.edu
The Stanford University Honor Code:
I attest that I have not given or received aid in this examination,
and that I have done my share and taken an active part in seeing
to it that others as well as myself uphold the spirit and letter of the
Honor Code.
Signed:
1. [12 points] Multiple Choice
For each question except (a), choose the single best answer. For question (a), choose
all of the correct answers (there may be multiple correct answers).
(a) [2 points] Which of the following are hyperparameters? Select all of the correct
answers (there may be multiple correct answers for this problem).
(1) The number of hidden units in a two-layer neural network
(2) The cluster centroids in the k-means algorithm
(3) λ in the regularized cost function J(θ) = (1/(2n)) Σ_{i=1}^n (y^{(i)} − h_θ(x^{(i)}))² + λ‖θ‖₂²
(4) The probability φj in the Naive Bayes classifier
(b) [2 points] Which of the following statements is false about neural networks?
(1) Without nonlinear activation functions, the hypothesis function hθ (x) of a
multi-layer feed-forward neural network is a linear function of the input x
just like in a linear model.
(2) When we use multilayer feed-forward neural networks to solve a binary classification problem, we can apply mini-batch stochastic gradient descent together with a 0-1 loss, i.e. the per-example loss ℓ(ŷ, y) = 1{ŷ ≠ y}, where ŷ ∈ {0, 1} is the prediction of the neural network and y ∈ {0, 1} is the ground
truth label.
(3) Nonlinear activation functions are usually applied element-wise.
(4) ReLU(x) = max {x, 0}.
(c) [2 points] Which of the following machine learning algorithms can be kernelized?
(1) Linear regression with L2 regularization on the weights but not perceptron
(2) Perceptron but not linear regression with L2 regularization on the weights
(3) Both linear regression with L2 regularization on the weights and perceptron
(4) Neither linear regression with L2 regularization on the weights nor perceptron
(d) [2 points] Suppose we have a training set of n examples. Let X be the n × d
design matrix, where each row consists of the input features of an example, and
let ~y be the n-dimensional vector consisting of the corresponding target values.
Let θ be the d-dimensional parameter vector. The least squares loss that we have
seen in class is J(θ) = (1/2)(Xθ − ~y)^T (Xθ − ~y). Now consider the weighted least
squares loss, which can be formulated as J(θ) = (1/2)(Xθ − ~y)^T W (Xθ − ~y), where
W is an n × n diagonal matrix with the weights for each example on the diagonal.
What is the normal equation for weighted least squares?
(1) X^T W X θ = X^T ~y
(2) X^T X θ = X^T W ~y
(3) X^T W X θ = X^T W ~y
(4) A normal equation does not exist for weighted least squares
(e) [2 points] Let’s consider Logistic Regression and Gaussian Discriminant Analysis
for binary classification. Which of the following statements is true?
(1) Logistic Regression makes no modeling assumptions
(2) Gaussian Discriminant Analysis makes no modeling assumptions
(3) Logistic Regression makes stronger modeling assumptions than Gaussian Discriminant Analysis
(4) Gaussian Discriminant Analysis makes stronger modeling assumptions than
Logistic Regression
(f) [2 points] Note: this question is somewhat harder than the others.
Suppose we have two non-overlapping datasets a and b, where all the examples
in a and b are distinct. Concretely, we have na examples in dataset a with input
x_a^{(i)} ∈ R and target y_a^{(i)} ∈ R for i ∈ {1, 2, . . . , na}, and we have nb examples in
dataset b with input x_b^{(i)} ∈ R and target y_b^{(i)} ∈ R for i ∈ {1, 2, . . . , nb}. When
we perform linear regression on dataset a to learn the parameters βa ∈ R and
γa ∈ R by minimizing the cost function Ja(βa, γa) = (1/2) Σ_{i=1}^{na} (y_a^{(i)} − βa x_a^{(i)} − γa)²,
we get β̂a > 0. When we perform linear regression on dataset b by minimizing
Jb(βb, γb) = (1/2) Σ_{i=1}^{nb} (y_b^{(i)} − βb x_b^{(i)} − γb)², we get β̂b > 0. If we combine the
two datasets a and b to get dataset c with inputs [x_a^{(1)}, . . . , x_a^{(na)}, x_b^{(1)}, . . . , x_b^{(nb)}]
and targets [y_a^{(1)}, . . . , y_a^{(na)}, y_b^{(1)}, . . . , y_b^{(nb)}], and we perform linear regression on the
combined dataset c by minimizing Jc(βc, γc) = (1/2) Σ_{i=1}^{na+nb} (y_c^{(i)} − βc x_c^{(i)} − γc)²,
what is the sign of β̂c?
(1) Positive
(2) Negative
(3) It depends on the actual dataset
2. [22 points] MLE for the Laplace Distribution
We start by defining and considering the Laplace distribution. The Laplace distribution parametrized by b > 0, which we denote by Laplace(b), has the probability
density function
p(y; b) = (1/(2b)) exp(−|y|/b)
for y ∈ R.
Both the Gaussian distribution and the Laplace distribution are continuous distributions which can produce values in (−∞, ∞), and the forms of their probability
density functions are similar with the parameter b playing a role similar to σ for the
Gaussian distribution. However, the Laplace distribution has “fatter” tails than the
Gaussian: the tails decay more slowly for the Laplace distribution. Here is a plot of
the Laplace distribution's probability density function:
[Figure: probability density function of the Laplace distribution]
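To make the tail comparison concrete, here is a minimal Python sketch; the choice of b = 1 and of a variance-matched Gaussian (σ² = 2b²) are illustration choices, not part of the exam:

```python
import numpy as np
from scipy.stats import laplace, norm

# Compare tail probabilities P(|Y| > t) for Laplace(b = 1) and a Gaussian with the
# same variance (the variance of Laplace(b) is 2b^2, so sigma^2 = 2 here). The Laplace
# tail decays like exp(-t/b), the Gaussian tail like exp(-t^2 / (2 sigma^2)).
b = 1.0
sigma = np.sqrt(2.0 * b**2)
for t in [2.0, 4.0, 6.0]:
    p_laplace = 2 * laplace.sf(t, scale=b)      # P(|Y| > t) under Laplace(b)
    p_gaussian = 2 * norm.sf(t, scale=sigma)    # P(|Y| > t) under N(0, sigma^2)
    print(f"t = {t}: Laplace tail = {p_laplace:.2e}, Gaussian tail = {p_gaussian:.2e}")
```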
(a) [4 points] As a warmup, show that the distribution Laplace(b) is a member of the
exponential family. Recall that distributions from the exponential family have
probability density functions of the form:
p(y; η) = b(y) exp(η^T T(y) − a(η))
Clearly specify what b(y), η, T (y) and a(η) should be.
(b) [12 points] Suppose we observe n training examples (x(i) , y (i) ), with x(i) ∈ R and
y (i) ∈ {0, 1} for i ∈ {1, ..., n}, that are assumed to be drawn i.i.d. according to
the following model¹:
y (i) ∼ Bernoulli(φ)
x(i) | y (i) = 0 ∼ Laplace(b0 )
x(i) | y (i) = 1 ∼ Laplace(b1 )
where b0 , b1 > 0 and φ ∈ [0, 1] are unknown parameters. Recall that the Bernoulli
distribution with mean φ, denoted by Bernoulli(φ), specifies a distribution over
y ∈ {0, 1}, so that p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ.
Find the Maximum Likelihood Estimates of φ, b0 and b1 by maximizing the joint
likelihood L(φ, b0, b1) = Π_{i=1}^n p(x^{(i)}, y^{(i)}; φ, b0, b1).
You can use the indicator functions 1{y (i) = 0} and 1{y (i) = 1} in your answer.
¹ Note that the Laplace probability density was defined above as a function of y. In this problem, x^{(i)} | y^{(i)} is drawn from the Laplace distribution.
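For intuition about this generative model, the following Python sketch samples (x^{(i)}, y^{(i)}) pairs from it; the parameter values φ = 0.3, b0 = 1, b1 = 3 are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, b0, b1 = 0.3, 1.0, 3.0                      # made-up parameter values for illustration
n = 1000

y = rng.binomial(1, phi, size=n)                 # y^(i) ~ Bernoulli(phi)
b = np.where(y == 1, b1, b0)                     # class-conditional scale parameter
x = rng.laplace(loc=0.0, scale=b, size=n)        # x^(i) | y^(i) ~ Laplace(b_{y^(i)})
# The MLE in this part would be computed from such (x^(i), y^(i)) pairs.
```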
(c) [6 points] Now that we’ve learned the parameters φ, b0 , and b1 , we will use them
to predict y given a new x. Show that
p(y = 1 | x; φ, b0, b1) = 1 / (1 + exp(−f(|x|)))
where f(|x|) is some function of |x|. Assume b0 ≠ b1. Explicitly specify what
f (|x|) is. Note that your answer for f (|x|) should depend on |x|, φ, b0 , and b1 .
You do NOT need to plug in the estimates for φ, b0 , and b1 that you found in
part (b).
3. [26 points] Moment Matching in Exponential Families
We learned in class that an exponential family distribution is one whose probability
density can be represented as
p(y; η) = b(y) exp(η^T T(y) − a(η)),
where η is the natural parameter of the distribution, T (y) is the sufficient statistic,
a(η) is the log partition function, and b(y) is the base measure.
(a) [8 points] Suppose we are given N data points y (1) , y (2) , . . . , y (N ) that are drawn
i.i.d. from the exponential family distribution p(y; η). The log-likelihood of the
data is
ℓ(η) = log( Π_{i=1}^N p(y^{(i)}; η) ).     (1)
Show that the log-likelihood of the data, `(η), is maximized with respect to the
natural parameter η when
E[T(Y); η] = (1/N) Σ_{i=1}^N T(y^{(i)})     (2)
In other words, you are asked to show that the maximizer of equation (1) satisfies
equation (2).
Note that this equation can be interpreted as a moment-matching condition:
the expected sufficient statistics equal the observed sufficient statistics. The
expected sufficient statistic E[T(Y); η] ≜ ∫ T(y) p(y; η) dy is the expectation of
T(Y) when Y is distributed according to p(y; η).
Hint: The solution to one of the questions in Problem Set 1 may be helpful for
this problem.
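As a numerical illustration of the moment-matching condition (2), the sketch below uses the Poisson distribution, an exponential family with T(y) = y and E[T(Y); η] = λ; the data, rate value, and grid search are illustration choices:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
y = rng.poisson(lam=4.2, size=5_000)   # i.i.d. draws from Poisson(lambda = 4.2), made-up data

# Maximize the Poisson log-likelihood over a grid of lambda values and check that the
# maximizer matches the observed average sufficient statistic (1/N) sum_i T(y^(i)) = mean(y),
# since E[T(Y); lambda] = lambda for the Poisson family.
lams = np.linspace(1.0, 10.0, 2001)
loglik = np.array([np.sum(y * np.log(l) - l - gammaln(y + 1)) for l in lams])
print(lams[np.argmax(loglik)])   # maximizer of the log-likelihood
print(y.mean())                  # observed average sufficient statistic (agrees up to grid resolution)
```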
(b) [8 points] Consider the multivariate Gaussian distribution N (µ, Σ) in d-dimensions
with mean vector µ and covariance matrix Σ. The probability density function
is given by
p(y; µ, Σ) = 1/((2π)^{d/2} |Σ|^{1/2}) exp(−(1/2)(y − µ)^T Σ^{−1} (y − µ)).
Show that the multivariate Gaussian distribution is in the exponential family,
and clearly state the values for η and T (y). You do not have to state the values
for b(y) and a(η), and you may assume the existence of b(y) and a(η).
Hint: The following hint is useful in one way of solving the question. Let A and
B be two n × n matrices, and let vec(A) ∈ R^{n²} denote the column vector which is
the matrix A "flattened", i.e. the elements of vec(A) are A_{i,j} for 1 ≤ i ≤ n, 1 ≤
j ≤ n. You may find one of the following identities helpful: Σ_{i=1}^n Σ_{j=1}^n A_{i,j} B_{i,j} =
vec(A)^T vec(B) and trace(AB) = vec(A)^T vec(B^T).
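As a quick sanity check of the two identities in the hint, a short NumPy sketch (random 3 × 3 matrices, chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# sum_i sum_j A_ij * B_ij = vec(A)^T vec(B)
print(np.isclose(np.sum(A * B), A.ravel() @ B.ravel()))
# trace(AB) = vec(A)^T vec(B^T)
print(np.isclose(np.trace(A @ B), A.ravel() @ B.T.ravel()))
```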
(c) [10 points] Suppose we are given N data points y (1) , y (2) , . . . , y (N ) that are
drawn i.i.d. from the multivariate Gaussian distribution N (µ, Σ) in d-dimensions
with mean vector µ and covariance matrix Σ. We learned in class that for Y ∼
N (µ, Σ), E[Y] = µ and E[(Y − µ)(Y − µ)^T] = Σ. Using these two properties of
Gaussians and the results in parts (a) and (b), show that the maximum likelihood
estimates of the parameters µ and Σ are given by
µ = (1/N) Σ_{i=1}^N y^{(i)}
Σ = (1/N) Σ_{i=1}^N (y^{(i)} − µ)(y^{(i)} − µ)^T.
Note that this problem shows another way to derive the MLE without explicitly
computing derivatives of the log-likelihood with respect to µ and Σ.
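The stated estimates can be checked numerically; the sketch below (with made-up µ, Σ, and sample size) verifies that the given formula for Σ coincides with the biased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=5_000)

N = len(Y)
mu_hat = Y.mean(axis=0)                          # (1/N) sum_i y^(i)
Sigma_hat = (Y - mu_hat).T @ (Y - mu_hat) / N    # (1/N) sum_i (y^(i) - mu)(y^(i) - mu)^T

print(np.allclose(Sigma_hat, np.cov(Y, rowvar=False, bias=True)))  # True: matches the biased sample covariance
```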
4. [18 points] EM for Mixture of Poissons
Suppose we have training data consisting of n independent, unlabeled examples
x^{(1)}, x^{(2)}, . . . , x^{(n)}. Each observed sample x^{(i)} ∈ N0 = {0, 1, 2, 3, . . .} follows a Poisson distribution with parameter λ_{z^{(i)}}, where z^{(i)} ∈ {1, 2, . . . , k} is the corresponding
latent (unobserved) variable indicating which Poisson distribution x^{(i)} belongs to.
Specifically, we have z^{(i)} ∼ Multinomial(φ) (where Σ_{j=1}^k φj = 1, φj ≥ 0 for all j, and
the parameter φj gives p(z^{(i)} = j)), and x^{(i)} | z^{(i)} = j ∼ Poisson(λj). In brief, the
training data follows a distribution consisting of a mixture of k Poissons.
Recall that at iteration t, the Expectation-Maximization (EM) Algorithm discussed
in lecture can be split into the E-step and M-step as follows:
• (E-step): For each i, j, set
    w_j^{(i)} := p(z^{(i)} = j | x^{(i)}; θ^{(t)})
• (M-step): Set
    θ^{(t+1)} := argmax_θ Σ_{i=1}^n Σ_{j=1}^k w_j^{(i)} log [ p(x^{(i)}, z^{(i)} = j; θ) / w_j^{(i)} ]
Here θ = (φ, λ) represents all parameters that we need to estimate.
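For intuition about the generative process, a short Python sketch that draws data from a mixture of k = 3 Poissons; the parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.5, 0.3, 0.2])       # mixing proportions phi_j, sum to 1 (made-up values)
lam = np.array([2.0, 10.0, 25.0])     # Poisson rates lambda_j (made-up values)
n = 1000

z = rng.choice(len(phi), size=n, p=phi)   # z^(i) ~ Multinomial(phi), the latent component
x = rng.poisson(lam[z])                   # x^(i) | z^(i) = j ~ Poisson(lambda_j)
# In the EM setting only x is observed; z is latent, and the E-step quantities w_j^(i)
# are the posterior probabilities p(z^(i) = j | x^(i); theta).
```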
(a) [2 points] Show that the M-step is equivalent to the following:
    θ^{(t+1)} := argmax_θ Σ_{i=1}^n Σ_{j=1}^k w_j^{(i)} log p(x^{(i)}, z^{(i)} = j; θ)
(b) [4 points] Complete the E-step by finding an expression for w_j^{(i)} in terms of φ^{(t)},
λ^{(t)} and x^{(i)}. Further, show that we can get [w_1^{(i)} . . . w_k^{(i)}]^T ∈ R^k by applying
the softmax function to some k-dimensional vector, and specify what this vector
is in terms of φ^{(t)}, λ^{(t)} and x^{(i)}.
Recall that if X follows a Poisson distribution with parameter λ, then
    P(X = x; λ) = e^{−λ} λ^x / x!
(c) [12 points] Complete the M-step by finding the expressions for φ_j^{(t+1)} and λ_j^{(t+1)}
through solving
    θ^{(t+1)} := argmax_θ Σ_{i=1}^n Σ_{j=1}^k w_j^{(i)} log p(x^{(i)}, z^{(i)} = j; θ)
5. [22 points] Batch Normalization
Machine learning algorithms tend to give better results when input features are
normalized and uncorrelated. For this reason, data normalization is a common preprocessing step in deep learning. However, the distribution of each layer’s inputs
may also vary over time and across layers, sometimes making training deep networks
more difficult. Batch Normalization is a method that draws its strength from making
normalization a part of the model architecture and performing the normalization for
each training mini-batch.
Given a batch B = (x(1) , ..., x(m) ) of vectors in Rd , we first normalize inputs into
(x̂(1) , ..., x̂(m) ) and then linearly transform them into (y (1) , ..., y (m) ). A Batch Normalization layer also has parameters γ ∈ Rd and β ∈ Rd . The layer works as follows:
Figure 1: Batch Normalization
First, compute the batch mean vector
    µ^{(B)} = (1/m) Σ_{i=1}^m x^{(i)} ∈ R^d     (3)
Then, compute the batch variance vector v^{(B)} ∈ R^d, which is defined element-wise for
any j ∈ {1, ..., d} by:
    v_j^{(B)} = (1/m) Σ_{i=1}^m (x_j^{(i)} − µ_j^{(B)})² ∈ R     (4)
Then, normalize the vectors in B by computing vectors (x̂^{(1)}, ..., x̂^{(m)}) such that x̂^{(i)} ∈
R^d is defined element-wise for any j ∈ {1, ..., d} by:
    x̂_j^{(i)} = (x_j^{(i)} − µ_j^{(B)}) / √(v_j^{(B)})     (5)
Finally, output vectors (y^{(1)}, ..., y^{(m)}) in R^d defined by
    y^{(i)} = γ ⊙ x̂^{(i)} + β     (6)
where ⊙ is the element-wise vector multiplication.
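For reference, a minimal NumPy sketch of this forward pass, following equations (3)-(6) directly; the function name and the omission of the usual numerical-stability constant inside the square root are choices made here, not part of the exam:

```python
import numpy as np

def batchnorm_forward(X, gamma, beta):
    """Batch Normalization forward pass per equations (3)-(6).

    X is the (m, d) batch of input vectors; gamma and beta are (d,) parameters.
    Returns the outputs y along with x_hat and the batch statistics.
    """
    mu = X.mean(axis=0)                      # batch mean, eq. (3)
    v = ((X - mu) ** 2).mean(axis=0)         # batch variance, eq. (4)
    x_hat = (X - mu) / np.sqrt(v)            # normalization, eq. (5)
    y = gamma * x_hat + beta                 # element-wise scale and shift, eq. (6)
    return y, x_hat, mu, v
```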
In this question, you will derive the back-propagation rules for batch-normalization.
Let L = L(y (1) , . . . , y (m) ) be some scalar function of the batch-normalization output
vectors. We will calculate the gradients of L with respect to the parameters and some
intermediate variables.
(a) [6 points] Calculate the gradient of L w.r.t. β, γ, and x̂^{(i)} for i ∈ {1, ..., m}; i.e.
calculate ∂L/∂β, ∂L/∂γ, and ∂L/∂x̂^{(i)} for i ∈ {1, ..., m}. Your answer can and should depend
on ∂L/∂y^{(i)} and the parameters and variables in the forward pass (such as x̂^{(i)}'s). Your
answer may be vectorized or un-vectorized (e.g. ∂L/∂β or ∂L/∂β_j for j ∈ {1, ..., d}).
(b) [4 points] Calculate the gradient of L w.r.t. v^{(B)}; i.e. calculate ∂L/∂v^{(B)}. Your answer
can depend on ∂L/∂y^{(i)} and the parameters and variables in the forward pass, as well
as the gradients that you calculated in part (a). Your answer may be vectorized
or un-vectorized (∂L/∂v^{(B)} or ∂L/∂v_j^{(B)} for j ∈ {1, ..., d}).
(c) [6 points] Show that the gradient of L w.r.t. µ^{(B)} is
    ∂L/∂µ^{(B)} = (−1/√(v^{(B)})) Σ_{i=1}^m ∂L/∂x̂^{(i)}
Please show all of your work.
Hint: Consider applying the chain rule using x̂_j^{(i)} for i ∈ {1, ..., m} as the inter-
mediate variables, and note that x̂_j^{(i)} depends on µ_j^{(B)} directly through µ_j^{(B)} and
indirectly through v_j^{(B)}.
(d) [6 points] Show that the gradient of L w.r.t. x^{(i)} for i ∈ {1, ..., m} is
    ∂L/∂x^{(i)} = (1/√(v^{(B)})) ∂L/∂x̂^{(i)} + (2(x^{(i)} − µ^{(B)})/m) ∂L/∂v^{(B)} + (1/m) ∂L/∂µ^{(B)}
Please show all of your work.
Hint: Consider applying the chain rule using x̂_j^{(k)} for k ∈ {1, ..., m} as the
intermediate variables, and note that x̂_j^{(k)} for k ∈ {1, ..., m} depends on x_j^{(i)}
directly through x_j^{(i)} for k = i and indirectly through µ_j^{(B)} and v_j^{(B)} for k ∈
{1, ..., m}.
Remarks on the broader context: After obtaining ∂L/∂x^{(i)} (as a function of
∂L/∂y^{(i)} and other quantities known in the forward pass), one can propagate the
gradient backwards to other layers that generated the x^{(i)}'s (you are not asked
to show this). Empirically, it turns out to be important to consider µ^{(B)} and
v^{(B)} as variables (instead of constants), so that in the chain rule, we consider the
gradient through µ^{(B)} and v^{(B)}.
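One way to see the effect of treating µ^{(B)} and v^{(B)} as variables is a finite-difference check that simply re-runs the forward pass after perturbing one input coordinate, so the batch statistics change along with it. A minimal self-contained Python sketch, using a toy loss L = Σ‖y^{(i)}‖² and made-up data:

```python
import numpy as np

def bn_forward(X, gamma, beta):
    # Forward pass from equations (3)-(6).
    mu = X.mean(axis=0)
    v = ((X - mu) ** 2).mean(axis=0)
    x_hat = (X - mu) / np.sqrt(v)
    return gamma * x_hat + beta

def loss(X, gamma, beta):
    # A toy scalar loss L(y^(1), ..., y^(m)) used only for the check.
    return np.sum(bn_forward(X, gamma, beta) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))                      # m = 4 examples, d = 3 features (made up)
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)

# Central-difference estimate of dL/dx_j^(i); because the forward pass is recomputed,
# mu^(B) and v^(B) are automatically treated as functions of the inputs.
i, j, h = 2, 1, 1e-5
Xp, Xm = X.copy(), X.copy()
Xp[i, j] += h
Xm[i, j] -= h
print((loss(Xp, gamma, beta) - loss(Xm, gamma, beta)) / (2 * h))
```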
That’s all! Congratulations on completing the midterm exam!