STANFORD UNIVERSITY
CS 229, Spring 2020 Midterm Examination

Question | Points
1. Multiple Choice | /12
2. MLE for the Laplace Distribution | /22
3. Moment Matching in Exponential Families | /26
4. EM for Mixture of Poissons | /18
5. Batch Normalization | /22
Total | /100

Name of Student:
SUNetID: @stanford.edu

The Stanford University Honor Code: I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the Honor Code.
Signed:

1. [12 points] Multiple Choice

For each question except (a), choose the single best answer. For question (a), choose all of the correct answers (there may be multiple correct answers).

(a) [2 points] Which of the following are hyperparameters? Select all of the correct answers (there may be multiple correct answers for this problem).
(1) The number of hidden units in a two-layer neural network
(2) The cluster centroids in the k-means algorithm
(3) $\lambda$ in the regularized cost function $\frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda\|\theta\|_2^2$
(4) The probability $\phi_j$ in the Naive Bayes classifier

(b) [2 points] Which of the following statements is false about neural networks?
(1) Without nonlinear activation functions, the hypothesis function $h_\theta(x)$ of a multi-layer feed-forward neural network is a linear function of the input $x$, just like in a linear model.
(2) When we use multi-layer feed-forward neural networks to solve a binary classification problem, we can apply mini-batch stochastic gradient descent together with a 0-1 loss, i.e. the per-example loss $\ell(\hat{y}, y) = 1\{\hat{y} \neq y\}$, where $\hat{y} \in \{0, 1\}$ is the prediction of the neural network and $y \in \{0, 1\}$ is the ground-truth label.
(3) Nonlinear activation functions are usually applied element-wise.
(4) $\mathrm{ReLU}(x) = \max\{x, 0\}$.

(c) [2 points] Which of the following machine learning algorithms can be kernelized?
(1) Linear regression with L2 regularization on the weights, but not the perceptron
(2) The perceptron, but not linear regression with L2 regularization on the weights
(3) Both linear regression with L2 regularization on the weights and the perceptron
(4) Neither linear regression with L2 regularization on the weights nor the perceptron

(d) [2 points] Suppose we have a training set of $n$ examples. Let $X$ be the $n \times d$ design matrix, where each row consists of the input features of an example, and let $\vec{y}$ be the $n$-dimensional vector consisting of the corresponding target values. Let $\theta$ be the $d$-dimensional parameter vector. The least squares loss that we have seen in class is $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y})$. Now consider the weighted least squares loss, which can be formulated as $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T W (X\theta - \vec{y})$, where $W$ is an $n \times n$ diagonal matrix with the weights for each example on the diagonal. What is the normal equation for weighted least squares?
(1) $X^T W X \theta = X^T \vec{y}$
(2) $X^T X \theta = X^T W \vec{y}$
(3) $X^T W X \theta = X^T W \vec{y}$
(4) A normal equation does not exist for weighted least squares

(e) [2 points] Let's consider Logistic Regression and Gaussian Discriminant Analysis for binary classification. Which of the following statements is true?
When we perform linear regression on dataset a to learn the parameters βa ∈ R and 2 P (i) (i) a ya − βa xa − γa , γa ∈ R by minimizing the cost function Ja (βa , γa ) = 12 ni=1 we get β̂a > 0. When 2 regression on dataset b by minimizing we perform linear P (i) (i) b Jb (βb , γb ) = 12 ni=1 yb − βb xb − γb , we get β̂b > 0. If we combine the i h (1) (na ) (1) (nb ) two datasets a and b to get dataset c with inputs xa , . . . , xa , xb , . . . , xb i h (1) (n ) (1) (n ) and targets ya , . . . ya a , yb , . . . , yb b , and we perform linear regression on the 2 Pna +nb (i) (i) yc − βc xc − γc , combined dataset c by minimizing Jc (βc , γc ) = 21 i=1 what is the sign of β̂c ? (1) Positive (2) Negative (3) It depends on the actual dataset CS229 Midterm 4 2. [22 points] MLE for the Laplace Distribution We start by defining and considering the Laplace distribution. The Laplace distribution parametrized by b > 0, which we denote by Laplace(b), has the probability density function 1 |y| p(y; b) = exp − 2b b for y ∈ R. Both the Gaussian distribution and the Laplace distribution are continuous distributions which can produce values in (−∞, ∞), and the forms of their probability density functions are similar with the parameter b playing a role similar to σ for the Gaussian distribution. However, the Laplace distribution has “fatter” tails than the Gaussian: the tails decay more slowly for the Laplace distribution. Here is a plot of the Laplace distribution’s probability density function: (a) [4 points] As a warmup, show that the distribution Laplace(b) is a member of the exponential family. Recall that distributions from the exponential family have probability density functions of the form: p(y; η) = b(y) exp η T T (y) − a(η) Clearly specify what b(y), η, T (y) and a(η) should be. CS229 Midterm 5 (b) [12 points] Suppose we observe n training examples (x(i) , y (i) ), with x(i) ∈ R and y (i) ∈ {0, 1} for i ∈ {1, ..., n}, that are assumed to be drawn i.i.d. 
according to the following model¹:
$$y^{(i)} \sim \mathrm{Bernoulli}(\phi)$$
$$x^{(i)} \mid y^{(i)} = 0 \sim \mathrm{Laplace}(b_0)$$
$$x^{(i)} \mid y^{(i)} = 1 \sim \mathrm{Laplace}(b_1)$$
where $b_0, b_1 > 0$ and $\phi \in [0, 1]$ are unknown parameters. Recall that the Bernoulli distribution with mean $\phi$, denoted by $\mathrm{Bernoulli}(\phi)$, specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$ and $p(y = 0; \phi) = 1 - \phi$.

Find the Maximum Likelihood Estimates of $\phi$, $b_0$ and $b_1$ by maximizing the joint likelihood $L(\phi, b_0, b_1) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, b_0, b_1)$. You can use the indicator functions $1\{y^{(i)} = 0\}$ and $1\{y^{(i)} = 1\}$ in your answer.

¹ Note that the Laplace probability density was defined as a function of $y$ on the previous page. In this problem, $x^{(i)} \mid y^{(i)}$ is drawn from the Laplace distribution.

(c) [6 points] Now that we've learned the parameters $\phi$, $b_0$, and $b_1$, we will use them to predict $y$ given a new $x$. Show that
$$p(y = 1 \mid x; \phi, b_0, b_1) = \frac{1}{1 + \exp(-f(|x|))}$$
where $f(|x|)$ is some function of $|x|$. Assume $b_0 \neq b_1$. Explicitly specify what $f(|x|)$ is. Note that your answer for $f(|x|)$ should depend on $|x|$, $\phi$, $b_0$, and $b_1$. You do NOT need to plug in the estimates for $\phi$, $b_0$, and $b_1$ that you found in part (b).

3. [26 points] Moment Matching in Exponential Families

We learned in class that an exponential family distribution is one whose probability density can be represented as $p(y; \eta) = b(y)\exp(\eta^T T(y) - a(\eta))$, where $\eta$ is the natural parameter of the distribution, $T(y)$ is the sufficient statistic, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure.

(a) [8 points] Suppose we are given $N$ data points $y^{(1)}, y^{(2)}, \ldots, y^{(N)}$ that are drawn i.i.d. from the exponential family distribution $p(y; \eta)$. The log-likelihood of the data is
$$\ell(\eta) = \log\left(\prod_{i=1}^{N} p(y^{(i)}; \eta)\right). \tag{1}$$
Show that the log-likelihood of the data, $\ell(\eta)$, is maximized with respect to the natural parameter $\eta$ when
$$E[T(Y); \eta] = \frac{1}{N}\sum_{i=1}^{N} T(y^{(i)}) \tag{2}$$
In other words, you are asked to show that the maximizer of equation (1) satisfies equation (2). Note that this equation can be interpreted as a moment-matching condition: the expected sufficient statistic equals the observed sufficient statistic. The expected sufficient statistic $E[T(Y); \eta] \triangleq \int T(y)\, p(y; \eta)\, dy$ is the expectation of $T(Y)$ when $Y$ is distributed according to $p(y; \eta)$.
Hint: The solution to one of the questions in Problem Set 1 may be helpful for this problem.

(b) [8 points] Consider the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $d$ dimensions with mean vector $\mu$ and covariance matrix $\Sigma$. The probability density function is given by
$$p(y; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right).$$
Show that the multivariate Gaussian distribution is in the exponential family, and clearly state the values for $\eta$ and $T(y)$. You do not have to state the values for $b(y)$ and $a(\eta)$, and you may assume the existence of $b(y)$ and $a(\eta)$.
Hint: The following hint is useful in one way of solving the question. Let $A$ and $B$ be two $n \times n$ matrices, and let $\mathrm{vec}(A) \in \mathbb{R}^{n^2}$ denote the column vector which is the matrix $A$ "flattened", i.e. the elements of $\mathrm{vec}(A)$ are $A_{i,j}$ for $1 \leq i \leq n$, $1 \leq j \leq n$. You may find one of the following identities helpful: $\sum_{i=1}^{n}\sum_{j=1}^{n} A_{i,j} B_{i,j} = \mathrm{vec}(A)^T \mathrm{vec}(B)$ and $\mathrm{trace}(AB) = \mathrm{vec}(A)^T \mathrm{vec}(B^T)$.

(c) [10 points] Suppose we are given $N$ data points $y^{(1)}, y^{(2)}, \ldots, y^{(N)}$ that are drawn i.i.d. from the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $d$ dimensions with mean vector $\mu$ and covariance matrix $\Sigma$. We learned in class that for $Y \sim \mathcal{N}(\mu, \Sigma)$, $E[Y] = \mu$ and $E[(Y - \mu)(Y - \mu)^T] = \Sigma$.
Using these two properties of Gaussians and the results in parts (a) and (b), show that the maximum likelihood estimates of the parameters $\mu$ and $\Sigma$ are given by
$$\mu = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}$$
$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} (y^{(i)} - \mu)(y^{(i)} - \mu)^T.$$
Note that this problem shows another way to derive the MLE without explicitly computing derivatives of the log-likelihood with respect to $\mu$ and $\Sigma$.

4. [18 points] EM for Mixture of Poissons

Suppose we have training data consisting of $n$ independent, unlabeled examples $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$. Each observed sample $x^{(i)} \in \mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}$ follows a Poisson distribution with parameter $\lambda_{z^{(i)}}$, where $z^{(i)} \in \{1, 2, \ldots, k\}$ is the corresponding latent (unobserved) variable indicating which Poisson distribution $x^{(i)}$ belongs to. Specifically, we have $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ (where $\sum_{j=1}^{k}\phi_j = 1$, $\phi_j \geq 0$ for all $j$, and the parameter $\phi_j$ gives $p(z^{(i)} = j)$), and $x^{(i)} \mid z^{(i)} = j \sim \mathrm{Poisson}(\lambda_j)$. In brief, the training data follows a distribution consisting of a mixture of $k$ Poissons.

Recall that at iteration $t$, the Expectation-Maximization (EM) Algorithm discussed in lecture can be split into the E-step and M-step as follows:

• (E-step): For each $i, j$, set
$$w_j^{(i)} := p\left(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}\right)$$

• (M-step): Set
$$\theta^{(t+1)} := \arg\max_{\theta} \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log\frac{p\left(x^{(i)}, z^{(i)} = j; \theta\right)}{w_j^{(i)}}$$

Here $\theta = (\phi, \lambda)$ represents all parameters that we need to estimate.

(a) [2 points] Show that the M-step is equivalent to the following:
$$\theta^{(t+1)} := \arg\max_{\theta} \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log p\left(x^{(i)}, z^{(i)} = j; \theta\right)$$

(b) [4 points] Complete the E-step by finding an expression for $w_j^{(i)}$ in terms of $\phi^{(t)}$, $\lambda^{(t)}$ and $x^{(i)}$. Further, show that we can get $\left[w_1^{(i)} \ldots w_k^{(i)}\right]^T \in \mathbb{R}^k$ by applying the softmax function to some $k$-dimensional vector, and specify what this vector is in terms of $\phi^{(t)}$, $\lambda^{(t)}$ and $x^{(i)}$. Recall that if $X$ follows a Poisson distribution with parameter $\lambda$, then
$$P(X = x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$$
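As an illustration of the E-step in part (b), here is a minimal numpy sketch (not an official solution; the function name `e_step` and the arrays `phi`, `lam` standing for $\phi^{(t)}$ and $\lambda^{(t)}$ are my own). It computes the responsibilities as a row-wise softmax, using the fact that the $\log(x^{(i)}!)$ term is constant in $j$ and therefore cancels:

```python
import numpy as np

def e_step(x, phi, lam):
    """E-step for a mixture of k Poissons.

    x   : (n,) array of observed counts
    phi : (k,) mixing proportions (sums to 1)
    lam : (k,) Poisson rates
    Returns w, an (n, k) array with w[i, j] = p(z = j | x_i; phi, lam).
    """
    # log p(x_i, z=j) = log phi_j - lam_j + x_i * log lam_j - log(x_i!)
    # The -log(x_i!) term does not depend on j, so it cancels in the softmax.
    logits = np.log(phi) - lam + np.outer(x, np.log(lam))   # (n, k)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

The subtraction of the row maximum before exponentiating does not change the softmax output but prevents overflow for large counts.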
(c) [12 points] Complete the M-step by finding the expressions for $\phi_j^{(t+1)}$ and $\lambda_j^{(t+1)}$ through solving
$$\theta^{(t+1)} := \arg\max_{\theta} \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log p\left(x^{(i)}, z^{(i)} = j; \theta\right)$$

5. [22 points] Batch Normalization

Machine learning algorithms tend to give better results when input features are normalized and uncorrelated. For this reason, data normalization is a common preprocessing step in deep learning. However, the distribution of each layer's inputs may also vary over time and across layers, sometimes making training deep networks more difficult. Batch Normalization is a method that draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch.

Given a batch $B = (x^{(1)}, \ldots, x^{(m)})$ of vectors in $\mathbb{R}^d$, we first normalize the inputs into $(\hat{x}^{(1)}, \ldots, \hat{x}^{(m)})$ and then linearly transform them into $(y^{(1)}, \ldots, y^{(m)})$. A Batch Normalization layer also has parameters $\gamma \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$. The layer (Figure 1: Batch Normalization) works as follows:

First, compute the batch mean vector
$$\mu^{(B)} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \in \mathbb{R}^d \tag{3}$$
Then, compute the batch variance vector $v^{(B)} \in \mathbb{R}^d$, which is defined element-wise for any $j \in \{1, \ldots, d\}$ by:
$$v_j^{(B)} = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j^{(B)}\right)^2 \in \mathbb{R} \tag{4}$$
Then, normalize the vectors in $B$ by computing vectors $(\hat{x}^{(1)}, \ldots, \hat{x}^{(m)})$ such that $\hat{x}^{(i)} \in \mathbb{R}^d$ is defined element-wise for any $j \in \{1, \ldots, d\}$ by:
$$\hat{x}_j^{(i)} = \frac{x_j^{(i)} - \mu_j^{(B)}}{\sqrt{v_j^{(B)}}} \tag{5}$$
Finally, output vectors $(y^{(1)}, \ldots, y^{(m)})$ in $\mathbb{R}^d$ defined by
$$y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta \tag{6}$$
where $\odot$ is the element-wise vector multiplication.

In this question, you will derive the back-propagation rules for batch normalization. Let $L = L(y^{(1)}, \ldots, y^{(m)})$ be some scalar function of the batch-normalization output vectors. We will calculate the gradients of $L$ with respect to the parameters and some intermediate variables.

(a) [6 points] Calculate the gradient of $L$ w.r.t.
$\beta$, $\gamma$, and $\hat{x}^{(i)}$ for $i \in \{1, \ldots, m\}$; i.e. calculate $\frac{\partial L}{\partial \beta}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \hat{x}^{(i)}}$ for $i \in \{1, \ldots, m\}$. Your answer can and should depend on $\frac{\partial L}{\partial y^{(i)}}$ and the parameters and variables in the forward pass (such as the $\hat{x}^{(i)}$'s). Your answer may be vectorized or un-vectorized (e.g. $\frac{\partial L}{\partial \beta}$ or $\frac{\partial L}{\partial \beta_j}$ for $j \in \{1, \ldots, d\}$).

(b) [4 points] Calculate the gradient of $L$ w.r.t. $v^{(B)}$; i.e. calculate $\frac{\partial L}{\partial v^{(B)}}$. Your answer can depend on $\frac{\partial L}{\partial y^{(i)}}$ and the parameters and variables in the forward pass, as well as the gradients that you calculated in part (a). Your answer may be vectorized or un-vectorized ($\frac{\partial L}{\partial v^{(B)}}$ or $\frac{\partial L}{\partial v_j^{(B)}}$ for $j \in \{1, \ldots, d\}$).

(c) [6 points] Show that the gradient of $L$ w.r.t. $\mu^{(B)}$ is
$$\frac{\partial L}{\partial \mu^{(B)}} = \frac{-1}{\sqrt{v^{(B)}}} \odot \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}^{(i)}}$$
Please show all of your work.
Hint: Consider applying the chain rule using $\hat{x}_j^{(i)}$ for $i \in \{1, \ldots, m\}$ as the intermediate variables, and note that $\hat{x}_j^{(i)}$ depends on $\mu_j^{(B)}$ directly through $\mu_j^{(B)}$ and indirectly through $v_j^{(B)}$.

(d) [6 points] Show that the gradient of $L$ w.r.t. $x^{(i)}$ for $i \in \{1, \ldots, m\}$ is
$$\frac{\partial L}{\partial x^{(i)}} = \frac{1}{\sqrt{v^{(B)}}} \odot \frac{\partial L}{\partial \hat{x}^{(i)}} + \frac{2(x^{(i)} - \mu^{(B)})}{m} \odot \frac{\partial L}{\partial v^{(B)}} + \frac{1}{m}\frac{\partial L}{\partial \mu^{(B)}}$$
Please show all of your work.
Hint: Consider applying the chain rule using $\hat{x}_j^{(k)}$ for $k \in \{1, \ldots, m\}$ as the intermediate variables, and note that $\hat{x}_j^{(k)}$ for $k \in \{1, \ldots, m\}$ depends on $x_j^{(i)}$ directly through $x_j^{(i)}$ for $k = i$, and indirectly through $\mu_j^{(B)}$ and $v_j^{(B)}$ for $k \in \{1, \ldots, m\}$.

Remarks on the broader context: After obtaining $\frac{\partial L}{\partial x^{(i)}}$ (as a function of $\frac{\partial L}{\partial y^{(i)}}$ and other quantities known in the forward pass), one can propagate the gradient backwards to other layers that generated the $x^{(i)}$'s (you are not asked to show this). Empirically, it turns out to be important to consider $\mu^{(B)}$ and $v^{(B)}$ as variables (instead of constants), so that in the chain rule, we consider the gradient through $\mu^{(B)}$ and $v^{(B)}$.

That's all! Congratulations on completing the midterm exam!
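As a postscript, the forward pass of equations (3)-(6) and the gradient formulas from parts (a)-(d) of Problem 5 can be sketched in numpy. This is a minimal illustration, not an official solution: the function names are my own, and the small `eps` guard is an assumption not in the problem statement (which implicitly assumes $v^{(B)} > 0$).

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=0.0):
    """x: (m, d) batch; gamma, beta: (d,). Returns y and a cache for backprop."""
    mu = x.mean(axis=0)                      # eq. (3): batch mean
    v = ((x - mu) ** 2).mean(axis=0)         # eq. (4): batch variance
    xhat = (x - mu) / np.sqrt(v + eps)       # eq. (5): normalize
    y = gamma * xhat + beta                  # eq. (6): scale and shift
    return y, (x, xhat, mu, v, gamma, eps)

def batchnorm_backward(dy, cache):
    """dy: (m, d) array of dL/dy^(i). Returns dL/dx, dL/dgamma, dL/dbeta."""
    x, xhat, mu, v, gamma, eps = cache
    m = x.shape[0]
    dbeta = dy.sum(axis=0)                                   # part (a)
    dgamma = (dy * xhat).sum(axis=0)                         # part (a)
    dxhat = dy * gamma                                       # part (a)
    # part (b): dL/dv = sum_i dL/dxhat^(i) * (x^(i)-mu) * (-1/2) v^(-3/2)
    dv = (dxhat * (x - mu)).sum(axis=0) * (-0.5) * (v + eps) ** (-1.5)
    # part (c): dL/dmu = (-1/sqrt(v)) * sum_i dL/dxhat^(i)
    dmu = -dxhat.sum(axis=0) / np.sqrt(v + eps)
    # part (d): combine the direct path and the paths through v and mu
    dx = dxhat / np.sqrt(v + eps) + dv * 2 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```

A quick sanity check of the formulas: if $L = \sum_i \sum_j y_j^{(i)}$ with $\gamma = 1$, $\beta = 0$, then shifting any $x^{(i)}$ leaves the normalized outputs' sum unchanged, so $\frac{\partial L}{\partial x^{(i)}}$ should vanish, and indeed the direct term and the $\mu^{(B)}$ term cancel while the $v^{(B)}$ term is zero.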