Machine Learning Final Exam
Student No.:        Name:
104/6
Part I. True/False (30%)
( ) 1. Principal components analysis (PCA) and linear discriminant analysis (LDA) are both supervised dimensionality reduction methods.
( ) 2. Clustering methods find correlations between variables and thus group variables. Dimensionality reduction methods find similarities between instances and thus group instances.
( ) 3. k-means clustering is a local search procedure, and the final cluster means m_i are highly dependent on the initial m_i. Moreover, the k-means clustering algorithm is used to solve the supervised learning problem.
( ) 4. Locally linear embedding recovers global nonlinear structure from locally linear fits. Its assumptions are that (1) each local patch of the manifold can be approximated linearly, and (2) given enough data, each point can be written as a linear, weighted sum of its neighbors.
( ) 5. Isomap uses the geodesic distances between all pairs of data points. For neighboring points that are close in the input space, Euclidean distance can be used. For faraway points, geodesic distance is approximated by the sum of the distances between the points along the way over the manifold.
( ) 6. "Similar inputs have similar outputs" is one assumption of nonparametric methods.
( ) 7. In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
( ) 8. The impurity measure $\phi$ of a classification tree should satisfy the following properties: (1) $\phi(1/2, 1/2) \ge \phi(p, 1-p)$ for any $p \in [0, 1]$, (2) $\phi(0, 1) = \phi(1, 0) = 0$, and (3) $\phi(p, 1-p)$ is increasing in $p$ on $[0, 1/2]$ and decreasing in $p$ on $[1/2, 1]$.
( ) 9. A decision tree is a hierarchical model using a divide-and-conquer strategy.
( ) 10. To remove subtrees in a decision tree, postpruning is faster and prepruning is more accurate.
( ) 11. Being a discriminant-based method, the SVM cares only about the instances close to the boundary and discards those that lie in the interior.
( ) 12. Entropy in information theory specifies the maximum number of bits needed to encode the classification accuracy of an instance.
( ) 13. Knowledge of any sort related to the application should be built into the network structure whenever possible. These are called hints.
( ) 14. In a multilayer perceptron, if the number of hidden units is less than the number of inputs, the first layer performs a dimensionality reduction.
( ) 15. In SIMD machines, different processors may execute different instructions on different data.
Part II. Short Answer
1. (3%) What is the difference between feature selection methods and feature extraction methods?
2. (2%) Can you explain what Isomap is?
3. (3%) What are the differences between the parametric density estimation methods and the
nonparametric density estimation methods?
4. (4%) LDA (linear discriminant analysis) is a supervised method for dimensionality reduction for classification problems. What are the assumptions of LDA used to find the transformation matrix w?
5. (4%) Draw two-class, two-dimensional data such that
(a) PCA and LDA find the same direction and
(b) PCA and LDA find totally different directions.
6. (3%) Please explain the following algorithm.
7. (2%) The Condensed Nearest Neighbor algorithm is used to find a subset Z of X that is small and accurate in classifying X. Please finish the following Condensed Nearest Neighbor algorithm.
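Since the exam's pseudocode is not reproduced in this transcript, the following is a minimal sketch of the usual condensed 1-NN idea (add to Z any instance that the current Z misclassifies, and repeat until a full pass adds nothing); the NumPy data format and variable names are assumptions, not the exam's notation.

```python
# A minimal sketch of Condensed Nearest Neighbor with a 1-NN classifier.
# X: (N, d) NumPy array of instances, y: (N,) array of labels (assumed format).
import numpy as np

def condense(X, y):
    Z_idx = [0]                       # seed Z with one stored instance
    changed = True
    while changed:                    # keep passing over X until Z stops growing
        changed = False
        for i in range(len(X)):
            # classify x_i by its nearest neighbor in the current condensed set Z
            dists = np.linalg.norm(X[Z_idx] - X[i], axis=1)
            nearest = Z_idx[int(np.argmin(dists))]
            if y[nearest] != y[i]:    # misclassified by Z, so store this instance
                Z_idx.append(i)
                changed = True
    return X[Z_idx], y[Z_idx]
```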
8. (3%) Given a two-dimensional dataset as follows, please show the dendrogram of the complete-link clustering result. The complete-link distance between two groups $G_i$ and $G_j$ is
$$d(G_i, G_j) = \max_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s), \qquad d(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|.$$
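As an illustration of the distance above, a minimal sketch in Python; the 2-D points are hypothetical placeholders, not the exam's dataset.

```python
# Complete-link clustering on a hypothetical 2-D dataset (not the exam's data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 4.5], [9.0, 1.0]])

# Complete-link distance between two groups: the largest pairwise distance,
# with d(x^r, x^s) = sum_j |x_j^r - x_j^s| (city-block) as in the formula above.
def complete_link(Gi, Gj):
    return max(np.sum(np.abs(a - b)) for a in Gi for b in Gj)

print(complete_link(X[:2], X[2:4]))          # distance between two small groups

# The same linkage rule via SciPy, then the dendrogram of the result.
Z = linkage(X, method="complete", metric="cityblock")
dendrogram(Z)
plt.show()
```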
9. (3%) Given a k-nearest neighbor density estimate as follows:
$$\hat{p}(x) = \frac{k}{2 N d_k(x)}$$
where $d_k(x)$ is the distance to the k-th nearest sample and $N$ is the total number of samples. Given the result of the k-nearest neighbor density estimator as follows, what is the value of $k$?
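A minimal one-dimensional sketch of this estimator; the sample values below are hypothetical, not the data behind the exam's plot.

```python
# k-nearest-neighbor density estimate p_hat(x) = k / (2 * N * d_k(x)) in 1-D.
import numpy as np

def knn_density(x, sample, k):
    N = len(sample)
    d_k = np.sort(np.abs(sample - x))[k - 1]   # distance to the k-th nearest sample
    return k / (2 * N * d_k)

sample = np.array([1.0, 1.2, 2.0, 2.1, 3.5, 4.0])   # hypothetical sample
print(knn_density(2.5, sample, k=3))
```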
10. (4%) In nonparametric regression, given a running mean smoother as follows, please finish the graph with $h = 1$:
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$
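A minimal sketch of this smoother, following the $b(x, x^t)$ definition given above (same-bin indicator with bin width h = 1); the training pairs are hypothetical.

```python
# Running mean smoother with bin width h: g_hat(x) is the average of r^t over
# the training inputs x^t that fall in the same bin as x. Hypothetical data.
import numpy as np

def running_mean(x, X, R, h=1.0, origin=0.0):
    same_bin = np.floor((X - origin) / h) == np.floor((x - origin) / h)
    if not same_bin.any():
        return np.nan                  # no training point shares x's bin
    return R[same_bin].mean()

X = np.array([0.2, 0.7, 1.1, 1.8, 2.5, 3.3])   # x^t (hypothetical)
R = np.array([1.0, 1.4, 0.9, 1.1, 2.0, 2.2])   # r^t (hypothetical)
print(running_mean(1.5, X, R, h=1.0))          # average of r^t for x^t in [1, 2)
```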
11. (6%) Given a regression tree as follows.
(1) Please draw its corresponding regression result.
(2) Could you show one rule which is extracted from this regression tree?
(3) In this case, what are the meanings of a leaf node and an internal node in a decision tree?
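Since the exam's tree figure is not reproduced here, a generic sketch of the idea behind parts (2) and (3): each internal node tests an attribute against a threshold, each leaf stores the average output of the training instances reaching it, and a root-to-leaf path reads as a rule. The thresholds and values below are made up.

```python
# A hypothetical one-input regression tree (made-up thresholds and leaf values).
# The path x < 3.0 and x >= 1.0 corresponds to the rule:
#   IF (x < 3.0) AND (x >= 1.0) THEN y = 2.4
def predict(x):
    if x < 3.0:                          # internal node: a test on the input attribute
        return 1.2 if x < 1.0 else 2.4   # leaves: average output in each region
    return 4.0                           # leaf for the region x >= 3.0
```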
12. (4%) In the pairwise separation example as follows, $H_{ij}$ indicates the hyperplane separating the examples of $C_i$ from the examples of $C_j$. Please decide which class each region belongs to.
$$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$$
$$g_{ij}(x) \;\begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$
$$\text{choose } C_i \text{ if } \forall j \ne i,\; g_{ij}(x) > 0$$
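A minimal sketch of the decision rule above: x is assigned to $C_i$ when $g_{ij}(x) > 0$ for every $j \ne i$. The 2-D weight vectors are hypothetical, not the hyperplanes $H_{ij}$ of the exam figure.

```python
# Pairwise (one-vs-one) linear separation with hypothetical weights for K = 3.
import numpy as np

K = 3
# w[(i, j)], w0[(i, j)] parameterize g_ij(x) = w_ij^T x + w_ij0 for i < j;
# by convention g_ji(x) = -g_ij(x).
w = {(0, 1): np.array([1.0, -1.0]), (0, 2): np.array([1.0, 1.0]),
     (1, 2): np.array([0.0, 1.0])}
w0 = {(0, 1): 0.0, (0, 2): -2.0, (1, 2): -1.0}

def g(i, j, x):
    if (i, j) in w:
        return w[(i, j)] @ x + w0[(i, j)]
    return -(w[(j, i)] @ x + w0[(j, i)])

def choose_class(x):
    for i in range(K):
        if all(g(i, j, x) > 0 for j in range(K) if j != i):
            return i
    return None   # region where no class wins every pairwise test ("don't care")

print(choose_class(np.array([2.0, 0.5])))   # -> 0 for these hypothetical weights
```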
13. (6%) Given a classification tree construction algorithm as follows, where
$$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i \quad \text{(eq. 9.3)} \qquad \text{and} \qquad I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i \quad \text{(eq. 9.8)}$$
can you explain what the functions "GenerateTree" and "SplitAttribute" do?
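Offered as an orientation rather than the exam's pseudocode: a minimal sketch in which SplitAttribute picks the attribute minimizing the split entropy $I'_m$ of eq. 9.8, and GenerateTree recursively splits a node until its entropy $I_m$ (eq. 9.3) falls below a purity threshold theta. Discrete integer attribute values and the nested-dictionary tree format are assumptions.

```python
# Sketch of the two impurity measures and of GenerateTree / SplitAttribute for a
# classification tree over discrete attributes (assumed data format: NumPy arrays).
import numpy as np

def node_entropy(y, K):
    """I_m = -sum_i p_m^i log2 p_m^i  (eq. 9.3)."""
    p = np.bincount(y, minlength=K) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_entropy(values, y, K):
    """I'_m = -sum_j (N_mj / N_m) sum_i p_mj^i log2 p_mj^i  (eq. 9.8)."""
    N_m = len(y)
    return sum((np.sum(values == v) / N_m) * node_entropy(y[values == v], K)
               for v in np.unique(values))

def split_attribute(X, y, K):
    """Return the attribute whose split minimizes I'_m (None if nothing to split)."""
    candidates = [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) > 1]
    if not candidates:
        return None
    return min(candidates, key=lambda j: split_entropy(X[:, j], y, K))

def generate_tree(X, y, K, theta=0.2):
    """Grow the tree: make a leaf (majority class) if the node is pure enough,
    otherwise branch on the best attribute and recurse on each branch."""
    j = split_attribute(X, y, K)
    if node_entropy(y, K) < theta or j is None:
        return {"leaf": int(np.bincount(y, minlength=K).argmax())}
    return {"attribute": j,
            "branches": {int(v): generate_tree(X[X[:, j] == v], y[X[:, j] == v], K, theta)
                         for v in np.unique(X[:, j])}}
```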
14. (3%) Please assign the weights of the multilayer perceptron to solve the following problem.
15. (3%) In a neural network, can we have more than one hidden layer? Why or why not?
16. (3%) Why does a neural network overtrain (or overfit)?
17. (4%) (1) What are support vectors in a support vector machine?
(2) Given an example as follows, please show the support vectors.
Part III. Computation and Proof
1. (5%) Using principal components analysis, we can find a low-dimensional space such that when $x$ is projected there, information loss is minimized. Let the projection of $x$ on the direction of $w$ be $z = w^T x$. PCA finds $w$ such that $\mathrm{Var}(z)$ is maximized, where
$$\mathrm{Var}(z) = w^T \Sigma w \quad \text{and} \quad \mathrm{Var}(x) = E[(x - \mu)(x - \mu)^T] = \Sigma.$$
If $z_1 = w_1^T x$ with $\mathrm{Cov}(x) = \Sigma$, then $\mathrm{Var}(z_1) = w_1^T \Sigma w_1$, and we maximize $\mathrm{Var}(z_1)$ subject to $\|w_1\| = 1$.
Please show that the first principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
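A sketch of the standard Lagrange-multiplier argument, offered as a reference derivation rather than the official solution:

```latex
% Maximize w_1^T \Sigma w_1 subject to w_1^T w_1 = 1 with a Lagrange multiplier \alpha.
\begin{aligned}
L &= w_1^{T} \Sigma w_1 - \alpha \,(w_1^{T} w_1 - 1), \\
\frac{\partial L}{\partial w_1} &= 2 \Sigma w_1 - 2 \alpha w_1 = 0
  \;\Longrightarrow\; \Sigma w_1 = \alpha w_1, \\
\operatorname{Var}(z_1) &= w_1^{T} \Sigma w_1 = \alpha \, w_1^{T} w_1 = \alpha .
\end{aligned}
```

So any stationary point $w_1$ is an eigenvector of $\Sigma$ with eigenvalue $\alpha$, and $\mathrm{Var}(z_1) = \alpha$; the variance is therefore maximized by the eigenvector with the largest eigenvalue.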
2. (5%) Given a sample of two classes, $X = \{x^t, r^t\}_t$, where $r^t = 1$ if $x^t \in C_1$ and $r^t = 0$ if $x^t \in C_2$. In logistic discrimination, assuming that the log likelihood ratio is linear in the two-class case, the estimator of $P(C_1 \mid x)$ is the sigmoid function
$$y = \hat{P}(C_1 \mid x) = \frac{1}{1 + \exp\!\left[-(w^T x + w_0)\right]}.$$
We assume $r^t$, given $x^t$, is Bernoulli distributed. Then the sample likelihood is
$$l(w, w_0 \mid X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t},$$
and the cross-entropy is
$$E(w, w_0 \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right].$$
Please find the update equations of $w_j$ and $w_0$, where
$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j}, \qquad \Delta w_0 = -\eta \frac{\partial E}{\partial w_0}, \qquad j = 1, \ldots, d.$$
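A sketch of the usual chain-rule computation, offered as a reference rather than the official solution, using $dy/da = y(1 - y)$ for the sigmoid with $a = w^T x + w_0$:

```latex
% Gradient of the cross-entropy and the resulting update equations.
\begin{aligned}
\frac{\partial E}{\partial w_j}
  &= -\sum_t \left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t)\, x_j^t
   = -\sum_t (r^t - y^t)\, x_j^t, \\
\Delta w_j &= -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t (r^t - y^t)\, x_j^t,
\qquad
\Delta w_0 = \eta \sum_t (r^t - y^t).
\end{aligned}
```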