Machine Learning Final Exam
Student No.:        Name:
104/6
Part I. True/False (30%)
( ) 1. Principal components analysis (PCA) and linear discriminant analysis (LDA) are both supervised dimensionality reduction methods.
( ) 2. Clustering methods find correlations between variables and thus group variables. Dimensionality reduction methods find similarities between instances and thus group instances.
( ) 3. k-means clustering is a local search procedure, and the final cluster means m_i are highly dependent on the initial m_i. Moreover, the k-means clustering algorithm is used to solve the supervised learning problem.
( ) 4. Locally linear embedding recovers global nonlinear structure from locally linear fits. Its assumptions are that (1) each local patch of the manifold can be approximated linearly, and (2) given enough data, each point can be written as a linear, weighted sum of its neighbors.
( ) 5. Isomap uses the geodesic distances between all pairs of data points. For neighboring points that are close in the input space, Euclidean distance can be used. For faraway points, geodesic distance is approximated by the sum of the distances between the points along the way over the manifold.
( ) 6. "Similar inputs have similar outputs" is one assumption of nonparametric methods.
( ) 7. In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
( ) 8. The impurity measure $\phi$ of a classification tree should satisfy the following properties: (1) $\phi(1/2, 1/2) \ge \phi(p, 1-p)$ for any $p \in [0, 1]$, (2) $\phi(0, 1) = \phi(1, 0) = 0$, and (3) $\phi(p, 1-p)$ is increasing in $p$ on $[0, 1/2]$ and decreasing in $p$ on $[1/2, 1]$.
( ) 9. A decision tree is a hierarchical model using a divide-and-conquer strategy.
( ) 10. To remove subtrees in a decision tree, postpruning is faster and prepruning is more accurate.
( ) 11. Being a discriminant-based method, the SVM cares only about the instances close to the boundary and discards those that lie in the interior.
( ) 12. Entropy in information theory specifies the maximum number of bits needed to encode the classification accuracy of an instance.
( ) 13. Knowledge of any sort related to the application should be built into the network structure whenever possible. These are called hints.
( ) 14. In a multilayer perceptron, if the number of hidden units is less than the number of inputs, the first layer performs a dimensionality reduction.
( ) 15. In SIMD machines, different processors may execute different instructions on different data.
Part II. Short Answer
1. (3%) What is the difference between feature selection methods and feature extraction methods?
2. (2%) Can you explain what Isomap is?
3. (3%) What are the differences between the parametric density estimation methods and the
nonparametric density estimation methods?
4. (4%) LDA (linear discriminant analysis) is a supervised method for dimensionality reduction for classification problems. What are the assumptions of LDA used to find the transformation matrix w?
5. (4%) Draw two-class, two-dimensional data such that
(a) PCA and LDA find the same direction and
(b) PCA and LDA find totally different directions.
6. (3%) Please explain the following algorithm.
7. (2%) The Condensed Nearest Neighbor algorithm is used to find a subset Z of X that is small and accurate in classifying X. Please finish the following Condensed Nearest Neighbor algorithm.
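Since the exam's pseudocode is not reproduced in this transcript, the following is a minimal sketch of the usual condensed 1-NN idea (add to Z any instance that the current Z misclassifies, and repeat until a full pass adds nothing); the NumPy data format and variable names are assumptions, not the exam's notation.

```python
# A minimal sketch of Condensed Nearest Neighbor with a 1-NN classifier.
# X: (N, d) NumPy array of instances, y: (N,) array of labels (assumed format).
import numpy as np

def condense(X, y):
    Z_idx = [0]                       # seed Z with one stored instance
    changed = True
    while changed:                    # keep passing over X until Z stops growing
        changed = False
        for i in range(len(X)):
            # classify x_i by its nearest neighbor in the current condensed set Z
            dists = np.linalg.norm(X[Z_idx] - X[i], axis=1)
            nearest = Z_idx[int(np.argmin(dists))]
            if y[nearest] != y[i]:    # misclassified by Z, so store this instance
                Z_idx.append(i)
                changed = True
    return X[Z_idx], y[Z_idx]
```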
8. (3%) Given a two-dimensional dataset as follows, please show the dendrogram of the complete-link clustering result. The complete-link distance between two groups $G_i$ and $G_j$ is
$$d(G_i, G_j) = \max_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s), \qquad d(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|.$$
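As an illustration of the distance above, a minimal sketch in Python; the 2-D points are hypothetical placeholders, not the exam's dataset.

```python
# Complete-link clustering on a hypothetical 2-D dataset (not the exam's data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 4.5], [9.0, 1.0]])

# Complete-link distance between two groups: the largest pairwise distance,
# with d(x^r, x^s) = sum_j |x_j^r - x_j^s| (city-block) as in the formula above.
def complete_link(Gi, Gj):
    return max(np.sum(np.abs(a - b)) for a in Gi for b in Gj)

print(complete_link(X[:2], X[2:4]))          # distance between two small groups

# The same linkage rule via SciPy, then the dendrogram of the result.
Z = linkage(X, method="complete", metric="cityblock")
dendrogram(Z)
plt.show()
```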
9. (3%) Given a k-nearest neighbor density estimate as follows:
$$\hat{p}(x) = \frac{k}{2 N d_k(x)}$$
where $d_k(x)$ is the distance to the k-th nearest sample and $N$ is the total number of samples. Given the result of the k-nearest neighbor density estimator as follows, what is the value of $k$?
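A minimal one-dimensional sketch of this estimator; the sample values below are hypothetical, not the data behind the exam's plot.

```python
# k-nearest-neighbor density estimate p_hat(x) = k / (2 * N * d_k(x)) in 1-D.
import numpy as np

def knn_density(x, sample, k):
    N = len(sample)
    d_k = np.sort(np.abs(sample - x))[k - 1]   # distance to the k-th nearest sample
    return k / (2 * N * d_k)

sample = np.array([1.0, 1.2, 2.0, 2.1, 3.5, 4.0])   # hypothetical sample
print(knn_density(2.5, sample, k=3))
```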
10. (4%) In nonparametric regression, given a running mean smoother as follows, please finish the graph with $h = 1$:
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$
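A minimal sketch of this smoother, following the $b(x, x^t)$ definition given above (same-bin indicator with bin width h = 1); the training pairs are hypothetical.

```python
# Running mean smoother with bin width h: g_hat(x) is the average of r^t over
# the training inputs x^t that fall in the same bin as x. Hypothetical data.
import numpy as np

def running_mean(x, X, R, h=1.0, origin=0.0):
    same_bin = np.floor((X - origin) / h) == np.floor((x - origin) / h)
    if not same_bin.any():
        return np.nan                  # no training point shares x's bin
    return R[same_bin].mean()

X = np.array([0.2, 0.7, 1.1, 1.8, 2.5, 3.3])   # x^t (hypothetical)
R = np.array([1.0, 1.4, 0.9, 1.1, 2.0, 2.2])   # r^t (hypothetical)
print(running_mean(1.5, X, R, h=1.0))          # average of r^t for x^t in [1, 2)
```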
11. (6%) Given a regression tree as follows.
(1) Please draw its corresponding regression result.
(2) Could you show one rule which is extracted from this regression tree?
(3) In this case, what are the meanings of a leaf node and an internal node in a decision tree?
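Since the exam's tree figure is not reproduced here, a generic sketch of the idea behind parts (2) and (3): each internal node tests an attribute against a threshold, each leaf stores the average output of the training instances reaching it, and a root-to-leaf path reads as a rule. The thresholds and values below are made up.

```python
# A hypothetical one-input regression tree (made-up thresholds and leaf values).
# The path x < 3.0 and x >= 1.0 corresponds to the rule:
#   IF (x < 3.0) AND (x >= 1.0) THEN y = 2.4
def predict(x):
    if x < 3.0:                          # internal node: a test on the input attribute
        return 1.2 if x < 1.0 else 2.4   # leaves: average output in each region
    return 4.0                           # leaf for the region x >= 3.0
```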
12. (4%) In the pairwise separation example as follows, $H_{ij}$ indicates the hyperplane separating the examples of $C_i$ from the examples of $C_j$. Please decide which class each region belongs to.
$$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$$
$$g_{ij}(x) \;\begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$
$$\text{choose } C_i \text{ if } \forall j \ne i,\; g_{ij}(x) > 0$$
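A minimal sketch of the decision rule above: x is assigned to $C_i$ when $g_{ij}(x) > 0$ for every $j \ne i$. The 2-D weight vectors are hypothetical, not the hyperplanes $H_{ij}$ of the exam figure.

```python
# Pairwise (one-vs-one) linear separation with hypothetical weights for K = 3.
import numpy as np

K = 3
# w[(i, j)], w0[(i, j)] parameterize g_ij(x) = w_ij^T x + w_ij0 for i < j;
# by convention g_ji(x) = -g_ij(x).
w = {(0, 1): np.array([1.0, -1.0]), (0, 2): np.array([1.0, 1.0]),
     (1, 2): np.array([0.0, 1.0])}
w0 = {(0, 1): 0.0, (0, 2): -2.0, (1, 2): -1.0}

def g(i, j, x):
    if (i, j) in w:
        return w[(i, j)] @ x + w0[(i, j)]
    return -(w[(j, i)] @ x + w0[(j, i)])

def choose_class(x):
    for i in range(K):
        if all(g(i, j, x) > 0 for j in range(K) if j != i):
            return i
    return None   # region where no class wins every pairwise test ("don't care")

print(choose_class(np.array([2.0, 0.5])))   # -> 0 for these hypothetical weights
```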
13. (6%) Given a classification tree construction algorithm as follows, where
$$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i \quad \text{(eq. 9.3)} \qquad \text{and} \qquad I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i \quad \text{(eq. 9.8)}$$
can you explain what the functions "GenerateTree" and "SplitAttribute" do?
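Offered as an orientation rather than the exam's pseudocode: a minimal sketch in which SplitAttribute picks the attribute minimizing the split entropy $I'_m$ of eq. 9.8, and GenerateTree recursively splits a node until its entropy $I_m$ (eq. 9.3) falls below a purity threshold theta. Discrete integer attribute values and the nested-dictionary tree format are assumptions.

```python
# Sketch of the two impurity measures and of GenerateTree / SplitAttribute for a
# classification tree over discrete attributes (assumed data format: NumPy arrays).
import numpy as np

def node_entropy(y, K):
    """I_m = -sum_i p_m^i log2 p_m^i  (eq. 9.3)."""
    p = np.bincount(y, minlength=K) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_entropy(values, y, K):
    """I'_m = -sum_j (N_mj / N_m) sum_i p_mj^i log2 p_mj^i  (eq. 9.8)."""
    N_m = len(y)
    return sum((np.sum(values == v) / N_m) * node_entropy(y[values == v], K)
               for v in np.unique(values))

def split_attribute(X, y, K):
    """Return the attribute whose split minimizes I'_m (None if nothing to split)."""
    candidates = [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) > 1]
    if not candidates:
        return None
    return min(candidates, key=lambda j: split_entropy(X[:, j], y, K))

def generate_tree(X, y, K, theta=0.2):
    """Grow the tree: make a leaf (majority class) if the node is pure enough,
    otherwise branch on the best attribute and recurse on each branch."""
    j = split_attribute(X, y, K)
    if node_entropy(y, K) < theta or j is None:
        return {"leaf": int(np.bincount(y, minlength=K).argmax())}
    return {"attribute": j,
            "branches": {int(v): generate_tree(X[X[:, j] == v], y[X[:, j] == v], K, theta)
                         for v in np.unique(X[:, j])}}
```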
14. (3%) Please assign the weights of the multilayer perceptron to solve the following problem.
15. (3%) In a neural network, can we have more than one hidden layer? Why or why not?
16. (3%) Why does a neural network overtrain (or overfit)?
17. (4%) (1) What are support vectors in a support vector machine?
(2) Given an example as follows, please show the support vectors.
Part III. Computation and Proof
1. (5%) Using principal components analysis, we can find a low-dimensional space such that when $x$ is projected there, information loss is minimized. Let the projection of $x$ on the direction of $w$ be $z = w^T x$. PCA finds $w$ such that $\mathrm{Var}(z)$ is maximized, where
$$\mathrm{Var}(z) = w^T \Sigma w \quad \text{and} \quad \mathrm{Var}(x) = E[(x - \mu)(x - \mu)^T] = \Sigma.$$
If $z_1 = w_1^T x$ with $\mathrm{Cov}(x) = \Sigma$, then $\mathrm{Var}(z_1) = w_1^T \Sigma w_1$, and we maximize $\mathrm{Var}(z_1)$ subject to $\|w_1\| = 1$.
Please show that the first principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
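A sketch of the standard Lagrange-multiplier argument, offered as a reference derivation rather than the official solution:

```latex
% Maximize w_1^T \Sigma w_1 subject to w_1^T w_1 = 1 with a Lagrange multiplier \alpha.
\begin{aligned}
L &= w_1^{T} \Sigma w_1 - \alpha \,(w_1^{T} w_1 - 1), \\
\frac{\partial L}{\partial w_1} &= 2 \Sigma w_1 - 2 \alpha w_1 = 0
  \;\Longrightarrow\; \Sigma w_1 = \alpha w_1, \\
\operatorname{Var}(z_1) &= w_1^{T} \Sigma w_1 = \alpha \, w_1^{T} w_1 = \alpha .
\end{aligned}
```

So any stationary point $w_1$ is an eigenvector of $\Sigma$ with eigenvalue $\alpha$, and $\mathrm{Var}(z_1) = \alpha$; the variance is therefore maximized by the eigenvector with the largest eigenvalue.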
2. (5%) Given a sample of two classes, $X = \{x^t, r^t\}_t$, where $r^t = 1$ if $x^t \in C_1$ and $r^t = 0$ if $x^t \in C_2$. In logistic discrimination, assuming that the log likelihood ratio is linear in the two-class case, the estimator of $P(C_1 \mid x)$ is the sigmoid function
$$y = \hat{P}(C_1 \mid x) = \frac{1}{1 + \exp\!\left[-(w^T x + w_0)\right]}.$$
We assume $r^t$, given $x^t$, is Bernoulli distributed. Then the sample likelihood is
$$l(w, w_0 \mid X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t},$$
and the cross-entropy is
$$E(w, w_0 \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right].$$
Please find the update equations of $w_j$ and $w_0$, where
$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j}, \qquad \Delta w_0 = -\eta \frac{\partial E}{\partial w_0}, \qquad j = 1, \ldots, d.$$
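A sketch of the usual chain-rule computation, offered as a reference rather than the official solution, using $dy/da = y(1 - y)$ for the sigmoid with $a = w^T x + w_0$:

```latex
% Gradient of the cross-entropy and the resulting update equations.
\begin{aligned}
\frac{\partial E}{\partial w_j}
  &= -\sum_t \left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t)\, x_j^t
   = -\sum_t (r^t - y^t)\, x_j^t, \\
\Delta w_j &= -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t (r^t - y^t)\, x_j^t,
\qquad
\Delta w_0 = \eta \sum_t (r^t - y^t).
\end{aligned}
```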