Multivariate Statistics
Thomas Asendorf, Steffen Unkel
Study sheet 2
Summer term 2017
Exercise 1:
Consider the two data matrices
\[
X_1 = \begin{pmatrix} 3 & 7 \\ 2 & 4 \\ 4 & 7 \end{pmatrix}
\quad\text{and}\quad
X_2 = \begin{pmatrix} 6 & 9 \\ 5 & 7 \\ 4 & 8 \end{pmatrix},
\]
for which the sample mean vectors are
\[
\bar{x}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix}
\quad\text{and}\quad
\bar{x}_2 = \begin{pmatrix} 5 \\ 8 \end{pmatrix},
\]
and the pooled sample covariance matrix is
\[
S = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]
(a) Calculate Fisher’s linear discriminant function coefficients.
(b) Based on the results obtained in (a), classify the observation $x_0^\top = (2, 7)$ as belonging to population 1 or population 2.
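For checking the hand calculation, a minimal R sketch is given below. It uses the usual two-group form of Fisher's allocation rule, with coefficient vector $a = S^{-1}(\bar{x}_1 - \bar{x}_2)$ and allocation of $x_0$ to population 1 whenever $a^\top x_0$ is at least the midpoint $\frac{1}{2}\,a^\top(\bar{x}_1 + \bar{x}_2)$.
\begin{verbatim}
## Fisher's linear discriminant for Exercise 1 (verification sketch)
xbar1 <- c(3, 6)
xbar2 <- c(5, 8)
S     <- matrix(c(1, 1,
                  1, 2), nrow = 2, byrow = TRUE)

a <- solve(S) %*% (xbar1 - xbar2)      # discriminant coefficients S^{-1}(xbar1 - xbar2)
m <- 0.5 * t(a) %*% (xbar1 + xbar2)    # midpoint of the projected group means

x0 <- c(2, 7)
if (t(a) %*% x0 >= m) "population 1" else "population 2"
\end{verbatim}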
Exercise 2:
Consider predicting whether a patient has diabetes on the basis of a quick, spontaneous measurement of the blood glucose concentration. Imagine that we know the true marginal
continuous distributions of concentrations in the patients who do not have diabetes and
in those who do have diabetes. Let X and Y be continuous random variables with probability
density functions (pdfs) $f_X(x)$ and $f_Y(y)$, respectively, where X represents concentrations in
the patients who do not have diabetes and Y represents concentrations in patients who do
have diabetes. An observation above a given threshold $\theta$ is classified as having diabetes, while
values below or equal to $\theta$ are classified as not having diabetes.
(a) Formulate the true positive rate and false positive rate using the pdfs mentioned above.
(b) Show that for the area under the ROC curve (AUC) the following equation holds:
AUC = P(X ≤ Y ).
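As an informal numerical check of the identity in (b), both sides can be approximated by simulation. The distributions used below are illustrative assumptions only and are not part of the exercise: $X \sim N(90, 10^2)$ for non-diabetic and $Y \sim N(120, 15^2)$ for diabetic concentrations. The Mann-Whitney (rank) form of the empirical AUC should then be close to the Monte Carlo estimate of P(X ≤ Y).
\begin{verbatim}
## Monte Carlo check of AUC = P(X <= Y); the distributions are assumed for illustration
set.seed(1)
n <- 1e5
x <- rnorm(n, mean =  90, sd = 10)    # concentrations without diabetes (assumption)
y <- rnorm(n, mean = 120, sd = 15)    # concentrations with diabetes (assumption)

## Monte Carlo estimate of P(X <= Y)
mean(x <= y)

## Empirical AUC via the Mann-Whitney / rank formulation
r   <- rank(c(x, y))                  # ranks of all values combined
auc <- (sum(r[(n + 1):(2 * n)]) - n * (n + 1) / 2) / (n * n)
auc                                   # should be close to mean(x <= y)
\end{verbatim}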
Exercise 3:
Let $f_Y(y)$ and $f_Z(z)$ be the pdfs associated with the continuous random variables Y and Z
for the populations G1 and G2, respectively. An observation must be assigned to one and
only one of the two populations. Let p be the prior probability of G1 and 1 − p be the prior
probability of G2. We want to minimize the total probability of misclassification (TPM):
TPM = P(misclassifying a G1 observation or misclassifying a G2 observation).
(a) Show that a cut-point $\theta$ which minimizes the TPM fulfills
\[
\frac{f_Y(\theta)}{f_Z(\theta)} = \frac{1-p}{p}.
\]
(b) We now assume that $Y \sim N(\mu_y, \sigma^2)$ and $Z \sim N(\mu_z, \sigma^2)$. Show that the optimal cut-point $\theta$ fulfills
\[
\theta = \frac{\sigma^2\left(\ln(1-p) - \ln(p)\right)}{\mu_y - \mu_z} + \frac{\mu_y + \mu_z}{2}.
\]
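For part (b), a quick grid-based check in R is sketched below. The parameter values are arbitrary assumptions, and the TPM is written for the convention, consistent with (a) when $\mu_y < \mu_z$, that values less than or equal to the cut-point are assigned to G1.
\begin{verbatim}
## Numerical sanity check of the cut-point formula in (b); parameters are assumed
mu_y  <- 2     # mean under G1 (assumption)
mu_z  <- 5     # mean under G2 (assumption)
sigma <- 1.5   # common standard deviation (assumption)
p     <- 0.3   # prior probability of G1 (assumption)

## TPM when values <= theta are assigned to G1 and values > theta to G2
tpm <- function(theta) {
  p * (1 - pnorm(theta, mu_y, sigma)) + (1 - p) * pnorm(theta, mu_z, sigma)
}

## Closed-form cut-point from part (b)
theta_formula <- sigma^2 * (log(1 - p) - log(p)) / (mu_y - mu_z) + (mu_y + mu_z) / 2

## Minimising the TPM over a fine grid should give (almost) the same value
grid <- seq(mu_y - 4 * sigma, mu_z + 4 * sigma, length.out = 1e5)
theta_grid <- grid[which.min(tpm(grid))]

c(formula = theta_formula, grid = theta_grid)
\end{verbatim}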
Exercise 4:
When the number of predictors (features) p is large, there tends to be a deterioration in the
performance of K-nearest neighbour (KNN) and other local classifiers that perform prediction
using only observations that are near the test observation for which a prediction must be
made. In this exercise we will look further into this matter.
(a) To begin with, we assume that there is only a single feature and suppose we have a
set of observation pairs $(y_i, x_i)$ for $i = 1, \ldots, n$, where $x_1, \ldots, x_n$ are samples drawn
from a uniformly distributed random variable on $[0, 1]$ and $y_1, \ldots, y_n$ are observations of
some response variable. Suppose that we wish to predict a test observation's response using only
observations that are within 10% of the range of $x = (x_1, \ldots, x_n)^\top$ closest to that test
observation. For example, in order to predict the response $y_0$ for a test observation
with $x_0 = 0.6$, we will use all available observations from $x$ in the range $[0.55, 0.65]$. On
average, what fraction of the available observations will be used to make the prediction?
(b) Now suppose that we have a set of $p$ features. The $n$ measurements on the $p$ features
are stored in an $n \times p$ data matrix $X = (x_1^\top, x_2^\top, \ldots, x_n^\top)^\top$, where $x_i = (x_{i1}, \ldots, x_{ip})^\top$
for $i = 1, \ldots, n$. We assume that $x_1, x_2, \ldots, x_n$ are samples drawn from uniformly
distributed random variables on $[0, 1]^p$. We wish to predict a test observation's response
using observations within the 10% of each feature's range that is closest to that test
observation. What fraction of the available observations will be used to make the
prediction? How high is the fraction for $p = 100$? (A small simulation sketch illustrating (a) and (b) is given after this exercise.)
(c) Using your findings obtained in (a) and (b), do you see any drawbacks associated with
KNN?
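The simulation sketch referred to in (b) follows. It is only an illustration of the expected neighbourhood fractions in (a) and (b) under the uniform assumptions; the Monte Carlo sizes and the grid of dimensions are arbitrary choices.
\begin{verbatim}
## Expected fraction of observations within the 10% neighbourhood of a test point
## for uniform features on [0, 1]^p (Monte Carlo illustration; sizes are arbitrary)
set.seed(1)
n <- 1e4

frac_used <- function(p, n_test = 200) {
  X <- matrix(runif(n * p), nrow = n, ncol = p)        # training features
  mean(replicate(n_test, {
    x0   <- runif(p)                                   # a random test observation
    near <- abs(X - matrix(x0, n, p, byrow = TRUE)) <= 0.05
    mean(rowSums(near) == p)                           # close on every one of the p features
  }))
}

sapply(c(1, 2, 5, 10), frac_used)   # roughly 0.1^p; edge effects make it slightly smaller
\end{verbatim}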
Exercise 5:
In this exercise, we will compare different classification techniques to predict whether a given
car gets high or low gas mileage. We will use the Auto data set contained in the R package
ISLR.
(a) Create a binary variable, mpg01, that is equal to 1 if mpg is above its median and 0
otherwise. Save the newly created variable in a data frame called dataset. Further,
add the variables weight and year from Auto to the data frame dataset.
(b) Split dataset into a training set and a test set by randomly allocating 70% of all
observations to a training set and the remaining 30% to a test set.
(c) Perform linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA)
on the training data. Compute the training error and the test error of your chosen
models. Visualize your findings.
(d) Perform KNN on the training data with several values of K. What test errors do you
obtain? Plot the test errors versus the value of K to eyeball an “optimal” choice of K.
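A possible starting point in R is sketched below; it covers steps (a)-(c) with LDA as the worked example and assumes the ISLR and MASS packages are installed. The seed and the exact split are arbitrary choices, while the predictors weight and year follow the exercise text; QDA works analogously with qda(), and knn() from the class package together with a loop over K would cover part (d).
\begin{verbatim}
## Sketch for Exercise 5, parts (a)-(c) with LDA; QDA and KNN proceed analogously
library(ISLR)   # Auto data
library(MASS)   # lda(), qda()

## (a) binary response mpg01 plus the predictors weight and year
dataset <- data.frame(mpg01  = as.integer(Auto$mpg > median(Auto$mpg)),
                      weight = Auto$weight,
                      year   = Auto$year)

## (b) random 70/30 split into training and test set (seed chosen arbitrarily)
set.seed(1)
train_id <- sample(nrow(dataset), size = round(0.7 * nrow(dataset)))
train    <- dataset[train_id, ]
test     <- dataset[-train_id, ]

## (c) LDA; training and test error as misclassification rates
fit_lda   <- lda(mpg01 ~ weight + year, data = train)
train_err <- mean(predict(fit_lda, train)$class != train$mpg01)
test_err  <- mean(predict(fit_lda, test)$class  != test$mpg01)
c(train = train_err, test = test_err)
\end{verbatim}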
Date: 28 April 2017