Multivariate Statistics
Thomas Asendorf, Steffen Unkel
Study sheet 2
Summer term 2017
Date: 28 April 2017

Exercise 1: Consider the two data matrices

$$X_1 = \begin{pmatrix} 3 & 7 \\ 2 & 4 \\ 4 & 7 \end{pmatrix} \quad \text{and} \quad X_2 = \begin{pmatrix} 6 & 9 \\ 5 & 7 \\ 4 & 8 \end{pmatrix},$$

for which the sample mean vectors are

$$\bar{x}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix} \quad \text{and} \quad \bar{x}_2 = \begin{pmatrix} 5 \\ 8 \end{pmatrix},$$

and the pooled sample covariance matrix is

$$S = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.$$

(a) Calculate Fisher's linear discriminant function coefficients.
(b) Based on the results obtained in (a), classify the observation $x_0^\top = (2 \;\; 7)$ as population 1 or population 2.

Exercise 2: Consider predicting whether a patient has diabetes on the basis of a quick, spontaneous blood measurement of the glucose concentration. Imagine that we know the true marginal continuous distributions of concentrations in the patients who do not have diabetes and in those who do. Let $X$ and $Y$ be continuous random variables with probability density functions (pdfs) $f_X(x)$ and $f_Y(y)$, respectively, where $X$ represents concentrations in patients who do not have diabetes and $Y$ represents concentrations in patients who do have diabetes. An observation above a given threshold $\theta$ is classified as having diabetes, while values below or equal to $\theta$ are classified as not having diabetes.

(a) Formulate the true positive rate and the false positive rate using the pdfs mentioned above.
(b) Show that for the area under the ROC curve (AUC) the following equation holds:

$$\mathrm{AUC} = P(X \le Y).$$

Exercise 3: Let $f_Y(y)$ and $f_Z(z)$ be the pdfs associated with the continuous random variables $Y$ and $Z$ for the populations $G_1$ and $G_2$, respectively. An observation must be assigned to one and only one of the two populations. Let $p$ be the prior probability of $G_1$ and $1 - p$ be the prior probability of $G_2$. We want to minimize the total probability of misclassification (TPM):

$$\mathrm{TPM} = P(\text{misclassifying a } G_1 \text{ observation or misclassifying a } G_2 \text{ observation}).$$

(a) Show that a cut-point $\theta$ which minimizes the TPM fulfills:

$$\frac{f_Y(\theta)}{f_Z(\theta)} = \frac{1 - p}{p}.$$

(b) We now assume that $Y \sim N(\mu_y, \sigma^2)$ and $Z \sim N(\mu_z, \sigma^2)$.
Show that the optimal cut-point $\theta$ fulfills:

$$\theta = \frac{\sigma^2 \left( \ln(1 - p) - \ln(p) \right)}{\mu_y - \mu_z} + \frac{\mu_y + \mu_z}{2}.$$

Exercise 4: When the number of predictors (features) $p$ is large, the performance of K-nearest neighbour (KNN) and other local classifiers tends to deteriorate, because these methods predict using only observations that are near the test observation for which a prediction must be made. In this exercise we will look further into this matter.

(a) To begin with, we assume that there is only a single feature. Suppose we have a set of observation pairs $(y_i, x_i)$ for $i = 1, \dots, n$, where $x_1, \dots, x_n$ are samples drawn from a uniformly distributed random variable on $[0, 1]$ and $y_1, \dots, y_n$ are values of some response variable. Suppose that we wish to predict a test observation's response using only observations that are within the 10% of the range of $x = (x_1, \dots, x_n)^\top$ closest to that test observation. For example, in order to predict the response $y_0$ for a test observation with $x_0 = 0.6$, we will use all available observations from $x$ in the range $[0.55, 0.65]$. On average, what fraction of the available observations will be used to make the prediction?
(b) Now suppose that we have a set of $p$ features. The $n$ measurements on the $p$ features are stored in an $n \times p$ data matrix $X = (x_1^\top, x_2^\top, \dots, x_n^\top)^\top$, where $x_i = (x_{i1}, \dots, x_{ip})^\top$ for $i = 1, \dots, n$. We assume that $x_1, x_2, \dots, x_n$ are samples drawn from uniformly distributed random variables on $[0, 1]^p$. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will be used to make the prediction? How high is the fraction for $p = 100$?
(c) Using your findings obtained in (a) and (b), do you see any drawbacks associated with KNN?
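The shrinking neighbourhood described in parts (a) and (b) of Exercise 4 can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the test point is placed at the centre of the unit cube so that no neighbourhood is cut off by the boundary, and the function names `expected_fraction` and `simulated_fraction` are hypothetical helpers introduced here, not part of the exercise.

```python
import random


def expected_fraction(p, width=0.1):
    """Volume of a hypercube with side `width` inside [0, 1]^p.

    Assumption: the neighbourhood lies fully inside the unit cube.
    """
    return width ** p


def simulated_fraction(p, n=100_000, width=0.1, center=0.5, seed=0):
    """Monte Carlo estimate of the share of n uniform points on [0, 1]^p
    that lie within width/2 of `center` in every one of the p coordinates."""
    rng = random.Random(seed)
    half = width / 2
    hits = 0
    for _ in range(n):
        # A point counts only if all p coordinates fall inside the window.
        if all(abs(rng.random() - center) <= half for _ in range(p)):
            hits += 1
    return hits / n


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, expected_fraction(p), simulated_fraction(p))
    # For large p simulation is pointless: almost no point ever qualifies.
    print(100, expected_fraction(100))
```

For small p the simulated share should sit close to the analytic volume; for large p the analytic value is the only practical way to see how few observations remain "local".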
Exercise 5: In this exercise, we will compare different classification techniques to predict whether a given car gets high or low gas mileage. We will use the Auto data set contained in the R package ISLR.

(a) Create a binary variable, mpg01, that is equal to 1 if mpg is above its median and 0 otherwise. Save the newly created variable in a data frame called dataset. Further, add the variables weight and year from Auto to the data frame dataset.
(b) Split dataset into a training set and a test set by randomly allocating 70% of all observations to a training set and the remaining 30% to a test set.
(c) Perform linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) on the training data. Compute the training error and the test error of your chosen models. Visualize your findings.
(d) Perform KNN on the training data, with several values of K. What test errors do you obtain? Plot the test errors versus the value of K, to eyeball an "optimal" choice of K.
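Exercise 5 is meant to be solved in R with the Auto data. Purely as an illustration of the workflow in parts (b) and (d), the sketch below uses Python's standard library instead. It makes several assumptions that are not in the exercise: the data are synthetic (two already standardised features standing in for weight and year, with an artificial binary label), the classifier is a plain Euclidean-distance KNN written by hand, and all function names (`make_data`, `split`, `knn_predict`, `error_rate`) are hypothetical.

```python
import random
from collections import Counter


def make_data(n=300, seed=1):
    """Synthetic stand-in for (weight, year) -> mpg01. This is NOT the Auto data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 is shifted by +2 in both features; unit variance in each class.
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        rows.append((x, label))
    return rows


def split(rows, train_share=0.7, seed=2):
    """Randomly allocate `train_share` of the rows to training, the rest to test."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(round(train_share * len(rows)))
    return rows[:cut], rows[cut:]


def knn_predict(train, x0, k):
    """Majority vote among the k training points closest to x0 (Euclidean)."""
    nearest = sorted(
        train,
        key=lambda r: (r[0][0] - x0[0]) ** 2 + (r[0][1] - x0[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(train, test, k):
    """Share of test observations that KNN with the given k misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)


if __name__ == "__main__":
    train, test = split(make_data())
    # Odd values of k avoid tied votes in a two-class problem.
    for k in (1, 3, 5, 11, 21):
        print(k, round(error_rate(train, test, k), 3))
```

With real data such as weight and year, the features live on very different scales, so they would normally be standardised before computing distances; the printed error rates can then be plotted against k to pick a reasonable value by eye.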