Introduction to machine learning, pattern recognition and statistical data modelling
Coryn Bailer-Jones
C.A.L. Bailer-Jones. Machine Learning. Introduction

What is machine learning?
- Data interpretation:
  - describing the relationship between predictors and responses
  - finding natural groups or classes in data
  - relating observables (or a function thereof) to physical quantities
- Prediction:
  - capturing the relationship between "inputs" and "outputs" for a set of labelled data, with the goal of predicting outputs for unlabelled data ("pattern recognition")

What is statistics?
- the systematic and quantitative way of making inferences from data
- the data are considered as the outcome of a random event
- variability is expressed by a probability distribution
- mathematics is used to manipulate probability distributions
- this allows us to write down statistical models for the data and solve for the parameters of interest

Learning from data: many names...
- machine learning
- statistical learning
- pattern recognition
- statistical data modelling
- data mining (although this has other meanings)
- multivariate data analysis
- ...

Parameter estimation from stellar spectra (Willemsen et al. 2005)
- learn the mapping from spectra to parameters using labelled examples
- multidimensional (a few hundred dimensions)
- nonlinear inverse problem
[Figures: Willemsen et al. (2005), parameter estimation from stellar spectra]

Galaxy spectral classification
- want to classify galaxies based on the appearance of their observed spectra
- can simulate spectra of known classes
- much variance within these basic classes
(www.astro.princeton.edu)
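The inverse-mapping idea behind the stellar-spectra example (predict a physical parameter for an unlabelled spectrum from a library of labelled ones) can be sketched on entirely synthetic data. The course itself uses R; the toy below uses Python/NumPy instead, and the "spectra", the temperature labels, and the nearest-neighbour estimator are all invented for illustration — they are not the method or data of Willemsen et al. (2005).

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for a grid of labelled stellar spectra:
# 500 "spectra", each with 300 pixels, whose shape depends
# monotonically on a known temperature label.
temps = rng.uniform(4000, 8000, size=500)                # labels (K)
wave = np.linspace(0, 1, 300)                            # pixel axis
spectra = np.exp(-np.outer(1.0 / temps, 5000 * wave)) \
          + 0.01 * rng.normal(size=(500, 300))           # + noise

def estimate_temp(obs, k=5):
    """Label an unlabelled spectrum by averaging the labels of
    its k nearest labelled neighbours (Euclidean distance)."""
    idx = np.argsort(((spectra - obs) ** 2).sum(axis=1))[:k]
    return temps[idx].mean()

# An "observed" spectrum generated at a known true temperature
obs = np.exp(-(1.0 / 6000.0) * 5000 * wave) + 0.01 * rng.normal(size=300)
t_hat = estimate_temp(obs)
```

Because the synthetic spectra vary monotonically with temperature, nearby spectra have nearby labels, which is exactly the "locally constant function" assumption discussed later for k-nearest neighbours.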
Galaxy spectral classification (continued)
[Figures, Tsalmantza et al. (2007): optical synthetic spectra of the 8 basic galaxy types (SDSS filters overlaid); a colour-colour diagram showing the 8 basic types compared to the locus of SDSS galaxies; the locus of 10,000 simulated galaxies in which various parameters have been varied; the 10,000 simulated galaxy spectra projected into the space of the first 3 principal components (these plus the mean explain 99.7% of the variance); simulated Gaia spectra of the 8 basic galaxy types.]

Course objectives
- learn the basic concepts of machine learning
- learn the basic tools and methods of machine learning
  - linear and nonlinear methods
  - methods for high-dimensional data
- identify appropriate methods
- interpret results in light of the methods used
- recognise inherent limitations and assumptions
- become familiar with a freely available package for modelling data (R)

Lecture schedule
(schedule table omitted)

Online material and texts
http://www.mpia.de/homes/calj/ss2007_mlpr.html
- viewgraphs
- R scripts used in the lectures
- bibliography, links to articles
- recommended book: The Elements of Statistical Learning, Hastie et al. (2001), Springer
- links to R tutorials

The course
- R (S, S-PLUS)
- assumed knowledge:
  - simple probability theory and statistics (distributions, hypothesis testing, least squares, ...)
  - linear algebra (matrices, eigenvalue problems, ...)
  - elementary calculus
- interact: learn by being active, not passive! Question, criticize, etc.
- hands-on: learn a package and play with data
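The principal-component projection used in the galaxy-classification figures above (a few components plus the mean explaining almost all of the variance) can be sketched as follows. This is a Python/NumPy stand-in, not the Tsalmantza et al. data: the "spectra" are synthetic, built from 3 latent factors precisely so that 3 principal components dominate.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 synthetic "spectra" in 50 bands, generated from 3 latent
# factors plus a little noise, so 3 principal components dominate.
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 50))
spectra = latent @ loadings + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD of the mean-centred data matrix
centred = spectra - spectra.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per PC

# Project every spectrum into the space of the first 3 PCs
scores = centred @ Vt[:3].T              # shape (200, 3)
```

The rows of `Vt` are the principal components; `scores` is the low-dimensional representation one would plot, as in the 3-PC projection of the 10,000 simulated galaxy spectra.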
R
http://www.r-project.org
- "a language and environment for statistical computing and graphics"
- open source; runs on Linux, Windows, MacOS
- operations for vectors and matrices
- large number of statistical and machine learning packages
- can link to C, C++ and Fortran code
- good book on using R for statistics: Modern Applied Statistics with S, Venables & Ripley, 2002, Springer

Supervised and unsupervised learning
Supervised learning
- for each observed vector (the "predictors", "inputs", "independent variables"), x, there are one or more dependent variables (the "responses", "outputs"), y, or two or more classes, C
- regression problems: the goal is to learn a function, y = f(x; θ), where θ is a set of parameters to be inferred from a training set of pre-labelled vectors {x, y}
- classification problems: the goal is either to define decision boundaries between objects with different classes, or to model the density of the class probabilities over the data, i.e. P(C = C_1) = f(x; θ), where θ parametrizes the probability density function (PDF) and is learned from a training set of pre-classified vectors {x, C}
Unsupervised learning
- no pre-labelled data or pre-defined dependent variables or classes
- the goal is to find either "natural" classes/clusterings in the data, or simpler (e.g. lower-dimensional) variables which explain the data
- examples: PCA, K-means clustering

A simple problem of data fitting
(see the R scripts on the web page)

Learning, generalization and regularization
- make assumptions about the smoothness of the function: regularization
- take into account the variance (errors) in the data: generalization
- domain knowledge helps (if it's reliable...)
- this is data interpolation; extrapolation is less constrained
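Of the two unsupervised examples named above, K-means clustering is simple enough to sketch in a few lines. A minimal Lloyd's-algorithm implementation, in Python/NumPy rather than the course's R, on invented data (two well-separated Gaussian blobs; no labels are given to the algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated synthetic groups; the algorithm never sees
# which point came from which group.
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])

def kmeans(X, k, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centre update."""
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre
        labels = np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)
        # move each centre to the mean of its assigned points
        # (keep the old centre if a cluster goes empty)
        new = []
        for j in range(k):
            pts = X[labels == j]
            new.append(pts.mean(axis=0) if len(pts) else centres[j])
        centres = np.array(new)
    return labels, centres

labels, centres = kmeans(X, k=2)
```

On data this clean the recovered clusters coincide with the two generating groups; with overlapping groups the result depends on initialization, which is part of why "natural" classes are harder to define than pre-labelled ones.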
Notation (as used by Hastie et al.)
- input variables are denoted by X, output variables by Y
- there are i = 1..N observations, each with j = 1..p dimensions
- upper case is used to refer to generic aspects of variables; if a variable is a vector, subscripts access its components
- specific observed values are written in lower case, x
- bold is used for vectors and matrices: vectors in lower case (x, x_i), matrices in upper case (X); Hastie et al. do not use bold for p-vectors
- parameters use Greek letters (\alpha, \beta, \theta) and can, of course, also be vectors

Fitting a model: linear least squares
The linear model is
  \hat{Y} = \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X^T \beta ,
where X and \beta are p x 1 column vectors (the intercept \beta_0 can be absorbed by including a constant 1 in X).
Determine the parameters by minimizing the sum-of-squares error over all N training data {x_i, y_i}:
  RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta) ,
where x_i is a p x 1 column vector and X is the N x p matrix whose rows are the x_i^T. This is quadratic in \beta, so it always has a minimum. Differentiating with respect to \beta and setting the result to zero gives
  X^T (y - X\beta) = 0 ,  i.e.  X^T X \beta = X^T y .
If X^T X (the "information matrix") is non-singular, then the unique solution is
  \hat{\beta} = (X^T X)^{-1} X^T y .

2-class classification: linear decision boundary
- fit \hat{Y} = X^T \hat{\beta} by least squares, coding G_i = 0 for the green class and G_i = 1 for the red class
- the decision boundary is X^T \hat{\beta} = 0.5
[Figure: © Hastie, Tibshirani, Friedman (2001)]

Comparison
Linear regression
- learns the relationship between (X, y) and encapsulates it in the parameters \beta
- makes a very strong assumption about the data, viz. that it is well approximated by a globally linear function
- stable but biased
K-nearest neighbours
- makes no assumption about the functional form of the relationship (X, y), i.e. it is nonparametric
- but does assume that the function is well approximated by a locally constant function
- less stable but less biased
- no free parameters to learn, so application to new data is relatively slow: a brute-force search for the neighbours takes O(N)
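The closed-form least-squares solution \hat{\beta} = (X^T X)^{-1} X^T y can be checked numerically. A Python/NumPy sketch (the course's scripts are in R) on synthetic data generated from a known \beta, solving the normal equations rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training set: N = 100 observations in p = 3 dimensions,
# generated from a known linear model plus noise.
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# RSS(beta) = (y - X beta)^T (y - X beta) is quadratic in beta;
# the zero-gradient condition is the normal equations
# X^T X beta = X^T y, solved here directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def rss(beta):
    """Residual sum of squares for a candidate parameter vector."""
    r = y - X @ beta
    return r @ r
```

Since the RSS surface is quadratic with a unique minimum (X^T X non-singular here), `beta_hat` recovers `beta_true` up to the noise, and no other \beta — including the true one — can have a smaller RSS on this training set.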
2-class classification: K-nearest neighbours
[Figure: © Hastie, Tibshirani, Friedman (2001)]

Which solution is optimal?
- if we know nothing about how the data were generated (underlying model, noise), we don't know
- if the data are drawn from two uncorrelated Gaussians, the linear decision boundary is almost optimal
- if the data are drawn from a mixture of multiple distributions, a linear boundary is not optimal (the optimal boundary is nonlinear, possibly disjoint)
- what is optimal? the smallest generalization error; a simple solution (interpretability)
- more complex models permit lower errors on the training data, but we want models to generalize, so we need to control complexity/nonlinearity (regularization)
- with enough training data, wouldn't k-nn be best?

Summary
- supervised vs. unsupervised learning
- generalization and regularization
- regression vs. classification
- parametric vs. non-parametric
- linear regression, k-nearest neighbours
- least squares
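The linear-boundary-versus-k-NN comparison discussed above can be reproduced on synthetic data. A Python/NumPy sketch (not the Hastie et al. figure data): two classes drawn from Gaussians, a least-squares linear rule with the 0.5 threshold from the slides, and a brute-force k-NN majority vote.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes drawn from Gaussians, coded g = 0 ("green") and
# g = 1 ("red") as in the regression-on-indicators formulation.
X0 = rng.normal(-1.0, 1.0, size=(100, 2))
X1 = rng.normal(+1.0, 1.0, size=(100, 2))
X = np.vstack([X0, X1])
g = np.r_[np.zeros(100), np.ones(100)]

# Linear rule: fit g ~ X by least squares (with intercept) and
# classify as class 1 where the fitted value exceeds 0.5.
A = np.hstack([np.ones((len(X), 1)), X])
beta = np.linalg.solve(A.T @ A, A.T @ g)
linear_pred = (A @ beta > 0.5).astype(int)

def knn_predict(x, k=15):
    """Majority vote among the k nearest training points
    (brute-force search, O(N) per query)."""
    idx = np.argsort(((X - x) ** 2).sum(axis=1))[:k]
    return int(g[idx].mean() > 0.5)

knn_pred = np.array([knn_predict(x) for x in X])
```

On Gaussian classes like these the linear boundary is close to optimal, so both rules classify the training set well; on multimodal or disjoint classes the linear rule would degrade while k-NN, being only locally constant, would not.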