Introduction to machine learning, pattern recognition
and statistical data modelling
Coryn Bailer-Jones
What is machine learning?
Data interpretation
– describing relationship between predictors and responses
– finding natural groups or classes in data
– relating observables (or a function thereof) to physical quantities
Prediction
– capturing relationship between “inputs” and “outputs” for a set of
labelled data with the goal of predicting outputs for unlabelled data
(“pattern recognition”)
What is statistics?
the systematic and quantitative way of making inferences from data
data is considered as the outcome of a random event
– variability is expressed by a probability distribution
mathematics is used to manipulate probability distributions
– allows us to write down statistical models for the data and solve for the parameters of interest
Learning from data
Many names...
machine learning
statistical learning
pattern recognition
statistical data modelling
data mining (although this has other meanings)
multivariate data analysis
...
Parameter estimation from stellar spectra
learn mapping from spectra to parameters using labelled examples
multidimensional (few hundred)
nonlinear
inverse
Willemsen et al. (2005)
[Figure: parameter estimation from stellar spectra, Willemsen et al. (2005)]
Galaxy spectral classification
want to classify galaxies based on the appearance of their observed spectra
can simulate spectra of known classes
much variance within these basic classes
www.astro.princeton.edu
[Figures, Tsalmantza et al. (2007):
Right: optical synthetic spectra of 8 basic galaxy types (SDSS filters overlaid).
Top right: colour-colour diagram showing 8 basic types compared to locus of SDSS galaxies.
Above: locus of 10,000 simulated galaxies in which various parameters have been varied.
Above: the 10,000 simulated galaxy spectra projected into the space of the first 3 Principal Components (these plus the mean explain 99.7% of the variance).
Top right: simulated Gaia spectra of the 8 basic galaxy types.]
Course objectives
learn the basic concepts of machine learning
learn the basic tools and methods of machine learning
– identify appropriate methods
– interpret results in light of methods used
– recognise inherent limitations and assumptions
– linear and nonlinear methods
– methods for high dimensional data
become familiar with a freely available package for modelling data (R)
Lecture schedule
Online material and texts
http://www.mpia.de/homes/calj/ss2007_mlpr.html
viewgraphs
R scripts used in lectures
bibliography, links to articles
recommended book
– The Elements of Statistical Learning, Hastie et al. (2001), Springer
links to R tutorials
The course
assumed knowledge
– simple probability theory and statistics (distributions, hypothesis testing, least squares, ...)
– linear algebra (matrices, eigenvalue problems, ...)
– elementary calculus
interact
– learn by being active, not passive!
– question, criticize, etc.
– hands-on: learn a package and play with data
R (S, S-PLUS)
http://www.r-project.org
“a language and environment for statistical computing and graphics”
open source
runs on Linux, Windows, MacOS
operations for vectors and matrices
large number of statistical and machine learning packages
can link to C, C++ and Fortran code
Good book on using R for statistics
– Modern Applied Statistics with S, Venables & Ripley, 2002, Springer
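To give a flavour of these vector and matrix operations, a few illustrative lines of R (a toy sketch with made-up numbers, not one of the lecture scripts):

# a vector and a (symmetric) 3 x 3 matrix
x <- c(1, 2, 3)
A <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4), nrow = 3, byrow = TRUE)
A %*% x           # matrix-vector product
solve(A, x)       # solve the linear system A b = x for b
eigen(A)$values   # eigenvalue problem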
Supervised and unsupervised learning
Supervised learning
– for each observed vector (the “predictors”, “inputs”, “independent variables”), x, there are one or more dependent variables (“responses”, “outputs”), y, or two or more classes, C.
regression problems: goal is to learn a function, y = f(x; θ), where θ is a set of parameters to be inferred from a training set of pre-labelled vectors {x, y}
classification problems: goal is either to define decision boundaries between objects with different classes, or to model the density of the class probabilities over the data, i.e. P(C = C_1) = f(x; θ), where θ parametrizes the probability density function (PDF) and is learned from a training set of pre-classified vectors {x, C}
Unsupervised learning
– no pre-labelled data or pre-defined dependent variables or classes
– goal is to find either
“natural” classes/clusterings in data, or
simpler (e.g. lower dimensional) variables which explain the data
– Examples: PCA, K-means clustering
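As a toy illustration of the two settings in R (the data below are simulated purely for illustration; they are not the lecture examples):

# supervised: regression on labelled pairs {x, y}
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.2)         # labelled training data
fit <- lm(y ~ x)                              # learn y = f(x; theta)
predict(fit, newdata = data.frame(x = 0.5))   # predict output for an unlabelled input

# unsupervised: find structure in unlabelled vectors
z <- matrix(rnorm(200 * 5), ncol = 5)         # 200 unlabelled 5-dimensional vectors
kmeans(z, centers = 3)$size                   # K-means clustering: sizes of the 3 clusters
summary(prcomp(z))                            # PCA: variance explained by each component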
A simple problem of data fitting
learning from data – see R scripts on web page
Learning, generalization and regularization
make assumptions about smoothness of function
– regularization
– generalization
take into account variance (errors) in data
domain knowledge helps (if it's reliable...)
this is data interpolation
– extrapolation is less constrained
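A minimal R sketch of this point (simulated data, chosen only to illustrate smoothness assumptions and generalization; the lecture's own examples are in the R scripts on the web page):

# noisy samples of a smooth underlying function
set.seed(2)
x <- sort(runif(30))
y <- sin(2 * pi * x) + rnorm(30, sd = 0.2)

rough  <- lm(y ~ poly(x, 15))   # very flexible fit: follows the noise
smooth <- lm(y ~ poly(x, 3))    # smoother fit: regularized by its low complexity

# generalization: compare errors on new points from the same function (interpolation)
xt <- seq(0.1, 0.9, by = 0.05)
yt <- sin(2 * pi * xt)
mean((predict(rough,  data.frame(x = xt)) - yt)^2)   # typically the larger test error
mean((predict(smooth, data.frame(x = xt)) - yt)^2)   # typically the smaller test error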
Notation (as used by Hastie et al.)
Input variables denoted by X
Output variables denoted by Y
i=1..N observations with j=1..p dimensions
Upper case used to refer to generic aspects of variables
– if a vector, subscripts access components X_j
– specific observed values are written in lower case, x_i
Bold is used for vectors and matrices
– vectors in lower case x, x_i
– matrices in upper case X
– Hastie et al. do not use bold for p-vectors
Parameters use Greek letters (e.g. α, β) – can also be vectors, of course
Fitting a model: linear least squares
The linear model is
Ŷ = β̂_0 + Σ_{j=1}^{p} X_j β̂_j = X^T β̂
where X and β̂ are p x 1 column vectors (the constant 1 is included in X to absorb β̂_0).
Determine the parameters by minimizing the sum-of-squares error on all N training data:
RSS(β) = Σ_{i=1}^{N} (y_i − x_i^T β)^2 = (y − Xβ)^T (y − Xβ)
where x_i is a p x 1 column vector and X is an N x p matrix.
min_β RSS(β): this is quadratic in β, so it always has a minimum. Differentiate w.r.t. β:
X^T (y − Xβ) = 0, i.e. X^T X β = X^T y
If X^T X (the “information matrix”) is non-singular, then the unique solution is
β̂ = (X^T X)^{−1} X^T y
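A quick numerical check of this solution in R, on simulated data (illustrative only); the direct normal-equations solution should agree with R's built-in least-squares fit:

# simulate N observations with p = 3 inputs (first column is the constant 1)
set.seed(3)
N <- 50
X <- cbind(1, matrix(rnorm(N * 2), ncol = 2))
beta.true <- c(1, 2, -1)
y <- drop(X %*% beta.true + rnorm(N, sd = 0.1))

# beta = (X^T X)^{-1} X^T y via the normal equations
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat

# compare with lm() (the "- 1" drops lm's own intercept, since X already has one)
coef(lm(y ~ X - 1))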
2-class classification: linear decision boundary
code the class as a numerical response: G_i = 0 for green class, G_i = 1 for red class
fit by linear regression: Ĝ = X^T β̂
boundary is X^T β̂ = 0.5 (predict red where Ĝ > 0.5, green otherwise)
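A sketch of this classifier in R on simulated two-class data (the two Gaussian clouds below are illustrative, not the data from the figure):

# two classes in two dimensions, coded G = 0 (green) and G = 1 (red)
set.seed(4)
n <- 100
X <- rbind(matrix(rnorm(n * 2, mean = 0),   ncol = 2),   # green
           matrix(rnorm(n * 2, mean = 1.5), ncol = 2))   # red
G <- c(rep(0, n), rep(1, n))

fit <- lm(G ~ X)                        # linear regression on the 0/1 coding
G.hat <- as.numeric(fitted(fit) > 0.5)  # decision boundary at 0.5
mean(G.hat == G)                        # fraction correctly classified on the training data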
C.A.L. Bailer-Jones. Machine Learning. Introduction
21
2-class classification: K-nearest neighbours
[Figure: K-nearest-neighbour decision boundary for the same two-class data, © Hastie, Tibshirani, Friedman (2001)]
Comparison
Linear model
– makes a very strong assumption about the data, viz. that it is well approximated by a globally linear function
– stable but biased
– learn relationship between (X, y) and encapsulate it into the parameters β
K-nearest neighbours
– no assumption about functional form of relationship (X, y), i.e. it is nonparametric
– but does assume that function is well-approximated by a locally constant function
– less stable but less biased
– no free parameters to learn, so application to new data relatively slow: brute force search for neighbours takes O(N)
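For comparison, the K-nearest-neighbour version in R, using knn() from the class package (simulated data as in the previous sketch; k = 15 is an arbitrary illustrative choice):

library(class)   # provides knn()

# the same kind of simulated two-class data as above
set.seed(4)
n <- 100
X <- rbind(matrix(rnorm(n * 2, mean = 0),   ncol = 2),
           matrix(rnorm(n * 2, mean = 1.5), ncol = 2))
G <- factor(c(rep(0, n), rep(1, n)))

# no parameters are learned: every prediction searches the N training points
G.knn <- knn(train = X, test = X, cl = G, k = 15)
mean(G.knn == G)   # training classification rate (cf. the linear model above)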
Which solution is optimal?
if we know nothing about how the data were generated
(underlying model, noise), we don't know
if data drawn from two uncorrelated Gaussians: linear
decision boundary is almost optimal
if data drawn from mixture of multiple distributions: linear
boundary not optimal (nonlinear, disjoint)
what is optimal?
– smallest generalization errors
– simple solution (interpretability)
more complex models permit lower errors on training data
– but we want models to generalize
– need to control complexity / nonlinearity (regularization)
with enough training data, wouldn't k-nn be best?
Summary
supervised vs. unsupervised learning
generalization and regularization
regression vs. classification
parametric vs. non-parametric
linear regression, k-nearest neighbours
least squares