Pattern recognition – basic concepts

Sample (Iris data; the first four columns are input attributes, Species is the output attribute):

Inst. | Sepal Length | Sepal Width | Petal Length | Petal Width | Species
------|--------------|-------------|--------------|-------------|--------
1     | 5.1          | 3.5         | 1.4          | 0.2         | setosa
2     | 4.9          | 3.0         | 1.4          | 0.2         | setosa
3     | 4.7          | 3.2         | 1.3          | 0.2         | setosa
4     | 4.6          | 3.1         | 1.5          | 0.2         | setosa
5     | 5.0          | 3.6         | 1.4          | 0.2         | setosa

Terminology:
• input attribute, attribute, feature, input variable, independent variable (Czech: atribut, rys, příznak, vstupní proměnná, nezávisle proměnná)
• class, output variable, dependent variable (Czech: třída, výstupní proměnná, závislá proměnná)
• sample (Czech: vzorek)

Handwritten digits
• each digit is 28 × 28 pixels, so each digit can be represented by a vector x comprising 784 real numbers
• goal: build a machine that takes x as input and produces the identity of the digit 0 … 9 as output
  – a non-trivial problem, due to the wide variability of handwriting
  – it could be tackled with hand-crafted rules for distinguishing the digits based on the shapes of the strokes
  – in practice such an approach leads to a proliferation of rules and of exceptions to the rules, and invariably gives poor results
• a better way: adopt a machine learning algorithm, i.e. use an adaptive model

[diagram: input → model → output; the model's internal parameters, which influence its behavior, must be adjusted]

• tune the parameters of the model using the training set
  – the training set is a data set of N digits {x1, …, xN}
• the categories of the digits in the training set are known in advance (inspected manually); they form a target vector t, one target vector for each digit
  – e.g. the digit 3 is encoded by the target vector t = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0), one position for each digit 0–9 (a NumPy sketch of this encoding appears at the end of this section)

The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and generates an output vector y, encoded in the same way as the target vectors.

[diagram: the input vector (x1, x2, …, x784) is fed into y(x), which produces the output vector y]

• The precise form of the function y(x) is determined during the training phase (learning phase).
  – The model adapts its parameters (i.e. learns) on the basis of the training data {x1, …, xN}.
• The trained model can then determine the identity of new, previously unseen digit images, which are said to comprise a test set.
• The ability to categorize correctly new examples that differ from those used for training is known as generalization.

Preprocessing
For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve.

[diagram: the input vector (x1, …, x784) passes through a preprocessing step before reaching y(x)]

• Translate and scale the images of the digits so that each digit is contained within a box of a fixed size.
• This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same.
• This preprocessing stage is sometimes also called feature extraction.
• Test data must be preprocessed using the same steps as the training data.

Feature selection and feature extraction
• feature selection keeps a subset of the original attributes unchanged (in the slide diagram, x1, x5, x103, x456 are selected from x1, …, x784)
• feature extraction computes new attributes x*1, x*2, … as functions of all the original attributes (in the diagram, x1, …, x784 are mapped to x*1, …, x*784, of which e.g. x*18, x*152, x*309, x*666 are highlighted); a sketch contrasting the two approaches appears below
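As a concrete illustration of the target encoding described above, here is a minimal Python sketch (mine, not from the slides); the function name one_hot is a hypothetical helper, chosen for illustration.

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Target vector t: all zeros except a single 1 at the digit's position."""
    t = np.zeros(num_classes)
    t[digit] = 1.0
    return t

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```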
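And to make the selection/extraction distinction concrete, a short NumPy sketch under assumed shapes (again mine, not from the slides): selection keeps chosen original columns unchanged, while extraction computes every new attribute from all the original ones. The fixed random projection below merely stands in for a real extraction method such as PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 784))        # 5 samples, 784 original attributes (as in the digits example)

# Feature selection: keep a subset of the original attributes, values unchanged.
keep = [0, 4, 102, 455]         # illustrative 0-based indices for x1, x5, x103, x456
X_sel = X[:, keep]              # shape (5, 4)

# Feature extraction: each new attribute x*j is a function of ALL original ones.
# A fixed random linear map stands in here for a real method such as PCA.
W = rng.standard_normal((784, 4))
X_ext = X @ W                   # shape (5, 4); every column mixes all 784 inputs
```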
Dimensionality reduction
• We want to reduce the number of dimensions because of:
  – efficiency
    • measurement costs
    • storage costs
    • computation costs
  – the problem may be solved more easily in the new space
  – improved classification performance
  – ease of interpretation

Curse of dimensionality
[figure: illustration of the curse of dimensionality; Bishop, Pattern Recognition and Machine Learning]

Supervised learning
• the training data comprise examples of the input vectors along with their corresponding target vectors (e.g. digit recognition)
  – classification – aim: assign an input vector to one of a finite number of discrete categories
  – regression (data/curve fitting) – the desired output consists of one or more continuous variables

Unsupervised learning
• the training data consist of a set of input vectors x without any corresponding target values
• goals:
  – discover groups of similar examples within the data – clustering
  – project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization

Polynomial curve fitting
• a regression problem, supervised
• we observe a real-valued input variable x and wish to use this observation to predict the value of a real-valued target variable t
• artificial example: t = sin(2πx) + random noise
• training set: N observations of x, written x = (x1, …, xN)^T, together with the corresponding observations of the values of t, written t = (t1, …, tN)^T (a data-generation sketch appears at the end of this section)

[figure: N = 10 training points (xn, tn) generated as sin(2πxn) plus random noise; adapted from Bishop, Pattern Recognition and Machine Learning]

• goal: exploit the training set in order to predict the value t′ of the target variable for a new value x′ of the input variable
• this is generally difficult, because
  – we have to generalize from a finite data set
  – the data are corrupted by noise, so for a given x′ there is uncertainty in the value of t′

[figure: prediction of t′ at a new input x′; adapted from Bishop, Pattern Recognition and Machine Learning]

• decision: which method to use?
• from the plethora of possibilities (which you do not know about yet), a simple one is chosen here: the data will be fitted using the polynomial function

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

• the polynomial coefficients w0, …, wM form a vector w
  – they are the parameters of the model that must be set in the training phase
  – the polynomial model is still linear regression: y(x, w) is nonlinear in x, but linear in the coefficients w
• the values of the coefficients will be determined by minimizing an error function
• the error function measures the misfit between the function y(x, w) and the training set data points
• one simple choice is the sum of squared errors (SSE) between the prediction y(xn, w) for each data point xn and the corresponding target value tn:

$$\mathrm{SSE} = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2$$

[figure: each term of the SSE is the squared displacement of the data point tn from the fitted function y(xn, w); Bishop, Pattern Recognition and Machine Learning]

• solving the curve fitting problem means choosing the value w* of w for which the error function E(w) (here the SSE) is as small as possible, giving the fitted function y(x, w*)
  – "as small as possible" means finding a minimum of E(w), i.e. the point where its derivatives with respect to the coefficients vanish (a least-squares sketch follows below)
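A toy training set of this kind can be generated along the following lines (a sketch; the evenly spaced inputs and the noise level 0.2 are assumptions, since the slides say only "random noise").

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0.0, 1.0, N)                              # inputs x1, ..., xN
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)  # tn = sin(2*pi*xn) + noise
```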
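Finally, because y(x, w) is linear in w, minimizing the SSE has a closed-form least-squares solution. A sketch with NumPy (the helpers fit_polynomial and predict are hypothetical names; NumPy's polyvander builds the design matrix whose columns are the powers x^0, …, x^M):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial, minimizing the SSE."""
    Phi = np.polynomial.polynomial.polyvander(x, M)  # design matrix, columns x^0 .. x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # w* minimizing ||Phi w - t||^2
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return np.polynomial.polynomial.polyval(x, w)

# Example use with the toy data generated above (M = 3 is an arbitrary choice):
# w_star = fit_polynomial(x, t, M=3)
# sse = 0.5 * np.sum((predict(x, w_star) - t) ** 2)
```

np.linalg.lstsq returns the w minimizing ||Φw − t||², which is the same minimizer w* as for the SSE, since the constant factor 1/2 does not move the minimum.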