Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering

OUTLINE
2.0 Objectives
2.1 Data Encoding and Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary

2.0 Objectives
1. To quantify the notions of explanation, prediction, and model
2. To introduce terminology
3. To describe basic learning methods
4. To explain the importance of complexity control for generalization

Learning as Induction
• Induction ~ function estimation from data
• Deduction ~ prediction for new inputs
• Together, these two steps make up the standard inductive learning setting.

2.1 Data Encoding and Preprocessing
Common types of input and output variables (input variables ~ features):
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
The type of the output variable determines the learning task:
• Classification: categorical output
• Regression: real-valued output
• Ranking: ordinal output

Data Preprocessing and Scaling
Preprocessing is required with observational data (step 4 in the general experimental procedure).
• Basic preprocessing includes:
- summary univariate statistics (mean, standard deviation, min and max values, range, boxplot), computed independently for each input/output variable
- detection (and removal) of outliers
- scaling of input/output variables (may be necessary for some learning algorithms)
• Visual inspection of the data is tedious but useful.

Animal Body and Brain Weight Data
[Figure: scatter plot of the original, unscaled body and brain weight data.]

Removing Outliers
Remove the outliers (Brachiosaurus, Diplodocus, Triceratops, African elephant, Asian elephant) and plot the remaining data scaled to the [0, 1] range:
[Figure: scatter plot of brain weight vs. body weight, both scaled to [0, 1].]

2.2 Terminology and Learning Problems
• Input and output variables: the inputs x, together with other (unobserved) factors z, feed a System that produces the output y.
• Learning ~ estimation of f(x): x → y.
• A loss function $L(y, f(x))$ measures the quality of a prediction $\hat{y} = f(x)$.
• The loss function is defined for each common learning task and has to be related to application requirements.

Supervised Learning: Regression
• Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a univariate real-valued output
• Regression loss function: $L(y, f(x)) = (y - f(x))^2$
• Regression ~ estimation of a real-valued function x → y.

Supervised Learning: Classification
• Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a categorical output (class label)
• Loss function for binary classification: $L(y, f(x)) = 0$ if $y = f(x)$, and $1$ if $y \ne f(x)$
• Classification ~ estimation of an indicator function x → y.

Unsupervised Learning
• Data in the form (x), where x is a multivariate input (i.e., a vector)
• Goal: data reduction or clustering
• Clustering = estimation of a mapping x → c, where c is a cluster index.

Inductive Learning Setting
• A predictive estimator observes samples (x, y) and returns an estimated response $\hat{y} = f(x, w)$.
• Recall 'first-principle' vs. 'empirical' knowledge, and the two modes of inference: identification vs. imitation.
• Learning ~ minimization of risk: $R(w) = \int L(y, f(x, w)) \, dP(x, y) \to \min$

Example: Regression Estimation
• Given: training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$.
• Find a function $f(x, w)$ that minimizes the squared error over a large number N of future samples, $\sum_{k=1}^{N} (y_k - f(x_k, w))^2 \to \min$, which for large N amounts to minimizing the risk $\int (y - f(x, w))^2 \, dP(x, y) \to \min$.
• BUT future data are unknown, i.e., P(x, y) is unknown.

Discussion
• The mathematical formulation is useful for quantifying:
- explanation ~ fitting error (on training data)
- generalization ~ prediction error (on future data)
• Natural assumptions: the future is similar to the past (stationary P(x, y), i.i.d. data), and a discrepancy measure, i.e., a loss function such as MSE, is given.
• What if these assumptions do not hold?
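To make the loss and risk definitions above concrete, here is a minimal Python sketch (not from the lecture): it evaluates the squared-error and 0/1 losses and the resulting empirical risk, which data fitting minimizes as a computable stand-in for the unknown true risk. The toy data, the linear model, and the parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two loss functions defined above and the
# empirical risk they induce.  The toy data, the linear model, and
# the parameter values are hypothetical, not from the lecture.

def squared_loss(y, y_hat):
    """Regression loss: L(y, f(x)) = (y - f(x))^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Binary classification loss: 0 if y = f(x), 1 otherwise."""
    return (y != y_hat).astype(float)

def empirical_risk(loss, y, y_hat):
    """R_emp = average loss over the n training samples; a computable
    stand-in for the risk integral, since P(x, y) is unknown."""
    return np.mean(loss(y, y_hat))

# Hypothetical data: noisy quadratic target, predicted with a fixed line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = x**2 + rng.normal(0, 0.5, size=20)
y_hat = 1.0 * x - 0.2                    # f(x, w) = w*x + b with w=1, b=-0.2

print("empirical risk (MSE):", empirical_risk(squared_loss, y, y_hat))

# The same machinery with class labels and the 0/1 loss:
labels = (y > 0.3).astype(int)
pred_labels = (y_hat > 0.3).astype(int)
print("empirical risk (0/1):", empirical_risk(zero_one_loss, labels, pred_labels))
```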
2.3 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction

Parametric Modeling
Given training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$:
(1) Specify a parametric model.
(2) Estimate its parameters (via fitting to the data).
• Example: linear regression $f(x) = (w \cdot x) + b$, with parameters estimated by minimizing $\sum_{i=1}^{n} (y_i - (w \cdot x_i) - b)^2$.

Parametric Modeling: Classification
Given training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$:
(a) Estimate a linear decision boundary, or
(b) Estimate a third-order (polynomial) decision boundary.

Non-Parametric Modeling
Given training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, estimate the model at a given point $x_0$ as a 'local average' of the training data.
Note: this requires defining 'local' and 'average'.
• Example: k-nearest-neighbors regression: $f(x_0) = \frac{1}{k} \sum_{j=1}^{k} y_j$, where the sum runs over the k training samples nearest to $x_0$.

Example of kNN Regression
• Ten training samples from $y = x^2 + 0.1x + N(0, \sigma^2)$, where $\sigma^2 = 0.25$.
• Using kNN regression with k = 1 and k = 4:
[Figure: kNN regression estimates for k = 1 and k = 4.]

Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: 'compact' ~ the number of bits needed to encode the model.
• Example: piecewise linear regression. How many parameters are needed for a two-linear-component model?

Data Reduction Approach (cont'd)
Data reduction approaches are commonly used for unsupervised learning tasks.
• Example: clustering, where the training data are encoded by 3 points (the cluster centers).
• Issues:
- How to find the centers?
- How to select the number of clusters?

Standard Inductive Setting
• Model estimation ~ inductive step, i.e., estimating a function from data samples.
• Prediction ~ deductive step.
• Discussion: which of the three modeling approaches follow the standard inductive learning setting? How do humans perform inductive inference?

2.4 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling

Prediction Accuracy
• All modeling approaches implement 'data fitting', i.e., explaining the data, BUT the true goal is prediction.
• Model explanation ~ fitting error, training error, empirical risk.
• Prediction accuracy ~ generalization, test error, prediction risk.
• The trade-off between training and test error is controlled by 'model complexity'.

Explanation vs. Prediction
[Figure: (a) classification and (b) regression examples contrasting explanation (data fitting) with prediction.]

Complexity Control: Parametric Modeling
Consider regression estimation:
• Ten training samples from $y = x^2 + N(0, \sigma^2)$, where $\sigma^2 = 0.25$.
• Fitting a linear and a second-order polynomial model:
[Figure: linear vs. second-order polynomial fits to the ten samples.]

Complexity Control: Local Estimation
Consider regression estimation:
• Ten training samples from $y = x^2 + N(0, \sigma^2)$, where $\sigma^2 = 0.25$.
• Using kNN regression with k = 1 and k = 4:
[Figure: kNN regression estimates for k = 1 and k = 4.]

Complexity Control (cont'd)
• The complexity of the admissible models affects generalization (for future data).
• Specific complexity indices:
- parametric models: number of parameters
- local modeling: size of the local region
- data reduction: number of clusters
• Complexity control = choosing good complexity (~ good generalization) for the given (training) data.
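As a concrete illustration of the non-parametric (local averaging) approach and its complexity parameter, the following is a minimal from-scratch Python sketch of kNN regression on the ten-sample setup described above. The random seed and the evaluation grid are arbitrary choices, so the numbers will not match the lecture figures.

```python
import numpy as np

# Minimal from-scratch sketch of the local-averaging estimate
# f(x0) = (1/k) * sum of y over the k nearest neighbors, on the
# ten-sample setup described above.  Seed and evaluation grid are
# arbitrary assumptions.

def knn_regress(x0, x_train, y_train, k):
    """Predict at x0 as the average y of the k nearest training points."""
    dist = np.abs(x_train - x0)          # distance in the 1-D input space
    nearest = np.argsort(dist)[:k]       # indices of the k nearest samples
    return y_train[nearest].mean()

rng = np.random.default_rng(42)
x_train = rng.uniform(0, 1, size=10)
y_train = x_train**2 + rng.normal(0, np.sqrt(0.25), size=10)

for k in (1, 4):
    preds = [knn_regress(x0, x_train, y_train, k) for x0 in np.linspace(0, 1, 5)]
    print(f"k={k}:", np.round(preds, 3))

# k = 1 interpolates the noisy samples (high complexity, zero training
# error); k = 4 averages over a wider local region (smoother estimate).
```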
How to Control Complexity?
• Two approaches: analytic and resampling.
• Analytic criteria estimate the prediction error as a function of the fitting error and model complexity. For regression problems: $\hat{R} = r(p, n) \cdot R_{emp}$, where $p = \mathrm{DoF}/n$.
• Example analytic criteria for regression:
- Schwartz Criterion: $r(p, n) = 1 + \frac{p}{1 - p} \ln n$
- Akaike's Final Prediction Error (FPE): $r(p) = \frac{1 + p}{1 - p}$
where p = DoF/n, n ~ sample size, and DoF ~ degrees of freedom.

Resampling
• Split the available data into two sets: training + validation.
(1) Use the training set for model estimation (via data fitting).
(2) Use the validation data to estimate the prediction error of the model.
• Change the model complexity index and repeat steps (1) and (2).
• Select the final model providing the lowest (estimated) prediction error.
BUT the results are sensitive to the data splitting.

K-fold Cross-Validation
1. Divide the training data Z into k randomly selected disjoint subsets $\{Z_1, Z_2, \ldots, Z_k\}$ of size n/k.
2. For each 'left-out' validation set $Z_i$:
- use the remaining data to estimate the model $\hat{y} = f_i(x)$
- estimate the prediction error on $Z_i$: $r_i = \frac{k}{n} \sum_{(x_j, y_j) \in Z_i} (f_i(x_j) - y_j)^2$
3. Estimate the average prediction risk as $R_{cv} = \frac{1}{k} \sum_{i=1}^{k} r_i$.

Example of Model Selection (1)
• 25 samples are generated as $y = \sin^2(2\pi x) + \xi$, with x uniformly sampled in [0, 1] and noise $\xi \sim N(0, 1)$.
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10.
• Polynomial degree m = 5 is chosen via 5-fold cross-validation.
[Figure: the selected polynomial model, along with the training (*) and validation (*) data points, for one partitioning.]

m | Estimated R via cross-validation
1 | 0.1340
2 | 0.1356
3 | 0.1452
4 | 0.1286
5 | 0.0699
6 | 0.1130
7 | 0.1892
8 | 0.3528
9 | 0.3596
10 | 0.4006

Example of Model Selection (2)
• Same data set, but estimated using kNN regression.
• The optimal value k = 7 is chosen via 5-fold cross-validation.
[Figure: the selected kNN model, along with the training (*) and validation (*) data points, for one partitioning.]

k | Estimated R via cross-validation
1 | 0.1109
2 | 0.0926
3 | 0.0950
4 | 0.1035
5 | 0.1049
6 | 0.0874
7 | 0.0831
8 | 0.0954
9 | 0.1120
10 | 0.1227

Test Error
• The previous examples show two models estimated from the same data by different methods.
• Which model is better, i.e., has the lower test error? Note: the polynomial model has the lower cross-validation error.
• Double resampling (for estimating the test error): partition the data into learning / validation / test sets.
• Test data should never be used for model estimation.

Application Example
• Haberman's Survival Data Set:
- 5-year survival of female patients (following surgery for breast cancer)
- 306 cases (patients)
- inputs: age and number of positive axillary nodes
• Method: k-NN classifier (only odd k values). Note: input values are pre-scaled to [0, 1].
• Model selection via leave-one-out (LOO) cross-validation.
• The optimal k = 45 yields the minimum LOO error of 22.75%.

Model Selection for the k-NN Classifier via Cross-Validation
[Figure: the estimated decision boundary for the optimal k = 45.]

k | Error (%)
1 | 42
3 | 30.67
7 | 26
15 | 24.33
... | ...
45 | 21.67
47 | 22.33
51 | 23
53 | 24.33
57 | 24
61 | 25
99 | 26.33

Estimating the Test Error of a Method
• For the same example (Haberman's data), what is the true test error of the k-NN method?
• Use double resampling, i.e., 5-fold cross-validation to estimate the test error, with LOO cross-validation to select the optimal k for each training fold:

Fold | Optimal k | LOO error | Test error
1 | 11 | 22.5% | 28.33%
2 | 37 | 25% | 13.33%
3 | 37 | 23.33% | 16.67%
4 | 33 | 24.17% | 25%
5 | 35 | 18.75% | 35%
mean | | 22.75% | 23.67%

Note: the optimal k values differ across folds, and the average test error is larger than the average validation error.
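The resampling machinery used throughout this section can be written in a few lines. Below is a minimal Python sketch of k-fold cross-validation for model selection, in the spirit of the 25-sample sine-squared example; the seed, fold assignment, and candidate degrees are illustrative assumptions, so the selected degree and risk values will differ from the table above, which reflects one particular partitioning.

```python
import numpy as np

# Minimal sketch of k-fold cross-validation for selecting a polynomial
# degree, following the procedure described above.  Seed, folds, and
# candidate degrees are illustrative choices, not the lecture's.

def kfold_cv_risk(x, y, degree, k=5, seed=0):
    """R_cv for a polynomial of the given degree: the average of the
    per-fold validation errors r_i."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        val = folds[i]                              # left-out set Z_i
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[trn], y[trn], degree)   # fit on remaining data
        resid = y[val] - np.polyval(coef, x[val])
        risks.append(np.mean(resid**2))             # r_i on Z_i
    return np.mean(risks)                           # R_cv

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=25)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 1, size=25)

scores = {m: kfold_cv_risk(x, y, m) for m in range(1, 11)}
best = min(scores, key=scores.get)
print("degree selected by 5-fold CV:", best)

# Estimating the *test* error of this whole procedure would require
# double resampling: an outer data split whose held-out part is never
# touched during the model selection above.
```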
Summary and Discussion
• Learning as function estimation (from data) ~ the standard inductive learning setting.
• Common types of learning problems: classification, regression, clustering.
• Non-standard learning settings.
• Model estimation via data fitting (ERM).
• Model complexity and generalization:
- how to measure model complexity
- several complexity (tuning) parameters