Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering

OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary

2.0 Objectives
1. To quantify the notions of explanation, prediction and model
2. To introduce terminology
3. To describe basic learning methods
4. To explain the importance of complexity control for generalization

Learning as Induction
• Induction ~ function estimation from data
• Deduction ~ prediction for new inputs
• Together these form the Standard Inductive Learning Setting

2.1 Data Encoding + Preprocessing
Common types of input and output variables (input variables ~ features):
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
The output type determines the learning task:
• Classification: categorical output
• Regression: real-valued output
• Ranking: ordinal output

Data Preprocessing and Scaling
Preprocessing is required with observational data (step 4 in the general experimental procedure).
• Basic preprocessing includes:
  - summary univariate statistics (mean, standard deviation, min and max value, range, boxplot), computed independently for each input/output variable
  - detection (and removal) of outliers
  - scaling of input/output variables (may be necessary for some learning algorithms)
• Visual inspection of data is tedious but useful

Animal Body & Brain Weight Data (original, unscaled)
[Scatter plot of the original, unscaled data]

Removing Outliers
Remove the outliers (Brachiosaurus, Diplodocus, Triceratops, African elephant, Asian elephant) and plot the data scaled to the [0, 1] range.
[Scatter plot: Brain weight vs. Body weight, both scaled to [0, 1]]

2.2 Terminology and Learning Problems
• Input and output variables: inputs x and other (unobserved) factors z enter the System, which produces the output y
• Learning ~ estimation of f(x): x -> y
• A loss function L(y, f(x)) measures the quality of the prediction y_hat = f(x)
• The loss function:
  - is defined for common learning tasks
  - has to be related to application requirements

Supervised Learning: Regression
• Data in the form (x, y), where
  - x is multivariate input (i.e. a vector)
  - y is univariate output (real-valued)
• Regression loss function: L(y, f(x)) = (y - f(x))^2
• Estimation of a real-valued function x -> y

Supervised Learning: Classification
• Data in the form (x, y), where
  - x is multivariate input (i.e. a vector)
  - y is categorical output (class label)
• Loss function for binary classification:
  L(y, f(x)) = 0 if y = f(x), and 1 if y != f(x)
• Estimation of an indicator function x -> y

Unsupervised Learning
• Data in the form (x), where x is multivariate input (i.e. a vector)
• Goal: data reduction or clustering
• Clustering = estimation of a mapping x -> c (cluster index)

Inductive Learning Setting
• A predictive estimator observes samples (x, y) and returns an estimated response y_hat = f(x, w)
• Recall 'first-principle' vs 'empirical' knowledge; two modes of inference: identification vs imitation
• Minimization of risk: R(w) = Integral of Loss(y, f(x, w)) dP(x, y) -> min

Example: Regression Estimation
• Given: training data (x_i, y_i), i = 1, 2, ..., n
• Find a function f(x, w) that minimizes the squared error for a large number N of future samples:
  Sum_{k=1}^{N} [y_k - f(x_k, w)]^2 -> min,  i.e.  Integral of (y - f(x, w))^2 dP(x, y) -> min
• BUT future data is unknown ~ P(x, y) is unknown

Discussion
• The mathematical formulation is useful for quantifying:
  - explanation ~ fitting error (on training data)
  - generalization ~ prediction error
• Natural assumptions:
  - the future is similar to the past: stationary P(x, y), i.i.d. data
  - a discrepancy measure or loss function, e.g. MSE
• What if these assumptions do not hold?
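The preprocessing and empirical-risk ideas above can be made concrete with a short sketch. The Python/NumPy code below (illustrative only; the data, function names, and the constant baseline model are not from the lecture) scales input features to the [0, 1] range and computes the squared-error fitting error used as the empirical risk for regression.

```python
import numpy as np

def scale_to_unit_range(x):
    """Min-max scale each column of x to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def empirical_risk(y, y_pred):
    """Squared-error fitting (training) error: the empirical risk for regression."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

# Illustrative use: scale two input features, then measure the fit of a trivial constant model.
X = np.array([[2.0, 300.0],
              [4.0, 150.0],
              [6.0, 900.0]])
y = np.array([1.0, 2.0, 3.0])

X_scaled = scale_to_unit_range(X)
baseline = np.full_like(y, y.mean())   # constant predictor used as a baseline
print(X_scaled)
print(empirical_risk(y, baseline))
```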
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
  - Parametric Modeling
  - Non-parametric Modeling
  - Data Reduction
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary

Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
• Example: linear regression, f(x) = (w . x) + b, with parameters chosen so that
  Sum_{i=1}^{n} (y_i - (w . x_i) - b)^2 -> min

Parametric Modeling: Classification
Given training data (x_i, y_i), i = 1, 2, ..., n:
(a) Estimate a linear decision boundary
(b) Estimate a third-order decision boundary

Non-Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n, estimate the model (at a given point x_0) as a 'local average' of the training data.
Note: need to define 'local' and 'average'.
• Example: k-nearest-neighbors regression
  f(x_0) = (1/k) Sum_{j=1}^{k} y_j, where the sum is over the k training samples nearest to x_0

Example of kNN Regression
• Ten training samples from y = x^2 + 0.1x + N(0, sigma^2), where sigma^2 = 0.25
• Using k-nn regression with k = 1 and k = 4

Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: 'compact' ~ the number of bits needed to encode the model.
• Example: piecewise linear regression. How many parameters are needed for a two-linear-component model?

Data Reduction Approach (cont'd)
Data reduction approaches are commonly used for unsupervised learning tasks.
• Example: clustering, where the training data are encoded by 3 points (cluster centers)
• Issues:
  - How to find the centers?
  - How to select the number of clusters?

Standard Inductive Setting
• Model estimation ~ inductive step, i.e. estimating a function from data samples
• Prediction ~ deductive step
• Together these form the (standard) inductive learning setting
• Discussion: which of the 3 modeling approaches follow standard inductive learning? How do humans perform inductive inference?

OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
  - Prediction Accuracy (generalization)
  - Complexity Control: examples
  - Resampling
2.5 Application Example
2.6 Summary

Prediction Accuracy
• All modeling approaches implement 'data fitting' ~ explaining the data
• BUT the true goal is prediction
• Model explanation ~ fitting error, training error, empirical risk
• Prediction accuracy ~ generalization, test error, prediction risk
• The trade-off between training and test error is controlled by 'model complexity'

Explanation vs Prediction
[Figures: (a) Classification, (b) Regression]

Complexity Control: Parametric Modeling
Consider regression estimation:
• Ten training samples from y = x^2 + N(0, sigma^2), where sigma^2 = 0.25
• Fitting linear and 2nd-order polynomial models

Complexity Control: Local Estimation
Consider regression estimation:
• Ten training samples from y = x^2 + N(0, sigma^2), where sigma^2 = 0.25
• Using k-nn regression with k = 1 and k = 4

Complexity Control (cont'd)
• The complexity of the admissible models affects generalization (for future data)
• Specific complexity indices:
  - Parametric models: ~ number of parameters
  - Local modeling: size of the local region
  - Data reduction: number of clusters
• Complexity control = choosing good complexity (~ good generalization) for the given (training) data
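The k-nearest-neighbors regression described above (a local average of the k nearest training outputs) can be sketched in a few lines. The NumPy code below is a minimal 1-D illustration under a noisy y = x^2 setup similar in spirit to the slides; the random seed, grid, and sample values are assumptions, not the lecture's data. It shows how k acts as the complexity parameter: k = 1 follows individual noisy samples, while k = 4 smooths them.

```python
import numpy as np

def knn_regression(x_train, y_train, x_query, k):
    """Predict at each query point as the average y of its k nearest (1-D) training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x0 in np.atleast_1d(x_query):
        nearest = np.argsort(np.abs(x_train - x0))[:k]   # indices of the k nearest neighbors
        preds.append(y_train[nearest].mean())            # local average: f(x0) = (1/k) * sum of y_j
    return np.array(preds)

# Illustrative data in the spirit of the slides: ten noisy samples of y = x^2 (sigma^2 = 0.25).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 10))
y = x ** 2 + rng.normal(0.0, 0.5, 10)

x_grid = np.linspace(0.0, 1.0, 5)
print(knn_regression(x, y, x_grid, k=1))   # very flexible: reproduces individual noisy samples
print(knn_regression(x, y, x_grid, k=4))   # smoother: averages the noise over 4 neighbors
```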
How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity. For regression problems:
  R = r(p, n) * R_emp, where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
• Example analytic criteria for regression:
  - Schwartz criterion: r(p, n) = 1 + p (1 - p)^(-1) ln n
  - Akaike's FPE: r(p) = (1 + p)(1 - p)^(-1)

Resampling
• Split the available data into 2 sets: training + validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
• BUT the results are sensitive to the data splitting

K-fold Cross-Validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z_1, Z_2, ..., Z_k} of size n/k
2. For each 'left-out' validation set Z_i:
   - use the remaining data to estimate the model y_hat = f_i(x)
   - estimate the prediction error on Z_i:  r_i = (k/n) Sum_{(x, y) in Z_i} (f_i(x) - y)^2
3. Estimate the average prediction risk as  R_cv = (1/k) Sum_{i=1}^{k} r_i

Example of Model Selection (1)
• 25 samples are generated as y = sin^2(2*pi*x) + noise, with x uniformly sampled in [0, 1] and noise ~ N(0, 1)
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The figure shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

  m     Estimated R via cross-validation
  1     0.1340
  2     0.1356
  3     0.1452
  4     0.1286
  5     0.0699
  6     0.1130
  7     0.1892
  8     0.3528
  9     0.3596
  10    0.4006

Example of Model Selection (2)
• Same data set, but estimated using k-nn regression.
• The optimal value k = 7 is chosen according to 5-fold cross-validation model selection. The figure shows the k-nn model, along with training (*) and validation (*) data points, for one partitioning.

  k     Estimated R via cross-validation
  1     0.1109
  2     0.0926
  3     0.0950
  4     0.1035
  5     0.1049
  6     0.0874
  7     0.0831
  8     0.0954
  9     0.1120
  10    0.1227

Test Error
• The previous example shows two models (estimated from the same data by different methods)
• Which model is better, i.e. has lower test error? Note: the polynomial model has lower cross-validation error
• Double resampling (for estimating test error): partition the data into learning / validation / test sets
• Test data should never be used for model estimation

Application Example
• Haberman's Survival Data Set
  - 5-year survival of female patients (following surgery for breast cancer)
  - 306 cases (patients)
  - inputs: age, number of positive axillary nodes
• Method: k-NN classifier (only odd k-values); note: input values are pre-scaled to [0, 1]
• Model selection via leave-one-out (LOO) cross-validation
• The optimal k = 45 yields a minimum LOO error of 22.75%

Model Selection for the k-NN Classifier via Cross-Validation
Optimal decision boundary for k = 45

  k      Error (%)
  1      42
  3      30.67
  7      26
  15     24.33
  ...    ...
  45     21.67
  47     22.33
  51     23
  53     24.33
  57     24
  61     25
  99     26.33

Estimating Test Error of a Method
• For the same example (Haberman's data), what is the true test error of the k-NN method?
• Use double resampling, i.e. 5-fold cross-validation to estimate the test error, and LOO cross-validation to estimate the optimal k for each training fold:

  Fold   Optimal k   LOO error   Test error
  1      11          22.5%       28.33%
  2      37          25%         13.33%
  3      37          23.33%      16.67%
  4      33          24.17%      25%
  5      35          18.75%      35%
  mean               22.75%      23.67%

Note: the optimal k-values differ across folds, and the average test error is larger than the average validation error.
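As a rough illustration of the analytic criteria on the "How to Control Complexity?" slide above, the sketch below computes the penalization factors r(p) and the resulting risk estimate R = r * R_emp. The Schwartz formula is coded exactly as reconstructed above, and the numeric inputs (R_emp = 0.10, DoF = 5, n = 25) are assumed values for illustration only.

```python
import numpy as np

def fpe(p):
    """Akaike's final prediction error penalization factor: r(p) = (1 + p) / (1 - p)."""
    return (1.0 + p) / (1.0 - p)

def schwartz(p, n):
    """Schwartz criterion penalization factor, as reconstructed above: r(p, n) = 1 + p (1 - p)^-1 ln n."""
    return 1.0 + p / (1.0 - p) * np.log(n)

def analytic_risk_estimate(r_emp, dof, n, use_schwartz=False):
    """Estimate the prediction risk as R ~= r(p[, n]) * R_emp, where p = DoF / n."""
    p = dof / n
    factor = schwartz(p, n) if use_schwartz else fpe(p)
    return factor * r_emp

# Illustrative use: training MSE of 0.10 with 5 free parameters and 25 samples.
print(analytic_risk_estimate(0.10, dof=5, n=25))                      # FPE estimate
print(analytic_risk_estimate(0.10, dof=5, n=25, use_schwartz=True))   # Schwartz estimate
```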
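The k-fold cross-validation procedure and the polynomial model-selection example above can also be sketched directly. The NumPy version below mirrors the three steps on the k-fold slide (random disjoint folds, fit on the remaining data, average the fold prediction errors); the data-generating code, the seed, and the use of np.polyfit/np.polyval for the polynomial fit are assumptions made for illustration, not the lecture's implementation.

```python
import numpy as np

def kfold_cv_risk(x, y, degree, k=5, seed=0):
    """Estimate the prediction risk of a degree-`degree` polynomial by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)                             # k disjoint validation subsets
    risks = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)        # fit on the remaining data
        y_hat = np.polyval(coeffs, x[val])                     # predict on the left-out fold
        risks.append(np.mean((y_hat - y[val]) ** 2))           # fold prediction error r_i
    return np.mean(risks)                                      # R_cv = average over the k folds

# Illustrative data in the spirit of the slides: 25 noisy samples of y = sin^2(2*pi*x).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 25)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0.0, 1.0, 25)

cv = {m: kfold_cv_risk(x, y, m) for m in range(1, 11)}         # degrees m = 1, ..., 10
best_m = min(cv, key=cv.get)                                   # degree with the lowest estimated risk
print(best_m, cv[best_m])
```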
Summary and Discussion
• Learning as function estimation (from data) ~ the standard inductive learning setting
• Common types of learning problems: classification, regression, clustering
• Non-standard learning settings
• Model estimation via data fitting (ERM)
• Model complexity and generalization:
  - how to measure model complexity
  - several complexity (tuning) parameters