Optimization and Methods to Avoid Overfitting
Automated Trading 2008
London
15 October 2008
Martin Sewell
Department of Computer Science
University College London
[email protected]
Outline of Presentation
• Bayesian inference
• The futility of bias-free learning
• Overfitting and no free lunch for Occam’s razor
• Bayesian model selection
• Bayesian model averaging
Terminology
• A model is a family of functions
function: f(x) = 3x + 4
model: f(x) = ax + b
• A complex model has a large volume in
parameter space. If we know how complex a
function is, with some assumptions, we can
determine how ‘surprising’ it is.
Statistics vs Machine Learning
Which paradigm better describes our aims?
• Statistics: test a given hypothesis
• Machine learning: formulate the process of generalization
as a search through possible hypotheses in an attempt to
find the best hypothesis
Answer: machine learning
Classical Statistics vs Bayesian Inference
Which paradigm tells us what we want to know?
• Classical Statistics
P(data|null hypothesis)
• Bayesian Inference
P(hypothesis|data, background information)
Answer: Bayesian inference
Problems with Classical Statistics
• The nature of the null hypothesis test
• Prior information is ignored
• Assumptions swept under the carpet
• p values are irrelevant (which leads to incoherence) and misleading
Bayesian Inference
• Definition of a Bayesian: a Bayesian is willing to put a
probability on a hypothesis
• Bayes’ theorem is a trivial consequence of the product rule
• Bayesian inference tells us how we should update our
degree of belief in a hypothesis in the light of new
evidence
• Bayesian analysis is more than merely a theory of
statistics; it is a theory of inference.
• Science is applied Bayesian analysis
• Everyone should be a Bayesian!
Bayes' Theorem
B = background information
H = hypothesis
D = data
P(H|B) = prior
P(D|B) = probability of the data
P(D|H&B) = likelihood
P(H|D&B) = posterior
P(H|D&B) = P(H|B)P(D|H&B)/P(D|B)
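A minimal numeric sketch of this update, for a hypothesis H and its complement; all numbers are invented for illustration:

    # Sketch of Bayes' theorem for a hypothesis H and its complement.
    # All numbers are illustrative, not from the talk.
    prior_h = 0.5             # P(H|B)
    like_h = 0.8              # P(D|H&B)
    like_not_h = 0.2          # P(D|~H&B)

    # P(D|B): marginalize over H (conditioning on B is implicit throughout)
    evidence = prior_h * like_h + (1 - prior_h) * like_not_h

    posterior_h = prior_h * like_h / evidence   # P(H|D&B)
    print(posterior_h)                          # 0.8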
Bayesian Inference
• There is no such thing as an absolute probability, P(H), but
we often omit B, and write P(H) when we mean P(H|B).
• An implicit rule of probability theory is that any random
variable not conditioned on is marginalized over.
• The denominator in Bayes’ theorem, P(D|B), is
independent of H, so when comparing hypotheses, we can
omit it and use P(H|D) ∝ P(H)P(D|H).
The Futility of Bias-Free Learning
• ‘Even after the observation of the frequent conjunction of
objects, we have no reason to draw any inference
concerning any object beyond those of which we have had
experience.’ Hume (1739–40)
• Bias-free learning is futile (Mitchell 1980; Schaffer 1994;
Wolpert 1996).
• One can never generalize beyond one’s data without
making at least some assumptions.
No Free Lunch Theorem
The no free lunch (NFL) theorem for supervised machine
learning (Wolpert 1996) tells us that, on average, all
algorithms are equivalent.
Note that the NFL theorems apply to off-training set
generalization error, i.e., generalization error for test sets
that contain no overlap with the training set.
• No free lunch for Occam’s razor
• No free lunch for overfitting avoidance
• No free lunch for cross validation
Occam’s Razor
• Occam's razor (also spelled Ockham's razor) is a
law of parsimony: the principle gives precedence
to simplicity; of two competing theories, the
simpler explanation is to be preferred.
• Attributed to the 14th-century English logician and
Franciscan friar, William of Ockham.
• There is no free lunch for Occam’s razor.
Bayesian Inference and Occam’s Razor
If the data fits the following two hypotheses equally well, which should be
preferred?
H1: f(x) = bx + c
H2: f(x) = ax² + bx + c
Recall that P(H|D) ∝ P(H)P(D|H) with H the hypothesis and D the data;
assume equal priors, so just consider the likelihood, P(D|H).
Because the more complex model has more parameters, and its probability
must sum to 1, its probability mass P(D|H2) will be more ‘spread out’
than P(D|H1); so if the data fit equally well, the simpler model, H1,
should be preferred.
In other words, Bayesian inference automatically and quantitatively
embodies Occam's razor.
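A rough Monte Carlo sketch of this argument: with flat coefficient priors on [-5, 5] (and an invented data set and noise level), averaging the likelihood over the prior approximates P(D|H), and the simpler hypothesis typically attains the larger value.

    # Sketch: Monte Carlo marginal likelihoods for H1 (linear) vs H2
    # (quadratic). Data, noise level, and prior range are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 20)
    sigma = 0.5                                        # assumed noise level
    y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)   # truly linear data

    def marginal_likelihood(n_params, n_samples=200_000):
        # Average the Gaussian likelihood over coefficients drawn from
        # the flat prior on [-5, 5]; coefficients are highest power first.
        coeffs = rng.uniform(-5, 5, (n_samples, n_params))
        powers = x[None, :] ** np.arange(n_params - 1, -1, -1)[:, None]
        resid = y[None, :] - coeffs @ powers
        log_like = (-0.5 * np.sum((resid / sigma) ** 2, axis=1)
                    - x.size * np.log(sigma * np.sqrt(2 * np.pi)))
        return np.exp(log_like).mean()

    ml_h1 = marginal_likelihood(2)   # H1: f(x) = bx + c
    ml_h2 = marginal_likelihood(3)   # H2: f(x) = ax^2 + bx + c
    print(ml_h1 / ml_h2)             # typically well above 1: H1 preferred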
Bayesian Inference and Occam’s Razor
[Figure: marginal likelihood over data sets D for a simple model H1 and
a complex model H2; for data sets in region C, H1 is more probable.]
Occam’s Razor from First Principles?
Bayesian model selection appears to formally justify Occam's
razor from first principles. Alas, this is too good to be true:
it contradicts the no free lunch theorem. Our ‘proof’ of
Occam’s razor involved an element of smoke and mirrors.
Ad hoc assumptions:
• The set of models with a non-zero prior is extremely small,
i.e. all but a countable number of models have exactly zero
probability.
• A flat prior over models (corresponds to a non-flat prior
over functions). When choosing priors, should the
‘principle of insufficient reason’ be applied to functions or
models?
Cross Validation
• Cross-validation (Stone 1974, Geisser 1975) is
the practice of partitioning a sample of data into
subsets such that the analysis is initially
performed on a single subset, while the other
subset(s) are retained for subsequent use in
confirming and validating the initial analysis.
• There is no free lunch for cross-validation.
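For concreteness, a minimal k-fold cross-validation sketch; the synthetic data and the least-squares line fit are illustrative assumptions:

    # Sketch of k-fold cross-validation for a least-squares line fit.
    import numpy as np

    def k_fold_cv_mse(x, y, k=5, degree=1):
        # Split indices into k folds; each fold serves once as the
        # held-out validation set while the rest are used for fitting.
        idx = np.random.default_rng(0).permutation(x.size)
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            coeffs = np.polyfit(x[train], y[train], degree)  # fit on k-1 folds
            pred = np.polyval(coeffs, x[test])               # validate on the rest
            errors.append(np.mean((y[test] - pred) ** 2))
        return float(np.mean(errors))

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 100)
    y = 3 * x + 4 + rng.normal(0, 0.5, x.size)   # placeholder data
    print(k_fold_cv_mse(x, y, k=5, degree=1))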
Overfitting
• Overfitting avoidance cannot be justified from first
principles.
• The distinction between structure and noise cannot
be made on the basis of training data, so
overfitting avoidance cannot be justified from the
training set alone.
• Overfitting avoidance is not an inherent
improvement, but a bias.
Underfitting and Overfitting
Bias-Variance Trade-Off
Consider a training set, the target (true) function, and an
estimator (your guess).
• Bias The extent to which the average (over all samples
from the training set) of the estimator differs from the
desired function.
• Variance The extent to which the estimator fluctuates
around its expected value as the samples from the training
set vary.
Bias-Variance Trade-Off: Formula
X = input space
f = target
h = hypothesis
d = training set
m = size of d
C = cost
Y = output space
q = test set point
YF = target Y-values
YH = hypothesis Y-values
σf² = intrinsic error due to f
E(C | f, m, q) = σf² + (bias)² + variance,
where σf² ≡ E(YF² | f, q) - [E(YF | f, q)]²,
bias ≡ E(YF | f, q) - E(YH | f, q),
variance ≡ E(YH² | f, q) - [E(YH | f, q)]².
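A Monte Carlo sketch of this decomposition at a single test point q; the sine target, noise level, and polynomial estimators are illustrative assumptions:

    # Sketch: estimate bias^2 and variance at a test point q by
    # resampling the training set d. Target f and noise are invented.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)   # assumed target function
    q, m, runs = 0.25, 30, 2000           # test point, size of d, resamples

    def bias2_and_variance(degree):
        preds = []
        for _ in range(runs):
            x = rng.uniform(0, 1, m)              # fresh training set d
            y = f(x) + rng.normal(0, 0.3, m)      # noisy targets Y_F
            coeffs = np.polyfit(x, y, degree)
            preds.append(np.polyval(coeffs, q))   # hypothesis value Y_H at q
        preds = np.array(preds)
        bias = f(q) - preds.mean()                # E(Y_F|q) - E(Y_H|q)
        return bias**2, preds.var()

    print("degree 1:", bias2_and_variance(1))   # larger bias^2, smaller variance
    print("degree 9:", bias2_and_variance(9))   # smaller bias^2, larger variance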
Bias-Variance Trade-Off: Issues
C = cost, d = training set, m = size of the training
set, f = target
• There need not always be a bias-variance trade-off,
because there exists an algorithm with both
zero bias and zero variance.
• The bias-plus-variance formula ‘examines the
wrong quantity’. In the real world, it is almost
never E(C | f, m) that is directly of interest, but
rather E(C | d), which is what a Bayesian is
interested in.
Model Selection
1) Model selection - Difficult!
Choose from f(x) = ax² + bx + c or f(x) = bx + c
2) Parameter estimation - Easy!
Given f(x) = bx + c, find b and c
Model selection is the task of choosing a model with
the correct inductive bias.
Bayesian Model Selection
• The overfitting problem was solved in principle by Sir Harold
Jeffreys in 1939
• Chooses the model with the largest posterior probability
• Works with nested or non-nested models
• No need for a validation set
• No ad hoc penalty term (except the prior)
• Informs you of how much structure can be justified by the
given data
• Consistent
Pedagogical Example: Data
• GBP to USD interbank rate
• Daily data
• Exclude zero returns (weekends)
• Average ask price for the day
• 1 January 1993 to 3 February 2008
• Training set: 3402 data points
• Test set: 1701 data points
Tobler's First Law of Geography
• Tobler's first law of geography (Tobler 1970) tells
us that ‘everything is related to everything else,
but near things are more related than distant
things’.
• We use this common sense principle to select and
prioritize our inputs.
Example: Inputs and Target
5 potential inputs, xn, and a target y
pn is the exchange rate n days in the future (so p0 is today’s rate)
x1 = log(p0/p-1)
x2 = log(p-1/p-3)
x3 = log(p-3/p-6)
x4 = log(p-6/p-13)
x5 = log(p-13/p-27)
y = log(p1/p0)
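A sketch of how these inputs and the target could be derived from a daily price series; the synthetic prices and the helper function are hypothetical, not the code used in the talk:

    import numpy as np

    # Placeholder price series shaped like a daily GBP/USD rate.
    rng = np.random.default_rng(0)
    p = 1.6 * np.exp(np.cumsum(rng.normal(0, 0.005, 400)))

    def features(p, t):
        # Inputs x1..x5 and target y for day t, where p[t] is p0.
        lags = [(0, 1), (1, 3), (3, 6), (6, 13), (13, 27)]
        x = [np.log(p[t - a] / p[t - b]) for a, b in lags]
        y = np.log(p[t + 1] / p[t])    # y = log(p1/p0)
        return x, y

    x, y = features(p, 100)   # needs t >= 27 and t + 1 < len(p)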
Example: 5 models
m1 = a11x1 + a10
m2 = a22x2 + a21x1 + a20
m3 = a33x3 + a32x2 + a31x1 + a30
m4 = a44x4 + a43x3 + a42x2 + a41x1 + a40
m5 = a55x5 + a54x4 + a53x3 + a52x2 + a51x1 + a50
Example: Assigning Priors 1
Assumption: rather than setting a uniform prior across
models, select a uniform prior across functions.
P(m) ∝ volume in parameter space
Assume that a, b, c ∈ [-5, 5]

Model          Volume
a              11¹
ax + b         11²
ax + by + c    11³
Example: Assigning Priors 2
How likely is each model? In practice, the efficient
market hypothesis implies that the simplest
functions are less likely. We shall penalize our
simplest model.
Example: Model Priors
P(m1) = c × 11² × 0.1 = 0.000006
P(m2) = c × 11³ = 0.000683
P(m3) = c × 11⁴ = 0.007514
P(m4) = c × 11⁵ = 0.082650
P(m5) = c × 11⁶ = 0.909147
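These numbers follow from normalizing the penalized volumes; a quick sketch:

    # Priors proportional to parameter-space volume: 11^k for k
    # coefficients, with the 0.1 penalty applied to the simplest model.
    volumes = [11**2 * 0.1, 11**3, 11**4, 11**5, 11**6]   # m1..m5
    c = 1.0 / sum(volumes)                                # normalizing constant
    priors = [c * v for v in volumes]
    print([round(p, 6) for p in priors])
    # [6e-06, 0.000683, 0.007514, 0.08265, 0.909147]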
Marginal Likelihood
The marginal likelihood is the marginal probability of
the data, given the model, and can be obtained by
summing (more generally, integrating) the joint
probabilities over all parameters, θ.
P(data|model) = ∫ P(data|model, θ) P(θ|model) dθ
Bayesian Information Criterion (BIC)
BIC is easy to calculate and enables us to
approximate the marginal likelihood
n = number of data points
k = number of free parameters
RSS is the residual sum of squares
BIC = n ln(RSS/n) + k ln(n)
marginal likelihood ∝ e^(-BIC/2)
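As a sketch, the following approximately reproduces the BIC and normalized marginal-likelihood columns of the table on the next slide (subtracting the minimum BIC before exponentiating is a numerical-stability choice, not part of the formula):

    import numpy as np

    n = 3402
    k = np.array([2, 3, 4, 5, 6])
    rss = np.array([0.05643, 0.05640, 0.05634, 0.05633, 0.05629])

    bic = n * np.log(rss / n) + k * np.log(n)
    raw = np.exp(-0.5 * (bic - bic.min()))   # shift for numerical stability
    marginal = raw / raw.sum()               # normalized, as in the table
    print(bic.round(0))
    print(marginal)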
Example: Model Likelihoods
Model   n      k   RSS       BIC       Marginal likelihood
m1      3402   2   0.05643   -37429    0.9505
m2      3402   3   0.05640   -37423    0.0443
m3      3402   4   0.05634   -37419    0.0051
m4      3402   5   0.05633   -37411    9.2218×10⁻⁵
m5      3402   6   0.05629   -37405    5.2072×10⁻⁶
Example: Model Posteriors
P(model|data) ∝ prior × likelihood
P(m1|data) = c × 6.21×10⁻⁶ × 0.95052 = 0.068
P(m2|data) = c × 0.00068 × 0.04429 = 0.349
P(m3|data) = c × 0.00751 × 0.00509 = 0.441
P(m4|data) = c × 0.08265 × 9.22×10⁻⁵ = 0.088
P(m5|data) = c × 0.90915 × 5.21×10⁻⁶ = 0.055
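The same arithmetic as a short sketch, combining the priors and normalized marginal likelihoods from the preceding slides:

    # P(model|data) is proportional to prior x likelihood; normalize.
    priors = [6.21e-6, 0.000683, 0.007514, 0.082650, 0.909147]
    likelihoods = [0.9505, 0.0443, 0.0051, 9.2218e-5, 5.2072e-6]
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    posteriors = [u / sum(unnorm) for u in unnorm]
    print([round(p, 3) for p in posteriors])
    # [0.068, 0.349, 0.441, 0.088, 0.055]: model 3 has the highest posterior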
Example: Best Model
We can choose the best model, the model with the
highest posterior probability:
Model 1: 0.068
Model 2: 0.349 (high)
Model 3: 0.441 (highest)
Model 4: 0.088
Model 5: 0.055
Example: Out of Sample Results
[Chart: out-of-sample log returns over 5 years for Models 1-5;
y-axis from 0 to 0.6]
Bayesian Model Averaging
• We chose the most probable model.
• But we can do better than that!
• It is optimal to take an average over all models,
with each model’s prediction weighted by its
posterior probability.
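A minimal sketch of the averaging step; the posterior weights are those computed above, while the per-model predictions are placeholder values:

    # Bayesian model averaging: weight each model's prediction by its
    # posterior probability. Predictions here are invented placeholders.
    import numpy as np

    posteriors = np.array([0.068, 0.349, 0.441, 0.088, 0.055])        # P(m_i|data)
    predictions = np.array([0.0011, 0.0007, 0.0004, 0.0009, 0.0002])  # placeholder returns
    bma = np.dot(posteriors, predictions)   # posterior-weighted average
    print(bma)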
Example: Out of Sample Results Including
Model Averaging
[Chart: out-of-sample log returns over 5 years for Models 1-5 and the
posterior-weighted model average; y-axis from 0 to 0.6]
Conclusions
• Our community typically worries about overfitting
avoidance and statistical significance, but our practical
successes have been due to the appropriate application of
bias.
• Be a Bayesian and use domain knowledge to make
intelligent assumptions and adhere to the rules of
probability.
• How ‘aligned’ your learning algorithm is with the domain
determines how well you will generalize.
Questions?
This PowerPoint presentation is available here:
http://www.cs.ucl.ac.uk/staff/M.Sewell/Sewell2008.ppt
Martin Sewell
[email protected]