Foundations of Data Mining
Session 1: Introduction
Joaquin Vanschoren, Mykola Pechenizkiy, Anne Driemel
Eindhoven University of Technology

Course Details I
Lecturers:
• Joaquin Vanschoren ([email protected]), MF 7.104a
• Mykola Pechenizkiy ([email protected]), MF 7.099
• Anne Driemel ([email protected]), MF 7.073
Contact hours:
• Mondays, 9:30 - 10:30: Questions and Answers (PAV J17)
• Mondays, 10:45 - 12:30: Plenary Lectures (PAV J17)
• Thursdays, 13:45 - 15:30: Plenary Lectures (AUD 16)
More info: http://www.win.tue.nl/~jvanscho/#!datamining

Course Details II
Materials: all lecture slides and assignments are posted to Canvas
• http://canvas.win.tue.nl/courses/5
• No set course book, but recommended reading for each lecture
Evaluation:
• No exam, only assignments (3 problem sets)
• Problem set 1: available Feb 1, in 3 parts with deadlines Feb 18, Feb 25, and Mar 3 (noon)
• Problem set 2: available Feb 29
• Problem set 3: available Mar 21
• Required knowledge is discussed in the lectures
• Work in teams of 2 students, rotated between problem sets
• Passing grade: 6/10 over all assignments

Learning objectives
• Understand how data mining algorithms work
• Reason about when and how to use them, and apply them successfully in practice
• Understand the mathematical/statistical foundations of data mining techniques
• Run practical experiments to experience first-hand how data mining algorithms behave on real data
• Explore how algorithm parameters and data properties affect the effectiveness of predictive models
• Recognize and formulate data analysis problems

Topics I
• Similarities and Distances
• Clustering
• Dimensionality Reduction
• Using Machine Learning software (R, Python)
• Rules and decision trees
• Evaluation and optimization
• Instance-based Learning
• Kernel methods
• Ensemble Learning
• Neural Networks
• Bayesian Learning

Topics II
Not covered:
• Reinforcement Learning
• Genetic Algorithms (only covered briefly)
• Handling complex input data (text, images, sensors, ...): we study the core algorithms, not how the data is represented
• Itemset Mining
• Association Rule Mining
• Outlier detection

Lecture Overview
• Prologue: What is machine learning?
• How can we learn? The problem of induction
• Simple strategies
• No free lunch theorem
• Overfitting
• The 5 'tribes' of machine learning:
• Symbolic Learning: express and manipulate symbolic knowledge
• Neural Networks (Connectionism): mimic the human brain
• Evolution: simulate the evolutionary process
• Probabilistic (Bayesian) Inference: reduce uncertainty by incorporating new evidence
• Learning by Analogy: recognize similarities between old and new situations

Prologue
Machine Learning is changing our world
• Search engines learn what you want
• Recommenders learn your taste in books, music, movies, ...
• Algorithms do automatic stock trading
• Elections are won by understanding voters
• Google Translate learns how to translate text
• Siri learns to understand speech
• DeepMind beats humans at Go
• Cars drive themselves
• Medicines are developed faster
• Smartwatches monitor your health
• Data-driven discoveries are made in Physics, Biology, Genetics, Astronomy, Chemistry, Neurology, ...

What's in a name?
• Many names: data mining, data science, machine learning, statistical learning, ...
• Subtle differences in scope, partly marketing
• We'll mostly use the term 'machine learning'
• Deep roots in statistics, neurology, biology, psychology, ... but it has developed into a new field of study
• How is it different from statistics?
• Breiman. Statistical Modeling: The Two Cultures. Statistical Science, 2001

Two cultures: classical statistics
Assume that the unknown system can be sufficiently described by a stochastic model with a given parametric function class for f():

    y = f(x, θ) + ε

Assuming the model is correct, we can do hypothesis tests, variance analysis, model comparison, confidence intervals, ...

Two cultures: machine learning I
The system is considered unknown; explicit modelling is not even attempted. Instead, the function is approximated by a neural network, a tree, ... The essential measure of goodness is prediction quality.

Two cultures: machine learning II
• In the statistical literature, many articles start with "Assume that the data are generated by the following model ..."
• Advantages: the data model can be interpreted if it is simple enough, and there is good theory for model diagnostics
• Disadvantages (Breiman 2001): "irrelevant" theory, and many interesting problems are not considered
• In machine learning, algorithms are studied based on examples from nature, intuitive behavior, or computational attractiveness
• Advantages: many more types of models are available, and new ideas can be implemented more quickly
• Disadvantages: models are mostly hard to interpret ("black boxes")
• Bluntly: statistics tries to help humans understand the system; machine learning tries to replace human faculties
• Today, both fields are learning a lot from each other

Machine learning tasks
• Supervised Learning
• Learn the relationship between "input" x and "output" y: search for a function f such that y ≈ f(x)
• Training data with labels is available
• Regression: y is a metric variable (with values in R)
• Classification: y is a categorical variable (unordered, discrete)
• Semi-supervised learning: also uses available unlabeled data, e.g. by assuming that similar inputs have similar outputs
• Unsupervised Learning
• There are no outputs; search for patterns within the inputs x
• Clustering: find groups of similar items
• Dimensionality reduction: describe the data with fewer features
• Outlier detection: what is out of the ordinary?
• Association rules: which things often happen together?

How can we learn? To date or not to date?

Nr  | Day of Week | Type of Date | Weather | TV Tonight | Date?
1   | Weekday     | Dinner       | Warm    | Bad        | No
2   | Weekend     | Club         | Warm    | Bad        | Yes
3   | Weekend     | Club         | Warm    | Bad        | Yes
4   | Weekend     | Club         | Cold    | Good       | No
Now | Weekend     | Club         | Cold    | Bad        | ?

Some terminology
• Rows: instances, examples (labelled/unlabelled)
• Columns: factors, features, attributes
• Last column: the target feature
• First column: an identifier (never give this to the learner!)
• Other columns: predictive features

• Is there one factor that perfectly predicts the answer?
• What about a conjunction of factors?
• Warm & Weekend → No date today :(
• Club & TV Bad → Date :)
• Club & Warm → No date...
• Weekend & TV Bad → Date...
• There's no way to know!

Hume's problem of induction (David Hume, 1748)
How can we ever be justified in generalizing from what we've seen to what we haven't?
• You have no basis to pick one generalization over the other
• Big data (the Casanova approach) won't help: the answer may depend on a factor you didn't consider
• What if the answer is just random?

What if we just assume that the future will be like the past?
• A risky assumption (cf. the inductivist turkey)
• Still the best we can do (and it generally seems to work)
• Even then, this only helps if we have seen the exact same situation before
• The machine learning problem remains: how do we generalize to cases that we haven't seen before?
• What if somebody types a unique Google query?
• What if a patient comes in with slightly different symptoms?
• What if someone writes a new, unique spam email?
• Even with all the data in the world, your chances of finding the exact same case are almost zero. We need induction.

No Free Lunch Theorem (David Wolpert, 1996)
If all functions f(x) = y are equally likely, all algorithms that aim to optimize that function have identical performance.
• This sets a limit on how good a learner can be: no learner can be better than random guessing!
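The random-guessing limit can be checked by brute force. The sketch below (my own illustration, not from the slides) fixes an arbitrary deterministic learner, then averages its accuracy on the unseen inputs over every possible labelling of those inputs, all assumed equally likely as the theorem requires:

```python
from itertools import product

# All 3-bit inputs; the first four are "seen" (training), the rest "unseen".
inputs = list(product([0, 1], repeat=3))
seen, unseen = inputs[:4], inputs[4:]

# An arbitrary deterministic learner: majority vote over the training labels.
train = {x: 0 for x in seen}          # any fixed training labels will do
ones = sum(train.values())
guess = 1 if 2 * ones >= len(train) else 0

# Average the accuracy on the unseen inputs over every possible "true"
# labelling of those inputs.
accs = []
for labels in product([0, 1], repeat=len(unseen)):
    truth = dict(zip(unseen, labels))
    correct = sum(guess == truth[x] for x in unseen)
    accs.append(correct / len(unseen))

print(sum(accs) / len(accs))          # 0.5: exactly random-guessing level
```

Whatever learner you substitute for the majority-vote rule, the average stays at 0.5: every labelling where it does well is cancelled by the "anti-world" labelling with the unseen labels flipped.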
• But then why is the world full of highly successful learners?
• For every world where a learner does better than random guessing, we can construct an anti-world by flipping the labels of all unseen instances: there it performs worse by the same amount
• But we don't care about all possible worlds, only the one we live in
• We assume we know something about this world that gives us an advantage. That knowledge is fallible, but it's a risk we'll have to take

The futility of bias-free learning
Practical consequence: there is no such thing as learning without knowledge (assumptions). Data alone is not enough.
• We need to provide prior knowledge to the algorithm, or make assumptions when constructing hypotheses
• Examples: the structure of a neural net, a Bayesian prior, background knowledge expressed as rules, the way a tree represents knowledge, ...
• These assumptions are called the learner's bias; i.e., bias-free learning is impossible
• Every new piece of knowledge is the basis for more knowledge
• What assumptions can we start from that are not too strong?
Newton's principle of induction: whatever is true of everything we've seen is true of everything in the universe.
• We induce the most widely applicable rules we can, and reduce their scope only when the data forces us to.

Learning = Representation + Evaluation + Optimization
All learners consist of three main components (each introducing bias):
• Representation: the model must be expressed in a formal language that the computer can handle
• This defines the concepts it can learn: the hypothesis space
• Evaluation: how do we choose one hypothesis over another?
• The evaluation function (objective function, scoring function)
• Can differ from the external evaluation function (e.g. accuracy)
• Optimization: how do we search the hypothesis space?
• Key to the efficiency of the learner
• Determines how many optima can be found
• Often starts from the simplest hypothesis and relaxes it as needed to explain the data

A dating algorithm
• Representation: conjunctions of factors (conjunctive concepts)
• Optimization: start with the best 1-factor concept, then add the best additional factor on the remaining data
• Don't try all combinations (combinatorial explosion)!
• Evaluation: exclude the most bad matches and the fewest good ones
• Result: Weekend ∧ Warm (→ No date)
• Thoughts?

Learning sets of rules (Michalski)
• Real concepts are disjunctive →
sets of rules:
• Credit card used on 3 different continents yesterday → stolen
• Credit card used twice after 23:00 on a weekday → stolen
• Credit card used to buy 1 dollar of gas → stolen
• Divide-and-conquer approach: build a rule covering as many positive examples as possible, discard the positive examples it covers, and repeat until all are covered
• Sets of rules can represent any concept. How?
• Just turn each positive instance into a rule using all its factors:
• Weekend ∧ Club ∧ Warm ∧ Bad → Yes
• A 100% accurate rule? Or an illusion?
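The divide-and-conquer covering loop described above can be sketched in a few lines. This is a toy implementation (the feature names and the precision-based literal scoring are my own choices, not from the slides), run on the dating table:

```python
def covers(rule, x):
    # A rule is a set of (feature, value) conditions, read as a conjunction.
    return all(x[f] == v for f, v in rule)

def learn_rules(data):
    """Divide and conquer: greedily grow one conjunctive rule for the
    positive class, discard the positives it covers, and repeat."""
    rules, remaining = [], list(data)
    while any(y for _, y in remaining):
        rule = set()
        while True:
            pos = [x for x, y in remaining if y and covers(rule, x)]
            neg = [x for x, y in remaining if not y and covers(rule, x)]
            if not neg:                     # consistent on the remaining data
                break
            best, best_prec = None, -1.0    # greedily pick the literal
            for f in pos[0]:                # with the best precision
                if any(f == g for g, _ in rule):
                    continue                # feature already constrained
                for v in {x[f] for x, _ in remaining}:
                    cand = rule | {(f, v)}
                    p = sum(covers(cand, x) for x in pos)
                    n = sum(covers(cand, x) for x in neg)
                    if p and p / (p + n) > best_prec:
                        best, best_prec = cand, p / (p + n)
            if best is None:                # no literal helps: give up
                break
            rule = best
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining
                     if not (y and covers(rule, x))]
    return rules

# The dating table from the slides (instances 1-4; True = date)
data = [
    ({"day": "Weekday", "type": "Dinner", "weather": "Warm", "tv": "Bad"},  False),
    ({"day": "Weekend", "type": "Club",   "weather": "Warm", "tv": "Bad"},  True),
    ({"day": "Weekend", "type": "Club",   "weather": "Warm", "tv": "Bad"},  True),
    ({"day": "Weekend", "type": "Club",   "weather": "Cold", "tv": "Good"}, False),
]
print(learn_rules(data))
```

On these four rows the loop learns the single rule Weekend ∧ Warm → date, which leaves the 'Now' row (Cold) uncovered, i.e. predicted No.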
• Every new (unseen) example will be classified as negative
• No free lunch: you can't learn without assuming anything

Overfitting
• The learner finds a pattern in the data that is not actually true in the real world: it overfits the data
• Humans also overfit when they overgeneralize from an incomplete picture of the world
• Every powerful learner can 'hallucinate' patterns
• This happens when there are too many hypotheses and not enough data to tell them apart
• The more data, the more 'bad' hypotheses are eliminated
• If the hypothesis space is not constrained, there may never be enough data
• There is often a parameter that allows you to constrain (regularize) the learner

Overfitting in rule learning
[Figure: two JRip rule learners on the same data. The overfitting learner has better training set performance (train mmce = 0.02 vs 0.04) but worse test set performance; the non-overfitting learner generalizes better (CV mmce = 0.08 vs 0.085).]

Another example
[Figure: two RBF-kernel SVMs (C = 1). With sigma = 100 the model overfits (train mmce = 0.01, CV mmce = 0.215); with sigma = 1 it has worse training but better test performance (train mmce = 0.045, CV mmce = 0.055).]

Overfitting and noise
• Overfitting is seriously exacerbated by noise (errors in the training data)
• An unconstrained learner will model that noise
• A popular misconception is that overfitting is always caused by noise

Nr  | Day of Week | Type of Date | Weather | TV Tonight | Date?
1   | Weekday     | Dinner       | Warm    | Bad        | No
2   | Weekend     | Club         | Warm    | Bad        | Yes
3   | Weekend     | Club         | Warm    | Good       | Yes
4   | Weekend     | Club         | Cold    | Good       | No
Now | Weekend     | Club         | Cold    | Bad        | ?
• We misremembered occasion 3 (the TV was not actually bad that night)
• If we model this instance, the resulting rule will be worse
• Better to ignore a few instances than to try to get all of them correct

Overfitting and noise

Nr  | Day of Week | Type of Date | Weather | TV Tonight | Date?
1   | Weekday     | Dinner       | Warm    | Bad        | No
2   | Weekend     | Club         | Warm    | Bad        | Yes
3   | Weekend     | Club         | Warm    | Bad        | No
4   | Weekend     | Club         | Cold    | Good       | No
Now | Weekend     | Club         | Cold    | Bad        | ?

• There was another reason your friend said no, one that has nothing to do with the listed factors
• You are better off misclassifying this instance
• Worse: it is now impossible to find any consistent set of rules

Overfitting in regression models
In regression, likewise, the complexity of the considered models needs to be limited. More data also helps here.
[Figure: three regression fits illustrating a good model, overfitting, and underfitting.]

Avoiding overfitting
• Never believe your model until you have verified it on data that the learner did not see
• This is the scientific method applied to machine learning: the model must make new predictions that can be experimentally verified
• Randomly divide the data into:
• a training set, which you give to the learner
• a test set, which you hide from it to verify predictive performance
• A learner can do this internally to stop short of perfectly fitting the data
• Do a statistical significance test to see whether one hypothesis is significantly better than another
• Throw out low-significance hypotheses
• Prefer simpler hypotheses altogether (e.g. divide-and-conquer, regularization)

Model complexity can often be controlled by an algorithm (regularization) parameter. Optimizing it by tracking the actual (test set) error is always recommended.
[Figure: error vs. model complexity. The apparent (training) error keeps decreasing with complexity, while the actual (test) error first decreases (underfitting region) and then rises again (overfitting region).]

A note on Occam's razor
Occam's razor: prefer the simplest hypothesis that fits all the data.
• This does NOT reduce overfitting (a popular misconception)
• Simpler models are preferable for other reasons (e.g.
computational and cognitive cost)

Bias-Variance analysis
• Overfitting can be better understood by decomposing the test set error into:
• Bias: the learner's tendency to consistently misclassify certain instances (underfitting)
• Variance: the learner's tendency to learn random things irrespective of the real signal (overfitting)
• Noise: intrinsic error, independent of the learner
• The decomposition can be estimated by comparing predictions after training the model on many random samples of the training data
• High bias: reduce underfitting by making the model more flexible (or choosing another model). Try boosting.
• High variance: reduce overfitting by making the model less flexible (regularization), or add more data. Try bagging.

The 5 'tribes' of Machine Learning
Rival schools of thought in machine learning, each with its own core beliefs and a distinct strategy for learning:
• Symbolic Learning: express and manipulate symbolic knowledge (rules and trees)
• Neural Networks (Connectionism): mimic the human brain (neural nets)
• Evolution: simulate the evolutionary process (genetic algorithms)
• Probabilistic (Bayesian) Inference: reduce uncertainty by incorporating new evidence (graphical models, Gaussian processes)
• Learning by Analogy: recognize similarities between old and new situations (kNN, Support Vector Machines)
This is meant as an overview: the concepts are explained later in the course.

Symbolic Learning
• All intelligence can be reduced to manipulating symbols
• Can incorporate preexisting knowledge (e.g.
as rules)
• Can combine knowledge and data to fill in the gaps (like scientists do)
• Representation: rules, trees, first-order logic rules
• Evaluation: accuracy, information gain
• Optimization: top-down induction, inverse deduction
• Algorithms: decision trees, logic programs

Symbolic Learning: Trees
[Figure: scatter plot of Petal.Length vs. Petal.Width for the iris data, with the corresponding decision tree: split on Petal.Length ≤ 1.9, then Petal.Width ≤ 1.7, then Petal.Length ≤ 4.8.]

Symbolic Learning
• Deduction: Socrates is human + humans are mortal = ?
• Inverse deduction: Socrates is human + ? = Socrates is mortal
• Background knowledge (e.g. gene interactions, metabolic pathways) can be expressed in first-order logic
• Inverse deduction can infer new hypotheses
• Robot scientist: learns hypotheses, then designs and runs experiments to test them

Neural Networks
• Learning is what the brain does: reverse-engineer it
• Adjust the strengths of the connections between neurons
• Can handle raw, high-dimensional data and constructs its own features
• Representation: neural network
• Evaluation: squared error
• Optimization: gradient descent
• Algorithms: backpropagation
• Hebbian learning: neurons that fire together, wire together
• Backpropagation assigns 'blame' for errors to neurons earlier in the network
• Many applications (deep learning)

Evolution
• Natural selection is the mother of all learning
• Simulate evolution on a computer
• Can learn structure, e.g.
the shape of a brain
• Representation: genetic programs (often trees)
• Evaluation: fitness function
• Optimization: genetic search
• Algorithms: genetic programming (crossover, mutation)

Probabilistic (Bayesian) Learning
• Learning is a form of uncertain inference
• Uses Bayes' theorem to incorporate new evidence into our beliefs
• Can deal with noisy, incomplete, even contradictory data
• Representation: graphical models, Markov networks
• Evaluation: posterior probability
• Optimization: probabilistic inference
• Algorithms: Bayes' theorem and its derivatives
• Choose a hypothesis space and a prior for each hypothesis
• As evidence comes in, update the probability of each hypothesis
• Posterior: how likely a hypothesis is after seeing the data

Learning by Analogy
• Learning is recognizing similarities between situations and inferring other similarities
• Generalizes from similarity
• Transfers solutions from previous situations to new situations
• Representation: memory, support vectors
• Evaluation: margin
• Optimization: kernel machines
• Algorithms: Nearest Neighbor, Support Vector Machines

Nearest neighbors
• Given cities belonging to 2 countries: where is the border?
• Nearest neighbor: a point belongs to the country of the closest city
• k-nearest neighbors: vote over the k nearest cities (a smoother border)

Support Vector Machines (kernel methods)
• Only remember the points that define the border (support vectors)
• Find the linear border with maximal margin to the nearest points
• If the data is not linearly separable, transform the input space (kernel trick)

Further Reading
• P. Domingos. A Few Useful Things to Know about Machine Learning
• P. Domingos. The Master Algorithm
• G. James et al. An Introduction to Statistical Learning
• P. Flach. Machine Learning
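As a closing illustration of the analogy-based approach from the nearest-neighbors slide (my own minimal sketch, not part of the course materials; the "cities of two countries" data is invented), a k-nearest-neighbor classifier needs only a distance function and a majority vote:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote over its k nearest training points.
    `train` is a list of (point, label) pairs; points are coordinate tuples."""
    neighbours = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: "cities" of two countries on a map
cities = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(cities, (1, 1), k=3))   # "A": the 3 nearest cities are A's
print(knn_predict(cities, (5, 4), k=3))   # "B"
```

Increasing k smooths the implied border between the two countries, at the cost of blurring fine detail, which is exactly the flexibility trade-off discussed under overfitting.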