Center for Complex Engineering Systems
Introduction to Applied Machine Learning
Abdullah Almaatouq ([email protected])
Abdulaziz Alhassan (a.alhassan@cces-kacst-mit.org)
Ahmad Alabdulkareem ([email protected])

Tutorial Outline
• What is machine learning
• How to learn
• How to learn well
• Practical session

Examples of Machine Learning

What is Machine Learning
• Giving computers the ability to learn without being explicitly programmed. – Arthur Samuel, 1959
• Machine learning is a computer program that improves its performance in a certain task with experience. – Tom Mitchell, 1998

What is Machine Learning
• Making predictions based on patterns learned from historical data.
• Not to be confused with data mining – data mining focuses on exploiting patterns in historical data.
• New field?
 – Statistical generalization
 – Computational statistics

Why Machine Learning?
• Huge amounts of data!
• Humans are expensive, computers are cheap.
 – 88 ECU, 32 CPUs and 244 GB RAM ≈ $3.50/h
 – Minimum wage in the US is $7.25/h
• Rule-based systems are (often):
 – difficult to capture all relevant patterns
 – unable to cover patterns you don't know exist
 – very hard to maintain
 – (often) not working well
• Psychic superpowers (sometimes!)

When to Use Machine Learning
• There is a pattern to be detected
• You have data
• You can NOT pin it down mathematically (figure: distance vs. time, d = r × t, is a pattern you can write down exactly, so no learning is needed)

Types of Problems
• Supervised
 – Training on seen, labeled examples to generalize to unseen new observations
 – Given some input x_i, predict the class/value of y_i
• Unsupervised
 – Find hidden structures in unlabeled data

Types of Problems
Machine learning problems: Supervised (Regression, Classification) and Unsupervised (Clustering, Outlier Detection).

Supervised Learning
• Classification
 – Binary: given x_i, find y_i in {-1, 1}
 – Multicategory: given x_i, find y_i in {1, ..., K}
• Regression
 – Given x_i, find y_i in R (or R^d)
• Examples: spam vs. not spam, images to digits, beak depth prediction

Supervised Learning Framework
An unknown function f: x → y generates the training examples (x_1, y_1), ..., (x_n, y_n). The machine learning algorithm takes these as input and, using a hypothesis set, an error function, and optimization routines, chooses a known function g ≈ f. For a new observation x, g predicts its value/category y = h(x).

Linear Regression
Linear form: y = ax + b
• Capture the collective behavior of credit officers
• Predict the credit line for new customers
Input x = [bias = 1, age = 23 years, annual salary = $144,000, years in residence = 4, years in job = 2, ...] = [1, 23, 144000, 4, 2, ...]
Linear regression output: h(x) = θ_0 x_0 + θ_1 x_1 + ... + θ_d x_d = Σ_{i=0}^{d} θ_i x_i = θ^T x

Linear Regression
• The historical dataset is (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
• y_n ∈ R is the credit line for customer x_n
• How well does our h(x) = θ^T x approximate f(x)?
 – Squared error: (h(x) − f(x))^2
 – E(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2
• The goal is to choose the h that minimizes E(h)
(Figure: a regression line in 2D and a regression plane in 3D; it works the same way in higher dimensions.)

Setting Up the Problem
Stack the observations as the rows of X and the features as its columns: X = [x_1^T; x_2^T; ...; x_N^T]. Then
E(θ) = (1/N) Σ_{n=1}^{N} (θ^T x_n − y_n)^2 = (1/N) ||Xθ − y||^2

Minimizing E(θ)
E(θ) = (1/N) ||Xθ − y||^2
∇E(θ) = (2/N) X^T (Xθ − y) = (2/N) (X^T X θ − X^T y) = 0
θ = (X^T X)^{-1} X^T y   (the inverse here is the pseudo-inverse)
Python: theta = linalg.inv(X.T*X)*X.T*y
MATLAB: theta = inv(X'*X)*X'*y
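A minimal sketch of the closed-form solution above in NumPy; the synthetic data, the random seed, and the variable names are illustrative assumptions, not taken from the tutorial notebook:

import numpy as np

# Synthetic data: N observations, d features, plus a bias column of ones (x0 = 1).
rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=N)   # noisy targets

# Normal equation: theta = (X^T X)^{-1} X^T y.
# np.linalg.pinv gives the pseudo-inverse, as the slide notes.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

# In-sample squared error E(theta) = (1/N) * ||X theta - y||^2
E = np.mean((X @ theta - y) ** 2)
print(theta, E)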
Linear Regression in the Supervised Learning Framework
The unknown function f: x → y is the credit officers' decision process; the training examples (x_1, y_1), ..., (x_n, y_n) are the historical credit lines; the hypothesis set is h(x) = ax + b; the error function is the squared error; the optimization routine is gradient descent; the learning algorithm is linear regression. For a new application x, the learned g ≈ f predicts its credit line y.

Support Vector Machines
• Separating plane with maximum margin
• Margins are configurable with parameters to allow for mistakes
• Gold-standard black box for many practitioners

Support Vector Machines
• Effective in high-dimensional spaces
• Effective even when #features > #observations
• Can use different kernel functions: linear, polynomial, RBF
(A minimal code sketch appears at the end of this part.)

Supervised Learning Framework (recap of the diagram above)

Types of Problems (recap)
Machine learning problems: Supervised (Regression, Classification) and Unsupervised (Clustering, Outlier Detection).

Unsupervised Learning
• Trying to find hidden structure in unlabeled data
• Clustering – group similar data points into "clusters"
 – k-means
 – Mixture models
 – Hierarchical clustering

Unsupervised Learning: k-means
• "Birds of a feather flock together"
• Group samples around a "mean":
 – Start with random "means" and assign each point to the nearest "mean"
 – Calculate new "means"
 – Re-assign points to the new "means", and repeat
(Figure sequence: random initial means → assignment → mean recalculation → re-assignment, repeated until convergence. A code sketch appears at the end of this part.)

Unsupervised Learning: Mixture Models
• Presence of subpopulations within an overall population

Anomaly Detection
• Detect abnormal data
• Whitelisting good behavior

Anomaly Detection: Statistical Anomaly Detection
Given m observations with features X1, X2, ..., Xn:

 #    X1   X2   ...   Xn
 1    12    8   ...   48
 2    14    7   ...   52
 3    16    8   ...   57
 4    11    6   ...   53
 ...  ...  ...  ...   ...
 m    13    7   ...   56

Fit a density to each feature and model
P(x) = p(x1; α1, μ1) · p(x2; α2, μ2) ⋯ p(xn; αn, μn)
and flag observations with low P(x). (Figure: the fitted density curve. A code sketch appears at the end of this part.)

Anomaly Detection: Local Outlier Factor (LOF)
• Based on local density
• Locality is given by the nearest neighbors

How to Learn (Well)
Initial critical obstacles:
• Choosing features:
 – by domain experts' knowledge and expertise
 – or by feature-extraction algorithms
• Choosing parameters, for example:
 – degree of the regression equation
 – SVM parameters (C, gamma)
 – number of clusters: k-means (set before clustering), agglomerative (set after)

A Simple Example: Fitting a Polynomial
• The green curve is the true function (which is not a polynomial)
• We use a loss function that measures the squared error in the prediction of y(x) from x
(Figure: several fits to the data; which is best?)

Occam's Razor
• "The simplest explanation for some phenomenon is more likely to be accurate than more complicated explanations."

A simple way to reduce model complexity: add penalties (regularization). (Figures from Bishop.)

Example: Mixture Models

Using a Validation Set
• Divide the total dataset into three subsets:
 – Training data: for learning the parameters of the model
 – Validation data: for deciding what type of model and what amount of regularization works best
 – Test data: for getting a final, unbiased estimate of how well the algorithm works; we expect this estimate to be worse than the one on the validation data
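The SVM slides above mention margins, the C and gamma parameters, and kernels; a minimal scikit-learn sketch, where the toy blob data and the chosen parameter values are illustrative assumptions:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy binary classification data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF kernel; C controls how many margin violations are tolerated,
# gamma controls the kernel width.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))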
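A minimal k-means sketch following the steps on the k-means slide (random initial means, assignment, mean recalculation, repeat), written directly in NumPy so the loop is visible; the three-blob data and the cap of 20 iterations are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
# Three well-separated blobs of 2-D points.
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

k = 3
means = X[rng.choice(len(X), size=k, replace=False)]   # random initial "means"
for _ in range(20):
    # Assignment step: each point goes to its nearest mean.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recalculation step: each mean becomes the centroid of its points.
    new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):
        break
    means = new_means
print(means)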
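The statistical anomaly detection slide writes P(x) as a product of per-feature densities; a minimal sketch assuming independent Gaussian densities per feature, with the synthetic data, the injected outlier, and the threshold all being illustrative assumptions:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Mostly "normal" observations with 3 features, plus one obvious outlier.
X = rng.normal(loc=[12, 7, 50], scale=[2, 1, 3], size=(200, 3))
X = np.vstack([X, [30, 0, 90]])

# Fit one Gaussian per feature.
mu, sigma = X.mean(axis=0), X.std(axis=0)

# P(x) = p(x1) * p(x2) * ... * p(xn), assuming independent features.
P = norm.pdf(X, loc=mu, scale=sigma).prod(axis=1)

threshold = 1e-6                     # illustrative cutoff for "low density"
print(np.where(P < threshold)[0])    # indices of suspected anomalies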
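Putting the last few slides together: a sketch of fitting polynomials of different degrees, adding a penalty (ridge regularization), and using a validation split to pick the degree and the amount of regularization. The noisy sine data, the candidate degrees, and the alpha values are illustrative assumptions:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)   # true curve is not a polynomial
X = x.reshape(-1, 1)

# Training / validation / test split, as on the "Using a Validation Set" slide.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best = None
for degree in (1, 3, 9):
    for alpha in (0.0, 1e-3, 1e-1):              # alpha is the penalty strength
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
        model.fit(X_train, y_train)
        val_err = np.mean((model.predict(X_val) - y_val) ** 2)
        if best is None or val_err < best[0]:
            best = (val_err, degree, alpha, model)

val_err, degree, alpha, model = best
test_err = np.mean((model.predict(X_test) - y_test) ** 2)   # final, unbiased estimate
print(degree, alpha, val_err, test_err)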
PCA
• Data visualization
• Noise reduction
• Data reduction
• Can help with the curse of dimensionality
• And more

What Is PCA?
• Orthogonal directions of greatest variance in the data
(Figure: PC 1 and PC 2 drawn over original variables A and B.)

Principal Components
• The first PC is the direction of maximum variance from the origin
• Subsequent PCs are orthogonal to the first PC and describe the maximum residual variance
(Figures: PC 1 and PC 2 in the wavelength 1 vs. wavelength 2 plane; projection of a point (x_i1, x_i2) onto the 1st and 2nd principal components y_i,1 and y_i,2.)

PCA on a Real Example
(Figure: original image compared with PCA reconstructions using r = 4, 64, 256, and 3600 components.)

Thank You
• Make sure you have Python and the IPython notebook
 – The easiest way to get this running: http://continuum.io/downloads
• Download the tutorial IPython notebook: http://www.amaatouq.com/notebook.zip
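As a closing sketch, PCA with scikit-learn on illustrative correlated data (the two-feature data and the variable names A and B are assumptions, echoing the "original variable A / B" figure rather than the tutorial notebook):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Correlated 2-D data, so most of the variance lies along one direction.
A = rng.normal(size=200)
B = 0.8 * A + rng.normal(scale=0.3, size=200)
X = np.column_stack([A, B])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # coordinates along PC 1 and PC 2
print(pca.components_)                 # orthogonal directions of greatest variance
print(pca.explained_variance_ratio_)   # how much variance each PC captures

# Data reduction: keep only PC 1 and reconstruct an approximation of X.
pca1 = PCA(n_components=1)
X_approx = pca1.inverse_transform(pca1.fit_transform(X))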