Introduction to Probability Theory in Machine Learning: A Bird's-Eye View
Mohammed Nasser
Professor, Dept. of Statistics, RU, Bangladesh
Email: [email protected]

Content of Our Present Lecture
• Introduction
• Problem of Induction and the Role of Probability
• Techniques of Machine Learning
• Density Estimation
• Data Reduction
• Classification and Regression Problems
• Probability in Classification and Regression
• Introduction to Kernel Methods

Introduction
• The problem of searching for patterns in data is the basic problem of science.
• The extensive astronomical observations of Tycho Brahe (1546-1601) in the 16th century allowed Johannes Kepler (1571-1630) to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics.

Introduction
• Darwin's (1809-1882) study of nature during the five-year voyage of HMS Beagle revolutionized biology.
• The discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century.
• Of late, the field of pattern recognition has become concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.

Problem of Induction
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.

Problem of Induction
• Suppose we somehow have measurements x1, x2, ..., xn, where n is very large, e.g. n = $10^{10000000}$.
• Each of x1, x2, ..., xn satisfies a proposition P.
• Can we say that the (n+1)th, i.e. the ($10^{10000000}$ + 1)th, observation satisfies P? Certainly not.

Problem of Induction
• Let us consider $P(n) = 10^{10} - \frac{n}{10}$.
• The question: is P(n) > 0? It is positive up to a very, very large number, but after that it becomes negative.
• What can we do now? The probabilistic framework comes to the rescue!

Problem of Induction
What is the probability p that the sun will rise tomorrow?
• p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow.
• p = 1, because the sun rose in all past experiments.
• p = 1 - ε, where ε is the proportion of stars that explode per day.
• p = (d+1)/(d+2), which is Laplace's rule derived from Bayes' rule (d = number of past days on which the sun rose).
Conclusion: we predict that the sun will rise tomorrow with high probability, independent of the justification.

The Sub-Fields of ML
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Density Estimation, Data Reduction
• Reinforcement Learning

Unsupervised Learning: Density Estimation
• What is the weight of an elephant?
• What is the weight/distance of the sun?
• What is the weight/size of a baby in the womb?
• What is the weight of a DNA molecule?

Solution of the Classical Problem
• Suppose we somehow have measurements x1, x2, ..., xn.
• One-million-dollar question: how can we choose the optimum one among the infinitely many possible ways of combining these n observations to estimate the target μ? And what is the optimum n? (A small simulation sketch appears below, after the probability-measure slides.)
• We need the concepts: a probability distribution for the i-th observation, $X_i \sim F(x \mid \theta)$; the target μ that we want to estimate; and probability measures.

Meaning of Measure
Continuity of a probability measure: $P\left(\bigcup_{n \ge 1} A_n\right) = \lim_{n \to \infty} P(A_n)$ whenever $A_n \uparrow A$.

Probability Measures
On any sample space a probability measure is either
• Discrete: P(A) = 1 for some A with #(A) finite or countable, or
• Continuous: P{x} = 0 for all x, which splits into
  – Absolutely continuous
  – Non-absolutely continuous
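To make the question on the "Solution of the Classical Problem" slide concrete, here is a small simulation sketch (assuming Python with NumPy; not part of the original slides): the sample mean combines the n observations well when F(x | θ) is Normal, but fails for a heavy-tailed model, which is exactly why the shape of the model matters on the slides that follow.

```python
# Sketch: how good is the sample mean as a combiner of x1,...,xn under different models?
# Target mu = 0 in both cases; heavy tails (Cauchy) break the sample mean but not the median.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200

for model in ("normal", "cauchy"):
    mean_err, median_err = [], []
    for _ in range(reps):
        x = rng.standard_normal(n) if model == "normal" else rng.standard_cauchy(n)
        mean_err.append(abs(np.mean(x)))      # error of the sample mean
        median_err.append(abs(np.median(x)))  # error of the sample median
    print(f"{model:>7}: avg |error| of mean = {np.mean(mean_err):.3f}, "
          f"of median = {np.mean(median_err):.3f}")
```

Under the Normal model both estimators are close to the target; under the Cauchy model the sample mean does not settle down at all, so the "optimum combination" depends on the model.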
Distributions on R^k
On R^k we have special concepts: discrete distributions and continuous distributions.

Different Shapes of the Models
• Is the sample mean appropriate for all the models?
• Knowing the means is not the same as knowing the population: ideally I know Pr(a < X < b) for every a and b.

Approaches of Model Estimation
• Parametric: Bayesian or non-Bayesian
• Nonparametric: cdf estimation, density estimation
• Semiparametric

Infinite-Dimensional Ignorance
• Generally, any function space is infinite-dimensional.
• Parametric modeling assumes our ignorance is finite-dimensional.
• Semiparametric modeling assumes our ignorance has two parts: one finite-dimensional and the other infinite-dimensional.
• Nonparametric modeling assumes our ignorance is infinite-dimensional.

Parametric Density Estimation
Nonparametric Density Estimation
Semiparametric/Robust Density Estimation
(Illustrative figures comparing a parametric model with a nonparametric model.)

Application of Density Estimation
(Figures: picture of three objects and the distribution of the three objects.)

Curse of Dimension
(Figures courtesy: Bishop, 2006.) If the population model is multivariate normal with high correlation, it works well.

Unsupervised Learning: Data Reduction
(Figure: data in the (u, v) plane lying near a one-dimensional manifold parameterized by z.)

Problem-2
• Fisher's Iris Data (1936): this data set gives the measurements in cm (or mm) of the variables
  – Sepal length
  – Sepal width
  – Petal length
  – Petal width
  – Species (setosa, versicolor, and virginica)
• There are 150 observations, 50 from each species. We want to predict the class of a new observation. What methods are available to do the job? (A minimal classification sketch appears after Problem 10 below.)
• LOOK! The dependent variable is categorical; the independent variables are continuous.

Problem-3
• BDHS (2004): the dependent variable is childbearing risk, with two values (High Risk and Low Risk). The target is to predict the childbearing risk based on some socio-economic and demographic variables. The complete list of variables is given on the next slide.
• Again we are in a situation where the dependent variable is categorical and the independent variables are mixed.

Problem-4: Face Authentication/Identification
• Face authentication/verification (1:1 matching)
• Face identification/recognition (1:N matching)
• Applications: access control (www.viisage.com, www.visionics.com), video surveillance (on-line or off-line), face scans at airports (www.facesnap.de).
• Why is face recognition hard? Inter-class similarity (twins, father and son) and intra-class variability.

Handwritten Digit Recognition
We want to recognize postal codes automatically.

Problem 6: Credit Risk Analysis
• Typical customer: a bank.
• Database: current clients' data, including basic profile (income, house ownership, delinquent account, etc.) and a basic classification.
• Goal: predict/decide whether to grant credit.

Problem 7: Spam Email Detection, Search Engines, etc.
(traction.tractionsoftware.com, www.robmillard.com)

Problem 9: Genome-Wide Data
• mRNA expression data
• hydrophobicity data
• protein-protein interaction data
• sequence data (gene, protein)

Problem 10: Robot Control
• Goal: control a robot in an unknown environment.
• Needs both to explore (new places and actions) and to use acquired knowledge to gain benefits.
• The learning task "controls" what is observed!
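Several of the problems above (Iris, BDHS, digits, faces, credit, spam) are classification tasks with a categorical dependent variable. As promised under Problem-2, here is a minimal sketch (assuming Python with scikit-learn; not the lecture's own code) of fitting and evaluating a classifier on Fisher's Iris data.

```python
# Fit a simple classifier on Fisher's Iris data and predict the class of a new observation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 observations, 4 continuous features, 3 species

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("training accuracy:", clf.score(X_tr, y_tr))
print("test accuracy:   ", clf.score(X_te, y_te))

# A new observation: sepal length/width, petal length/width in cm.
print("predicted class:", clf.predict([[5.9, 3.0, 4.2, 1.5]])[0])
```

Any of the other classifiers named later in the lecture (SVM, decision tree, random forest) could be swapped in for the logistic regression without changing the rest of the sketch.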
Problem-11
• Wisconsin Breast Cancer Database (1992): this breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. You can also get it from http://www.potschi.de/svmtut/breast-cancer-wisconsin.data.
• The variables are:
  – Clump Thickness: 1 - 10
  – Uniformity of Cell Size: 1 - 10
  – Uniformity of Cell Shape: 1 - 10
  – Marginal Adhesion: 1 - 10
  – Single Epithelial Cell Size: 1 - 10
  – Bare Nuclei: 1 - 10
  – Bland Chromatin: 1 - 10
  – Normal Nucleoli: 1 - 10
  – Mitoses: 1 - 10
  – Status: (benign, malignant)
• There are 699 observations available. We want to predict whether a patient's status is benign or malignant. The dependent variable is categorical. Independent variables???

Problem 12: Data Description
Cardiovascular diseases affect the heart and blood vessels and include shock, heart failure, heart valve disease, congenital heart disease, etc. Despres et al. pointed out that the topography of adipose tissue (AT) is considered a risk factor for cardiovascular disease. It is important to measure the amount of intra-abdominal AT as part of the evaluation of an individual's cardiovascular-disease risk.

Adipose Tissue Data Description
• Problem: computed tomography of AT is very costly, requires irradiation of the subject, and is not available to many physicians.
• Materials: simple anthropometric measurements, such as waist circumference, which can be obtained cheaply and easily.
• Variables: Y = deep abdominal AT, X = waist circumference (in cm). There are 109 observations (men).
• Question: how well can we predict deep abdominal AT from knowledge of the waist circumference?
• Data source: W. W. Daniel (2003).

Complex Problem 13
Hypothesis: the infant's size at birth is associated with maternal characteristics and SES.
Variables:
X, Maternal & SES:
1. Age (x1)
2. Parity (x2)
3. Gestational age (x3)
4. Mid-upper arm circumference, MUAC (x4)
5. Supplementation group (x5)
6. SES index (x6)
Y, Infant's size at birth:
1. Weight (y1)
2. Length (y2)
3. MUAC (y3)
4. Head circumference, HC (y4)
5. Chest circumference, CC (y5)
CCA, KCCA, MR, PLS, etc. give us some solutions to this complex problem.

Data
• Vectors: collections of features, e.g. height, weight, blood pressure, age, ...; categorical variables can be mapped into vectors.
• Matrices: images, movies, remote sensing and satellite data (multispectral).
• Strings: documents, gene sequences.
• Structured objects: XML documents, graphs.

Let Us Summarize!!

Classification (reminder)
Y = g(X), g: X → Y.
• X can be anything: continuous (ℝ, ℝ^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (trees, strings, ...), ...
• Y is discrete: {0,1} (binary), {1,...,k} (multi-class), trees etc. (structured).

Classification (reminder)
Methods: Perceptron, Logistic Regression, Support Vector Machine, the kernel trick, Decision Tree, Random Forest.

Regression
Y = g(X), g: X → Y.
• X can be anything: continuous (ℝ, ℝ^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (trees, strings, ...), ...
• Y is continuous: ℝ, ℝ^d (but not always).
Methods: Perceptron, normal (linear) regression, support vector regression, the kernel trick, GLM.
Which is better? (Illustrative figures.)

Some Necessary Terms
• Training data: the (X, Y) we are given.
• Testing data: the (X, Y) we will see in the future.
• Training error: the average value of the loss on the training data.
• Test error: the average value of the loss on the test data.
• How do we calculate them?

Our Real Goal
• What is our real goal? To do well on the data we have seen already? Usually not; we already have the answers for that data.
• We want to perform well on future, unseen data, so ideally we would like to minimize the test error. How can we do this if we don't have test data? The probabilistic framework comes to our rescue!
• Idea: look for regularities in the observed phenomenon; these can be generalized from the observed past to the future. (A small train/test sketch follows below.)
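A minimal sketch (assuming Python with NumPy; not part of the original slides) of the terms just defined: fit a model on training data, then compare the average loss on the training data with the average loss on held-out "future" data.

```python
# Training error vs. test error with squared loss, on a simple noisy curve.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)            # the unknown regularity

x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)  # data we are given
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(0, 0.2, 1000)  # "future" data we will see

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training error {train_err:.3f}, test error {test_err:.3f}")
```

Flexible models can drive the training error down while the test error, the quantity we actually care about, goes up; this is exactly the gap the next slides address.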
No Free Lunch
• If there is no assumption on how the past is related to the future, prediction is impossible.
• If there is no restriction on the possible phenomena, generalization is impossible.
• We need to make assumptions.
  – Simplicity is not absolute.
  – Data will never replace knowledge.
  – Generalization = data + knowledge.
• Two types of assumptions:
  – Future observations are related to past ones: stationarity of the phenomenon.
  – Constraints on the phenomenon: a notion of simplicity.

Probabilistic Model
• Relationship between past and future observations: sampled independently from the same distribution.
  – Independence: each new observation yields maximum information.
  – Identical distribution: the observations give information about the underlying phenomenon (here, a probability distribution).
• We consider an input space X and an output space Y.
• Assumption: the pairs (X, Y) ∈ X × Y are distributed according to P (unknown).
• Data: we observe a sequence of n i.i.d. pairs (Xi, Yi) sampled according to P.
• Goal: construct a function g: X → Y which predicts Y from X.

Problem Revisited
• The distribution of baby weights at a hospital is ~ N(3400, 360000).
• Your "best guess" at a random baby's weight, given no information about the baby, is what? 3400 grams???
• But what if you have relevant information? Can you make a better guess?

At 30 Weeks...
(Figure: Y = birthweight against X = gestation time in weeks, with the point (x, y) = (30, 3000) marked.)
• The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
• In math-speak: E(Y | X = 30 weeks) = 3000 grams. Note the conditional expectation.
• Exercise: show that $V(Y) > V[E(Y \mid X)]$ if Y and X are not independent.
• $R(g) = \iint (y - g(x))^2\, p(x, y)\, dx\, dy$ is minimized when $g(x) = E(Y \mid X = x)$.

Risk Functional
Risk functional:
$$R_{L,P}(g) = \int_{X \times Y} L(x, y, g(x))\, dP(x, y) = \int_X \int_Y L(x, y, g(x))\, dP(y \mid x)\, dP_X(x)$$
Population regression functional/classifier g*:
$$R_{L,P}(g^*) = \inf_{g: X \to Y} R_{L,P}(g)$$
• P is chosen by nature; L is chosen by the scientist.
• Both R_{L,P}(g*) and g* are unknown.
• From a sample D, we will select g_D by a learning method (???).

Empirical Risk Minimization
Empirical risk functional:
$$R_{L,P_n}(g) = \int_{X \times Y} L(x, y, g(x))\, dP_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, g(x_i))$$
(A small numerical sketch of this functional appears after the next few slides.)

Problems of Empirical Risk Minimization

What Can We Do?
• We can restrict the set of functions over which we minimize the empirical risk functional.
• We can modify the criterion to be minimized (e.g. by adding a penalty for "complicated" functions).
• We can combine the two.
Regularization; structural risk minimization.

Probabilistic vs. ERM Modeling

We Do Need More
• We want R_{L,P}(g_D) to be very close to R_{L,P}(g*).
• Does closeness of R_{L,P}(g_D) to R_{L,P}(g*) imply closeness of g_D to g*?
• Both R_{L,P}(g_D) and g_D should be smooth in P.
• To measure the closeness of R_{L,P}(g_D) to R_{L,P}(g*) we need limit theorems and inequalities of probability theory.
• To ensure convergence of g_D to g* we need a very strong form of convergence of R_{L,P}(g_D) to R_{L,P}(g*).

We Do Need More
• To check the smoothness of g_D with respect to P we need tools of functional calculus.
• How do we find g_D? Does g* exist? It is a problem of optimization in function space.
• Universal consistency: $R_{L,P}(g_D) \xrightarrow{P} R_{L,P}(g^*)$ as $n \to \infty$, for every P.
• SVMs are often universally consistent. Rate of convergence??
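A minimal numerical sketch (assuming Python with NumPy; not from the original slides) of the empirical risk functional $R_{L,P_n}(g) = \frac{1}{n}\sum_i L(x_i, y_i, g(x_i))$ with squared loss, minimized over a deliberately restricted set of candidate functions g, as in ERM over a constrained function class.

```python
# Empirical risk (average squared loss) of a few candidate functions g on a sample D,
# illustrating ERM over a small, restricted function class.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.5 * x + rng.normal(0, 0.5, n)  # data drawn from the unknown P

def empirical_risk(g, x, y):
    """R_{L,P_n}(g) = (1/n) * sum_i L(x_i, y_i, g(x_i)) with L = squared loss."""
    return np.mean((y - g(x)) ** 2)

# The restricted set of functions over which we minimize the empirical risk.
candidates = {f"g(x) = {a:.1f} * x": (lambda x, a=a: a * x)
              for a in np.linspace(0.5, 2.5, 5)}

risks = {name: empirical_risk(g, x, y) for name, g in candidates.items()}
for name, r in risks.items():
    print(f"{name}: empirical risk = {r:.3f}")
print("ERM choice:", min(risks, key=risks.get))
```

Restricting the class (here to a handful of linear functions) is one of the two remedies named on the "What Can We Do?" slide; the other is to add a complexity penalty to the criterion being minimized.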
Consistency of ERM
(Figure: ERM is consistent if both the true risk $R(g_n^*)$ and the empirical risk $R_{emp}(g_n^*)$ of the empirical minimizer $g_n^*$ converge to $\inf_g R(g)$ as $n \to \infty$.)

Key Theorem of VC Theory
• For bounded loss functions, the ERM principle is consistent if and only if the empirical risk $R_{emp}(\alpha)$ converges uniformly to the true risk $R(\alpha)$ in the following sense:
$$\lim_{n \to \infty} P\left[\sup_{\alpha} \left(R(\alpha) - R_{emp}(\alpha)\right) > \varepsilon\right] = 0, \quad \forall \varepsilon > 0$$
• Consistency is thus determined by the worst-case approximating function, the one providing the largest discrepancy between the empirical risk and the true risk.
Note: this condition is not useful in practice. We need conditions for consistency in terms of general properties of a set of loss functions (approximating functions).

Empirical Risk Minimization vs. Structural Risk Minimization
• Empirical risk minimization is not good from the viewpoint of generalization error.
• Vapnik and Chervonenkis (1995, 1998) studied under what conditions uniform convergence of the empirical risk to the expected risk takes place. The results are formulated in terms of three important quantities (the VC entropy, the annealed VC entropy and the growth function) related to two topological concepts, the ε-net and the covering number. These concepts lead to the working ideas of the VC dimension and SRM.

Empirical Risk Minimization vs. Structural Risk Minimization
• Principle: minimize an upper bound on the true risk.
  True risk ≤ empirical risk + complexity penalty
• With probability 1 − δ,
$$R(g) \le R_{emp}(g) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}$$
where h is the VC dimension.

Structural Risk Minimization

Kernel Methods: Heuristic View
Traditional or non-traditional?

Steps for Kernel Methods
Data matrix → kernel matrix $K = [k(x_i, x_j)]$, a positive semi-definite matrix → pattern function $f(x) = \sum_i \alpha_i k(x_i, x)$.
Which k???? Why p.s.d.??

Kernel Methods: Heuristic View
(Figure: the original space mapped to the feature space.)

Kernel Methods: Basic Ideas
• The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space.
• The expectation is that the feature space has a much higher dimension than the input space.
• The feature space has an inner product: $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.

Kernel Methods: Heuristic View (Form of Functions)
• So kernel methods use linear functions in a feature space, of the form $f(x) = \langle w, \Phi(x) \rangle$.
• For regression this could be the function itself; for classification it requires thresholding, e.g. $\operatorname{sign}(f(x))$.

Kernel Methods: Heuristic View (Feature Spaces)
$\Phi: x \mapsto \Phi(x)$, $\mathbb{R}^d \to F$, a non-linear mapping to a feature space F, which may be
1. a high-dimensional space,
2. an infinite-dimensional countable space such as $\ell_2$,
3. a function space (Hilbert space).
Example: $\Phi(x, y) = (x^2, y^2, \sqrt{2}\,xy)$.

Kernel Trick
• Note: in the dual representation we used the Gram matrix to express the solution.
• Kernel trick: replace $x$ by $\Phi(x)$, so that $G_{ij} = \langle x_i, x_j \rangle$ becomes $G_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j)$.
• If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features Φ(x).

Gist of Kernel Methods
• Through the choice of a kernel function we choose a Hilbert space.
• We then apply the linear method in this new space, without increasing the computational complexity, using the mathematical niceties of this space.

Acknowledgement
Alexander J. Smola, Statistical Machine Learning Program, Canberra, ACT 0200, Australia, [email protected]
Olivier Bousquet, Statistical Learning Theory, Department of Empirical Inference, Max Planck Institute of Biological Cybernetics, [email protected], Machine Learning Summer School, August 2003
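To make the kernel-trick slides above concrete, here is a minimal sketch (assuming Python with NumPy; not part of the original lecture) that builds a Gaussian (RBF) Gram matrix and fits a pattern function of the stated form $f(x) = \sum_i \alpha_i k(x_i, x)$ by kernel ridge regression, never computing the features Φ(x) explicitly.

```python
# Kernel ridge regression using only the Gram matrix K[i, j] = k(x_i, x_j).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)  # non-linear target

lam = 0.1
K = rbf_kernel(X, X)                                   # positive semi-definite Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # linear method, regularized, in feature space

X_new = np.array([[0.0], [1.5]])
f_new = rbf_kernel(X_new, X) @ alpha                   # f(x) = sum_i alpha_i k(x_i, x)
print("predictions at 0.0 and 1.5:", f_new, "true values:", np.sin(X_new[:, 0]))
```

The only quantities the algorithm touches are kernel evaluations, which is exactly the point of the "Steps for Kernel Methods" and "Kernel Trick" slides: choosing k implicitly chooses the Hilbert space in which the linear method is applied.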