DATA MINING: from data to information
Ronald Westra, Dep. Mathematics, Maastricht University

DATA ANALYSIS AND UNCERTAINTY
Data Mining Lecture V [Chapter 4, Hand, Mannila, Smyth]

4.1 Introduction
4.2 Probability and its interpretation
4.3 (Multiple) random variables
4.4 Samples, models, and patterns

4.2 Probability and its interpretation
Probability, possibility, chance, randomness, luck, hazard, fate, ...

4.2 Probability and its interpretation
Probability theory: interpretation
Probability calculus: mathematical manipulation and computation

4.2 Probability and its interpretation
Frequentist view: probability is objective: Prob = #times / #total. EXAMPLE: traditional probability theory.
Subjective view: probability is a degree of belief. EXAMPLE: Bayesian statistics.

4.3 Random Variables and their Relationships
Random variables [4.3]: random variable X, multivariate random variables, marginal density.

4.3 Random Variables and their Relationships
Conditional density & dependency
* example: supermarket purchases

RANDOM VARIABLES
Example: supermarket purchases
X = n customers x p products; X(i,j) = Boolean variable: "Has customer #i bought a product of type j?"
nA = sum(X(:,A)) is the number of customers that bought product A
nB = sum(X(:,B)) is the number of customers that bought product B
nAB = sum(X(:,A).*X(:,B)) is the number of customers that bought both products A and B

RANDOM VARIABLES
Example: supermarket purchases

customer#   1   2   3   4
A           1   0   1   1
B           0   1   1   1

pA = 3/4, pB = 3/4, pAB = 2/4, so pA·pB ≠ pAB: the purchases of A and B are not independent.

RANDOM VARIABLES
(Conditionally) independent: p(x,y) = p(x)·p(y), i.e. p(x|y) = p(x)

Markov Models

RANDOM VARIABLES
Simpson's paradox
Observation of two different treatments for several categories of patients:

Category   Treatment A   Treatment B
Old        0.20          0.33
Young      0.53          1.00
Total      0.50          0.40

RANDOM VARIABLES
Explanation: the sets are POOLED:

Category   Treatment A        Treatment B
Old        2 / 10   = 0.20    30 / 90   = 0.33
Young      48 / 90  = 0.53    10 / 10   = 1.00
Total      50 / 100 = 0.50    40 / 100  = 0.40

Both for Old and Young individually, treatment B seems best, but in total treatment A seems superior.

RANDOM VARIABLES
- 1st-order Markov processes [example: text analysis and reconstruction with Markov chains]
Demo:
dir *.txt
type 'werther.txt'
[m0,m1,m2] = Markov2Analysis('werther.txt');
CreateMarkovText(m0,m1,m2,3,80,20);
Corpora: Dutch, German, French, English, MATLAB, European
Correlation, dependency, causation
Example 4.3

SAMPLING
Sampling [4.4]
- Large numbers of observations: central limit theorem: normally distributed
- Small sample size: modeling: OK; pattern recognition: not
Figure 4.1
Role of model parameters: the probability of observing data D = {x(1), x(2), ..., x(n)} with independent observations is

  P(D | M, \theta) = \prod_{i=1}^{n} p(x(i) | M, \theta)

with a fixed parameter set θ.

ESTIMATION
Estimation [4.5] - Properties
1. \hat{\theta} is an estimator of θ; it depends on the sample, so it is itself a random variable with an expectation E[\hat{\theta}], etc.
2. Bias: bias(\hat{\theta}) = E[\hat{\theta}] - \theta; unbiased: bias(\hat{\theta}) = 0, i.e. no systematic error.
3. Consistent estimator: asymptotically unbiased.
4. Variance: var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]
5. Best unbiased estimator: the unbiased estimator with the smallest variance.
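The bias and variance properties above can be checked numerically. The following MATLAB/Octave sketch is an added illustration (it is not one of the lecture's demo files): it compares the maximum-likelihood variance estimator (divide by n) with the unbiased estimator (divide by n-1) on small samples from a normal distribution, exposing the O(1/n) bias of the former.

% Added illustration: empirical bias and variance of two variance estimators
% for samples drawn from N(mu, sigma2).
mu = 0; sigma2 = 4;          % true parameters
n  = 10;                     % small sample size
R  = 20000;                  % number of simulated samples
est_ml = zeros(R,1);         % ML estimator of sigma2: divide by n   (biased)
est_ub = zeros(R,1);         % unbiased estimator:     divide by n-1
for r = 1:R
    x = mu + sqrt(sigma2) * randn(n,1);
    est_ml(r) = sum((x - mean(x)).^2) / n;
    est_ub(r) = sum((x - mean(x)).^2) / (n - 1);
end
bias_ml = mean(est_ml) - sigma2   % approx. -sigma2/n: an O(1/n) systematic error
bias_ub = mean(est_ub) - sigma2   % approx. 0: no systematic error
var_ml  = var(est_ml)             % variance of the estimator itself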
Maximum Likelihood Estimation
- Expression 4.7:  L(\theta | D) = \prod_{i=1}^{n} f(x(i) | \theta)
- L is a scalar function of θ; drop all factors not containing θ.
- The value of θ with the highest L(θ) is the maximum likelihood estimate (MLE):
  \hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)
- Example 4.4 + Figure 4.2
Demo: MLbinomial

Maximum Likelihood Estimation
- Example 4.5: normal distribution, log L(θ)
- Example 4.6: sufficient statistics: a nice idea, but practically infeasible
- MLE: biased but consistent; the bias is O(1/n)
- In practice the log-likelihood l(θ) = log L(θ) is more attractive
- Multiple parameters: complex but relevant: EM-algorithm [treated later]
- Determination of confidence levels: Example 4.8

Maximum Likelihood Estimation
- Sampling using the central limit theorem (CLT) with large samples
- Bootstrap method (Example 4.9): use the observed data as
  i. the "real" distribution F,
  ii. to generate many sub-samples,
  iii. estimate the properties of interest in these sub-samples using the "known" distribution F,
  iv. compare the sub-sample estimates with the "real" estimate and apply the CLT.

BAYESIAN ESTIMATION
- Not the frequentist but the subjective approach: D is given, and the parameters θ are random variables.
- p(θ) represents our degree of belief in the value of θ; in practice this means that we have to make all sorts of (wild) assumptions about p.
- Prior distribution: p(θ); posterior distribution: p(θ|D).
- Modification of the prior into the posterior by Bayes' theorem:

  p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)} = \frac{p(D | \theta)\, p(\theta)}{\int p(D | \theta)\, p(\theta)\, d\theta}

BAYESIAN ESTIMATION
- MAP: maximum a posteriori method; <θ>: mean or mode of the posterior distribution.
- MAP relates to MLE. Example 4.10.
- Bayesian approach: rather than point estimates, keep full knowledge of the uncertainties in the model(s).
- This requires massive computation, which is why the approach has only become popular in the last decade.

BAYESIAN ESTIMATION
- Sequential updating:
  p(\theta | D) \propto p(D | \theta)\, p(\theta)                                      (equation 4.10)
  p(\theta | D_1, D_2) \propto p(D_2 | \theta)\, p(D_1 | \theta)\, p(\theta)           (equation 4.15)
- This is reminiscent of a Markov process. Algorithm:
  * start with p(θ)
  * new data D: use eq. 4.10 to obtain the posterior distribution
  * new data D2: use eq. 4.15 to obtain the new posterior distribution
  * repeat ad nauseam

BAYESIAN ESTIMATION
- Large problem: different observers may define different p(θ); the choice is subjective, hence: different results! The analysis depends on the choice of p(θ).
- Often a consensus prior is used, e.g. the Jeffreys prior (equations 4.16 and 4.17).
- Computation of credibility intervals.
- Markov chain Monte Carlo (MCMC) methods for estimating posterior distributions.
- Similarly: Bayesian belief networks (BBN).

BAYESIAN ESTIMATION
- Bayesian approach: uncertainty in the model + uncertainty in the parameters.
- MLE: the aim is a point estimate. Bayesian: the aim is the posterior distribution.
- Bayesian estimate = a weighted estimate over all models M and all parameters θ, where the weights are the likelihoods of the parameters in the different models.
- PROBLEM: these weights are difficult to estimate.
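As a concrete illustration of the prior-to-posterior modification and of the sequential updating scheme (equations 4.10 and 4.15), here is a small MATLAB/Octave sketch. It is an added example, not the book's Example 4.10, and it assumes Bernoulli observations with a Beta prior on the success probability θ; because this prior is conjugate, each update simply adds the observed successes and failures to the Beta counts, and the posterior after one batch becomes the prior for the next.

% Added illustration: sequential Bayesian updating of a Bernoulli success
% probability theta under an (assumed) Beta(a,b) prior.
a = 1; b = 1;                                  % Beta(1,1): uniform prior belief p(theta)
batches = {[1 0 1 1], [0 1 1], [1 1 0 1 1]};   % data arriving in chunks D1, D2, D3
for i = 1:numel(batches)
    D = batches{i};
    a = a + sum(D);                            % posterior p(theta | D1..Di) = Beta(a,b) ...
    b = b + sum(1 - D);                        % ... becomes the prior for the next batch
    fprintf('after batch %d: posterior mean %.3f, MAP %.3f\n', ...
            i, a / (a + b), (a - 1) / (a + b - 2));
end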
PROBABILISTIC MODEL-BASED CLUSTERING USING MIXTURE MODELS
Data Mining Lecture VI [4.5, 8.4, 9.2, 9.6, Hand, Mannila, Smyth]

A probability mixture model
A mixture model is a formalism for modeling a probability density function as a sum of parameterized functions. In mathematical terms:

  p_X(x) = \sum_{k=1}^{K} a_k\, h(x | \lambda_k)

where p_X(x) is the modeled probability density function, K is the number of components in the mixture model, and a_k is the mixture proportion of component k. By definition, 0 < a_k < 1 for all k = 1...K, and:

  \sum_{k=1}^{K} a_k = 1

A probability mixture model
h(x | λk) is a probability distribution parameterized by λk. Mixture models are often used when we know the form of h and can sample from pX(x), but we would like to determine the ak and λk values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.

A common approach for 'decomposing' a mixture model
It is common to think of mixture modeling as a missing-data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented by their membership in the individual model distributions.

Probabilistic Model-Based Clustering using Mixture Models
The EM-algorithm [book section 8.4]

Mixture Decomposition: The 'Expectation-Maximization' Algorithm
The expectation-maximization algorithm computes the missing memberships of data points in our chosen distribution model. It is an iterative procedure that starts with initial parameters for our model distribution (the ak's and λk's of the model listed above). The estimation process then alternates between two steps: the expectation step and the maximization step.

The 'Expectation-Maximization' Algorithm
The expectation step
With initial guesses for the parameters of our mixture model, we compute a "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point.

The 'Expectation-Maximization' Algorithm
The maximization step
With the expectation values for group membership in hand, we recompute plug-in estimates of the distribution parameters. For the mixing coefficient a_k this is simply the fractional membership of all data points in component k.

EM-algorithm for Clustering
Suppose we have data D, a model with parameters θ, and hidden variables H. Interpretation: H = the class label. Log-likelihood of the observed data:

  l(\theta) = \log p(D | \theta) = \log \sum_{H} p(D, H | \theta)

EM-algorithm for Clustering
With p the probability over the data D, let Q be an (unknown) distribution over the hidden variables H. Then the log-likelihood satisfies:

  l(\theta) = \log \sum_{H} p(D, H | \theta)
            = \log \sum_{H} Q(H)\, \frac{p(D, H | \theta)}{Q(H)}
            \geq \sum_{H} Q(H) \log \frac{p(D, H | \theta)}{Q(H)}          [* Jensen's inequality]
            = \sum_{H} Q(H) \log p(D, H | \theta) + \sum_{H} Q(H) \log \frac{1}{Q(H)}  =:  F(Q, \theta)

[Figure: Jensen's inequality. For a concave-down function, the expected value of the function is less than the function of the expected value. The gray rectangle along the horizontal axis represents the probability distribution of x, which is uniform for simplicity, but the general idea applies to any distribution.]

EM-algorithm
So F(Q, θ) is a lower bound on the log-likelihood function l(θ). EM alternates between:
E-step: maximising F with respect to Q with θ fixed, and
M-step: maximising F with respect to θ with Q fixed.

EM-algorithm
E-step:  Q^{k+1} = \arg\max_{Q} F(Q, \theta^{k})
M-step:  \theta^{k+1} = \arg\max_{\theta} F(Q^{k+1}, \theta)
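To make the E-step and M-step concrete, the following MATLAB/Octave sketch runs EM on a two-component one-dimensional Gaussian mixture. It is an added illustration, not the lecture's demo code: the partial memberships r(i,k) play the role of Q(H), and the weights, means, and variances play the roles of the a_k and λ_k introduced above.

% Added illustration: EM for a 1-D mixture of K = 2 Gaussians.
x  = [randn(200,1) - 2; 1.5*randn(300,1) + 3];   % synthetic data from two components
N  = numel(x);  K = 2;
a  = [0.5 0.5];                                  % initial mixing proportions a_k
mu = [min(x) max(x)];                            % crude initial means
s2 = [var(x) var(x)];                            % crude initial variances
for it = 1:200
    % E-step: partial membership r(i,k) of point i in component k (the role of Q(H))
    dens = zeros(N, K);
    for k = 1:K
        dens(:,k) = a(k) * exp(-(x - mu(k)).^2 ./ (2*s2(k))) ./ sqrt(2*pi*s2(k));
    end
    r = dens ./ repmat(sum(dens, 2), 1, K);
    % M-step: plug-in re-estimates of the parameters given the memberships
    nk = sum(r, 1);                              % effective number of points per component
    a  = nk / N;                                 % new mixing proportions
    mu = (r' * x)' ./ nk;                        % new means
    for k = 1:K
        s2(k) = sum(r(:,k) .* (x - mu(k)).^2) / nk(k);
    end
end
disp([a; mu; s2])   % rows: estimated weights, means, variances

In practice one would stop on convergence of the lower bound F(Q, θ) (or of the log-likelihood) rather than after a fixed number of iterations, and restart from several initializations, since EM only finds a local maximum.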
Probabilistic Model-Based Clustering using Gaussian Mixtures
[Figure slides: examples of probabilistic model-based clustering with Gaussian mixtures.]

Probabilistic Model-Based Clustering using Mixture Models
Gaussian Mixture Decomposition
Gaussian mixture decomposition is a good classifier. It allows supervised as well as unsupervised learning (finding how many classes are optimal, how they should be defined, ...). However, training is iterative and time-consuming. The idea is to set the position and width of the Gaussian distribution(s) so as to optimize the coverage of the learning samples.

Probabilistic Model-Based Clustering using Mixture Models
[Figure slides: further examples of model-based clustering using mixture models.]

The End