Download X - LIACS Data Mining Group

Distributions cont.: Continuous and Multivariate

Distribution, numeric attribute
- Continuous data potentially has an infinite domain
  - the probability of any specific value is zero
  - probabilities are defined over intervals, e.g. (-∞, x]
- Cumulative distribution function (CDF)
  - F_X(x) = P(X ≤ x)
- Probability density function (PDF)
  - first derivative of the CDF
  - relative density of points for each value
  - density is not probability

Histograms
- Estimate the density in a discrete way
- Define cut points and count occurrences within bins
- How to choose cut points
  - equal width: cut the domain (min -> max) into k intervals of equal size
  - equal height: select cut points such that all k bins contain (approximately) n/k data points

Kernel Density Estimation
- Estimating the density (of the population) from the sample
- Observed data is smoothed over the numeric domain by means of a kernel (often a normal distribution)

Entropy of continuous attribute
- Differential entropy
- Generalisation of entropy to the continuous case is somewhat problematic
- Uniform distribution over [0, a]: H(X) = lg(a)
  - a = ½ => H(X) = lg(½) = -1 (a negative entropy?)

Multivariate Distributions

Joint distributions
- How frequent are combinations of values?
- Confusion matrix (contingency table, cross table)
  - counts each combination
  - complete information

  X \ Y    T      F
  T        0.42   0.13   0.55
  F        0.12   0.33   0.45
           0.54   0.46   1.0

- Margins: univariate distribution of X (marginal distribution)
- 2 attributes: how informative is one attribute about the other?
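The margins of the joint table above can be recomputed by summing over the other attribute, and renormalising a single row shows how the distribution of one attribute shifts with the other. The following is only a minimal sketch: it assumes that rows of the table correspond to X and columns to Y, and the helper names `marginal` and `conditional` are hypothetical, introduced here for illustration.

```python
# Joint distribution from the table above.
# Assumption: the first key component is X (rows), the second is Y (columns).
joint = {
    ("T", "T"): 0.42, ("T", "F"): 0.13,
    ("F", "T"): 0.12, ("F", "F"): 0.33,
}

def marginal(joint, axis):
    """Sum the joint probabilities over the other attribute (axis 0 = X, 1 = Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def conditional(joint, x):
    """P(Y | X = x): take the row for X = x and renormalise it to sum to 1."""
    row = {y: p for (xv, y), p in joint.items() if xv == x}
    total = sum(row.values())
    return {y: p / total for y, p in row.items()}

p_x = marginal(joint, 0)                 # margin printed to the right of the table
p_y = marginal(joint, 1)                 # margin printed below the table
p_y_given_x_t = conditional(joint, "T")  # distribution of Y when X = T
p_y_given_x_f = conditional(joint, "F")  # distribution of Y when X = F

print(p_x, p_y)
print(p_y_given_x_t, p_y_given_x_f)
```

Under this reading, Y = T is considerably more likely when X = T than when X = F, which is a first informal answer to how informative one attribute is about the other.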
Quantifying information between attributes
- joint entropy, mutual information, information gain, …

Some joint distributions
- X and Y are independent
  - each cell is the product of its margins:
    0.48 = 0.6 · 0.8    0.32 = 0.4 · 0.8
    0.12 = 0.6 · 0.2    0.08 = 0.4 · 0.2

           T      F
  T        0.48   0.32   0.8
  F        0.12   0.08   0.2
           0.6    0.4    1.0

- Y depends on X
  - higher counts along the diagonal
  - both diagonals possible

           T      F
  T        0.42   0.13   0.55
  F        0.12   0.33   0.45
           0.54   0.46   1.0

- X fully determines Y

           T      F
  T        0.4    0      0.4
  F        0      0.6    0.6
           0.4    0.6    1.0

Capturing multivariate continuous distributions
- 2 dimensions
- Problematic in higher dimensions

Joint distribution over numeric × binary
- Of specific relevance in Data Mining
  - classification
- How does the class (T/F) depend on a numeric attribute?
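To make the comparison between the three joint tables concrete, the sketch below computes the mutual information of each one, using the standard definition I(X;Y) = Σ p(x,y) lg( p(x,y) / (p(x) p(y)) ). This is an illustration under assumptions rather than the slides' own procedure: the function name `mutual_information` and the list-of-rows table layout are chosen here for convenience.

```python
from math import log2

def mutual_information(table):
    """Mutual information in bits for a joint table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, p in enumerate(row):
            if p > 0:  # cells with probability 0 contribute nothing
                mi += p * log2(p / (row_sums[i] * col_sums[j]))
    return mi

# The three joint distributions from the tables above (margins omitted).
independent = [[0.48, 0.32], [0.12, 0.08]]
dependent = [[0.42, 0.13], [0.12, 0.33]]
determined = [[0.4, 0.0], [0.0, 0.6]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
mi_determined = mutual_information(determined)

for name, value in [("independent", mi_independent),
                    ("dependent", mi_dependent),
                    ("determined", mi_determined)]:
    print(f"{name}: {value:.3f} bits")
```

The independent table yields zero (up to floating-point rounding), the fully determined table yields the entropy of the 0.4/0.6 margin (about 0.971 bits), and the dependent table falls in between.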
