Distributions cont.: Continuous and Multivariate

Distribution, numeric attribute
  Continuous data potentially has an infinite domain
    the probability of any specific value is zero
    probabilities are defined over intervals, e.g. (-∞, x]
  Cumulative distribution function (CDF)
    F_X(x) = P(X ≤ x)
  Probability density function (PDF)
    first derivative of the CDF
    relative density of points for each value
    density is not probability

Histograms
  Estimate density in a discrete way
  Define cut points and count occurrences within bins
  How to choose cut points
    equal width: cut the domain (min to max) into k equal-size intervals
    equal height: select cut points such that each of the k bins contains (approximately) n/k data points

Kernel Density Estimation
  Estimating the density (of the population) from the sample
  Observed data is smoothed over the numeric domain by means of a kernel (often a normal distribution)

Entropy of a continuous attribute
  Differential entropy
  Generalisation of entropy to the continuous case is somewhat problematic
  Uniform distribution over [0, a]: H(X) = lg(a)
  a = ½ => H(X) = lg(½) = -1 ? (a negative value, which cannot occur for discrete entropy)

Multivariate Distributions

Joint distributions
  How frequent are combinations of values?
  Confusion matrix (contingency table, cross table)
    counts each combination
    complete information

  Joint distribution of X and Y (last row and last column are totals):

             T      F      total
    T        0.42   0.13   0.55
    F        0.12   0.33   0.45
    total    0.54   0.46   1.0

  The totals in the margins give the univariate distribution of X (marginal distribution), and likewise of Y
  2 attributes: how informative is one attribute about the other?
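The marginal totals of the joint table over X and Y (cells 0.42, 0.13, 0.12, 0.33) can be recomputed from the cells alone. The following is a minimal sketch rather than part of the original slides. It assumes that rows index X and columns index Y, which the flattened table does not state, and the helper names `marginal` and `conditional` are chosen here for illustration only.

```python
# Joint distribution from the table above, as relative frequencies.
# Assumption: the first key component is X (rows), the second is Y (columns).
joint = {
    ("T", "T"): 0.42, ("T", "F"): 0.13,
    ("F", "T"): 0.12, ("F", "F"): 0.33,
}

def marginal(joint, axis):
    """Sum the joint probabilities over the other attribute.

    axis=0 gives the distribution of X, axis=1 the distribution of Y.
    """
    totals = {}
    for key, p in joint.items():
        totals[key[axis]] = totals.get(key[axis], 0.0) + p
    return totals

def conditional(joint, x):
    """P(Y | X = x): take the row belonging to x and renormalise it."""
    row = {y: p for (xv, y), p in joint.items() if xv == x}
    total = sum(row.values())
    return {y: p / total for y, p in row.items()}

p_x = marginal(joint, 0)               # row totals of the table
p_y = marginal(joint, 1)               # column totals of the table
p_y_given_x_true = conditional(joint, "T")
```

Comparing `p_y_given_x_true` with `p_y` gives a first impression of how informative X is about Y: if the two attributes were independent, conditioning on X would leave the distribution of Y unchanged.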
  Quantifying information between attributes: joint entropy, mutual information, information gain, …

Some joint distributions
  X and Y are independent
    0.48 = 0.6·0.8    0.32 = 0.4·0.8
    0.12 = 0.6·0.2    0.08 = 0.4·0.2

             T      F      total
    T        0.48   0.32   0.8
    F        0.12   0.08   0.2
    total    0.6    0.4    1.0

  Y depends on X
    higher counts along the diagonal
    both diagonals possible

             T      F      total
    T        0.42   0.13   0.55
    F        0.12   0.33   0.45
    total    0.54   0.46   1.0

  X fully determines Y

             T      F      total
    T        0.4    0      0.4
    F        0      0.6    0.6
    total    0.4    0.6    1.0

Capturing multivariate continuous distributions
  2 dimensions
  Problematic in higher dimensions

Joint distribution over numeric x binary
  Of specific relevance in Data Mining
    classification
  How does the class (T/F) depend on a numeric attribute?
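To make the contrast between the three joint tables above concrete (independent, dependent, fully determined), mutual information can be computed for each. This is a minimal sketch, not taken from the slides: it uses the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y) with entropies in bits, and the function and variable names are illustrative assumptions.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(table):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table given as a list of rows."""
    row_totals = [sum(row) for row in table]         # one marginal
    col_totals = [sum(col) for col in zip(*table)]   # the other marginal
    cells = [p for row in table for p in row]        # joint distribution
    return entropy(row_totals) + entropy(col_totals) - entropy(cells)

# The three 2x2 joint distributions from the text (cells only, no totals).
independent = [[0.48, 0.32], [0.12, 0.08]]
dependent = [[0.42, 0.13], [0.12, 0.33]]
determined = [[0.40, 0.00], [0.00, 0.60]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
mi_determined = mutual_information(determined)
```

Up to floating-point error, the independent table yields zero mutual information, the fully determined table yields the full entropy of either attribute, and the dependent table falls in between. Because mutual information is symmetric, the result does not depend on which attribute is laid out along the rows.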