Download X - LIACS Data Mining Group

Distributions cont.:
Continuous and Multivariate
Distribution, numeric attribute



Continuous data potentially has infinite domain

probability of specific values is zero

probabilities over intervals, e.g. (-∞, x]
Cumulative distribution function (CDF)
 F_X(x) = P(X ≤ x)
Probability density function (PDF)



first derivative of CDF
relative density of points for each value
density is not probability
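The relation between CDF and PDF can be checked numerically. The sketch below assumes a normal distribution purely for illustration (the slide does not name one); the function names are hypothetical. It shows a probability over an interval as a difference of CDF values, the PDF as the slope of the CDF, and a density value above 1, which a probability could never be.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F_X(x) = P(X <= x) for a normal distribution (illustrative choice)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the same normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def numeric_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Probability over an interval: P(-1 < X <= 1) = F(1) - F(-1)
p_interval = normal_cdf(1.0) - normal_cdf(-1.0)

# The PDF is the first derivative of the CDF
slope_at_half = numeric_derivative(normal_cdf, 0.5)
pdf_at_half = normal_pdf(0.5)

# Density is not probability: a narrow distribution has density > 1
peak_density = normal_pdf(0.0, sigma=0.1)

print(p_interval, slope_at_half, pdf_at_half, peak_density)
```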
Histograms



Estimate density in a discrete way
Define cut points and count occurrences within bins
How to choose cut points
 equal width: cut the domain (min to max) into k intervals of equal size
 equal height: select cut points such that each of the k bins contains (approximately) n/k data points
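The two ways of choosing cut points can be sketched with NumPy. The data values and the helper names are made up for illustration; equal-height edges are taken from sample quantiles, which is one common way to get roughly n/k points per bin (heavily tied data can still produce uneven bins).

```python
import numpy as np

def equal_width_edges(values, k):
    """k bins of equal size between min and max (k + 1 edges)."""
    return np.linspace(np.min(values), np.max(values), k + 1)

def equal_height_edges(values, k):
    """k bins with roughly n/k points each, using sample quantiles as edges."""
    return np.quantile(values, np.linspace(0.0, 1.0, k + 1))

# Illustrative, skewed toy data (n = 12)
values = np.array([1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 34], dtype=float)
k = 4

width_edges = equal_width_edges(values, k)
height_edges = equal_height_edges(values, k)

width_counts, _ = np.histogram(values, bins=width_edges)
height_counts, _ = np.histogram(values, bins=height_edges)

print("equal width :", width_edges, width_counts)    # counts 9, 1, 1, 1
print("equal height:", height_edges, height_counts)  # counts 3, 3, 3, 3
```

On skewed data like this, equal-width bins put most points in the first bin, while equal-height bins adapt their width to the data.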
Kernel Density Estimation


Estimating the density (of the population) from the sample
Observed data is smoothed over the numeric domain by means of a kernel (often a normal distribution)
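A minimal sketch of kernel density estimation with a normal kernel, written directly in NumPy. The sample, the bandwidth of 0.5 and the function name are assumptions chosen for illustration; in practice the bandwidth has to be tuned.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of normal kernels centred on each observed value."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardised distance from grid point i to sample point j
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

sample = np.array([1.0, 1.5, 2.0, 4.0, 4.2])   # illustrative data
grid = np.linspace(-4.0, 10.0, 1401)
density = gaussian_kde_sketch(sample, grid, bandwidth=0.5)

# A density should integrate to (approximately) 1 over the domain
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```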
Entropy of continuous attribute



Differential entropy
Generalisation of entropy to continuous case
somewhat problematic
Uniform distribution over [0, a]: H(X) = lg(a)
 a = ½ => H(X) = lg(½) = -1, a negative entropy?
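The uniform example can be reproduced by integrating -f(x) lg f(x) numerically. The integration routine below is a simple midpoint rule written for illustration, and its name is hypothetical.

```python
import math

def differential_entropy_bits(pdf, lo, hi, steps=10000):
    """Midpoint-rule approximation of -integral of f(x) * lg f(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        f = pdf(lo + (i + 0.5) * dx)
        if f > 0.0:
            total -= f * math.log2(f) * dx
    return total

def uniform_pdf(a):
    """Density of the uniform distribution over [0, a]."""
    return lambda x: 1.0 / a if 0.0 <= x <= a else 0.0

for a in (2.0, 1.0, 0.5):
    h = differential_entropy_bits(uniform_pdf(a), 0.0, a)
    print(a, h, math.log2(a))
```

For a = ½ the density is 2 everywhere on the interval, so lg f(x) is positive and the result is negative, which cannot happen for the entropy of a discrete attribute.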
Multivariate Distributions
Joint distributions


How frequent are combinations of values?
Confusion matrix (contingency table, cross table)


counts each combination
complete information
                X
             T      F
    Y   T   0.42   0.13   0.55
        F   0.12   0.33   0.45
            0.54   0.46   1.0    (bottom row: univariate distribution of X, the marginal distribution)
2 attributes: how informative is one attribute about the other?
Quantifying information between attributes: joint entropy, mutual information, information gain, …
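As a sketch of how these quantities follow from the joint table, the code below takes the four cell values from this slide, recovers both marginals, and computes joint entropy and mutual information in bits. Treating the columns as X and the rows as Y is an assumption about the table layout; the function name is hypothetical.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Cell values from the slide (assumed layout: rows = Y, columns = X)
joint = np.array([[0.42, 0.13],
                  [0.12, 0.33]])

p_x = joint.sum(axis=0)   # marginal of X: column sums, 0.54 and 0.46
p_y = joint.sum(axis=1)   # marginal of Y: row sums, 0.55 and 0.45

h_x = entropy_bits(p_x)
h_y = entropy_bits(p_y)
h_xy = entropy_bits(joint)            # joint entropy H(X, Y)
mutual_information = h_x + h_y - h_xy # I(X; Y) = H(X) + H(Y) - H(X, Y)

print(p_x, p_y, h_xy, mutual_information)
```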
Some joint distributions

X and Y are independent
 0.6 · 0.8 = 0.48
 0.6 · 0.2 = 0.12
 0.4 · 0.8 = 0.32
 0.4 · 0.2 = 0.08

             T      F
        T   0.48   0.32   0.8
        F   0.12   0.08   0.2
            0.6    0.4    1.0

Y depends on X
 higher counts along diagonal
 both diagonals possible

             T      F
        T   0.42   0.13   0.55
        F   0.12   0.33   0.45
            0.54   0.46   1.0

X fully determines Y

             T      F
        T   0.4    0      0.4
        F   0      0.6    0.6
            0.4    0.6    1.0
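The three cases can be told apart mechanically. The sketch below tests independence by comparing each cell with the product of its two marginals, and tests full determination by checking that each column has exactly one non-zero cell. The function names and tolerances are assumptions for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """True if every cell equals the product of its row and column marginals."""
    joint = np.asarray(joint, dtype=float)
    row_marginal = joint.sum(axis=1)
    col_marginal = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(row_marginal, col_marginal), atol=tol))

def column_determines_row(joint, tol=1e-12):
    """True if each column has exactly one non-zero cell."""
    joint = np.asarray(joint, dtype=float)
    nonzero_per_column = (joint > tol).sum(axis=0)
    return bool(np.all(nonzero_per_column == 1))

# The three tables from the slide (cell values only, margins omitted)
independent = [[0.48, 0.32], [0.12, 0.08]]
dependent = [[0.42, 0.13], [0.12, 0.33]]
determined = [[0.4, 0.0], [0.0, 0.6]]

for name, table in [("independent", independent),
                    ("dependent", dependent),
                    ("determined", determined)]:
    print(name, is_independent(table), column_determines_row(table))
```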
Capturing multivariate continuous distributions
 2 dimensions
 Problematic in higher dimensions
Joint distribution over numeric × binary
 Of specific relevance in Data Mining: classification
 How does the class (T/F) depend on a numeric attribute?
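One simple way to look at a numeric × binary joint distribution is to bin the numeric attribute and estimate the fraction of class T per bin. The data and the cut points below are invented for illustration only; this is a sketch of the idea, not a method taken from the slides.

```python
import numpy as np

# Illustrative toy data: a numeric attribute and a binary class
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
is_true = np.array([False, False, False, True, False, True, True, True])

# Hypothetical cut points: two bins, [0, 2) and [2, 4]
edges = np.array([0.0, 2.0, 4.0])

total_per_bin, _ = np.histogram(x, bins=edges)
true_per_bin, _ = np.histogram(x[is_true], bins=edges)

# Estimated P(class = T | x falls in bin)
p_true_given_bin = true_per_bin / total_per_bin

# Joint table over (bin, class), as relative frequencies
joint = np.column_stack([true_per_bin, total_per_bin - true_per_bin]) / len(x)

print(p_true_given_bin)
print(joint)
```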