Data Preprocessing
Chris Williams, School of Informatics
University of Edinburgh
Why Data Preprocessing?
Data in the real world is dirty. It is:
• incomplete, e.g. lacking attribute values
• noisy, e.g. containing errors or outliers
• inconsistent, e.g. containing discrepancies in codes or names
GIGO: need quality data to get quality results
Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data
preparation accounts for 60% of the effort in a data mining application.
Reading: Han and Kamber, chapter 3
Major Tasks in Data Preprocessing
• Data cleaning
• Data integration and transformation
• Data reduction
[Figure from Han and Kamber illustrating the major tasks: data cleaning, data integration,
data transformation (e.g. 2, 32, 100, 59, 48 rescaled to 0.02, 0.32, 1.00, 0.59, 0.48),
and data reduction (a table with attributes A1, A2, A3, ..., A126 and transactions
T1, T2, T3, T4, ..., T2000 reduced to attributes A1, A3, ..., A115 and transactions
T1, T4, ..., T1456)]
Data Cleaning Tasks
• Handle missing values
• Identify outliers, smooth out noisy data
– Outliers detected by clustering, or combined computer and human inspection
• Correct inconsistent data
Missing Data
What happens if input data is missing? Is it missing at random (MAR) or is there a
systematic reason for its absence? Let xm denote those values missing, and xp those
values that are present.
If MAR, some “solutions” are
– Model P(xm|xp) and average (correct, but hard)
– Replace data with its mean value (?)
– Look for similar (close) input patterns and use them to infer missing values (crude
version of density model)
– Reference: Statistical Analysis with Missing Data, R. J. A. Little and D. B. Rubin,
Wiley (1987)
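The mean-value and similar-pattern strategies can be sketched in a few lines. The NumPy code below is a minimal illustration, not from the lecture: missing entries (encoded as NaN) are filled either with the column mean or with the mean of the k most similar complete rows; the names impute_mean and impute_knn are my own.

import numpy as np

def impute_mean(X):
    # Replace each NaN with the mean of its column, computed on the observed values
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_knn(X, k=3):
    # Replace NaNs in a row with means over the k nearest complete rows, where
    # distance is measured only on the attributes observed in that row
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]            # rows with no missing values
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        observed = ~np.isnan(X[i])
        d = np.linalg.norm(complete[:, observed] - X[i, observed], axis=1)
        neighbours = complete[np.argsort(d)[:k]]
        X[i, ~observed] = neighbours[:, ~observed].mean(axis=0)
    return X

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
print(impute_mean(X))
print(impute_knn(X, k=2))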
Data Integration
Combines data from multiple sources into a coherent store
• Entity identification problem: identify real-world entities from multiple data
sources, e.g. A.cust-id ≡ B.cust-num
• Detecting and resolving data value conflicts: for the same real-world
entity, attribute values are different, e.g. measurement in different units
Data Transformation
• Normalization, e.g. to zero mean, unit standard deviation:
  new data = (old data − mean) / (std deviation)
or max-min normalization to [0, 1]:
  new data = (old data − min) / (max − min)
• Normalization useful for e.g. k nearest neighbours, or for neural networks
• New features constructed, e.g. with PCA or with hand-crafted features
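As a quick illustration of these two rescalings, here is a small NumPy sketch (the array and function names are my own, not from the slides):

import numpy as np

def zscore(x):
    # zero-mean, unit-standard-deviation normalization
    return (x - x.mean()) / x.std()

def minmax(x):
    # max-min normalization onto [0, 1]
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 32.0, 100.0, 59.0, 48.0])
print(zscore(x))   # mean 0, standard deviation 1
print(minmax(x))   # smallest value -> 0, largest -> 1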
Data Reduction
• Feature selection: Select a minimum set of features x̃ from x so that
– P(class|x̃) closely approximates P(class|x)
– The classification accuracy does not significantly decrease
• Data Compression (lossy): PCA, Canonical variates
• Sampling: choose a representative subset of the data
– Simple random sampling vs stratified sampling
• Hierarchical reduction: e.g. country-county-town
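To make the sampling distinction concrete, here is a minimal NumPy sketch of simple random sampling versus stratified sampling; the helper names and the class-label set-up are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def simple_random_sample(X, m):
    # draw m rows uniformly at random without replacement
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

def stratified_sample(X, y, frac):
    # draw the same fraction of rows from each class, preserving class proportions
    parts = []
    for label in np.unique(y):
        rows = np.where(y == label)[0]
        m = max(1, int(round(frac * len(rows))))
        parts.append(X[rng.choice(rows, size=m, replace=False)])
    return np.vstack(parts)

X = rng.normal(size=(100, 3))
y = np.repeat([0, 1], [90, 10])     # a rare class that simple sampling may miss
print(simple_random_sample(X, 10).shape)
print(stratified_sample(X, y, 0.1).shape)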
Feature Selection
Usually done as part of supervised learning
• Stepwise strategies
(a) Forward selection: Start with no features. Add the one which is the best predictor.
Then add a second one to maximize performance using the first feature and the new one; and
so on until a stopping criterion is satisfied (a sketch is given after this list)
(b) Backwards elimination: Start with all features, delete the one which reduces
performance least, recursively, until a stopping criterion is satisfied
• Forward selection is unable to anticipate interactions between features
• Backwards elimination can suffer from problems of overfitting
• Both are heuristics to avoid considering all subsets of size k of the d features
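A minimal sketch of forward selection, assuming some scoring function score(subset) is available (e.g. cross-validated accuracy of a classifier trained on that subset); the stopping rule and names below are illustrative, not the lecture's specification:

def forward_selection(all_features, score, min_gain=1e-3):
    # Greedily add the feature that most improves score(selected), stopping when the
    # best addition improves the score by less than min_gain
    selected, best_score = [], float("-inf")
    remaining = list(all_features)
    while remaining:
        gains = [(score(selected + [f]), f) for f in remaining]
        new_score, best_f = max(gains)
        if new_score - best_score < min_gain:
            break                       # stopping criterion satisfied
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = new_score
    return selected

# toy additive score; in practice score() would train and evaluate a model
weights = {"x1": 0.5, "x2": 0.3, "x3": 0.0005}
print(forward_selection(weights, score=lambda s: sum(weights[f] for f in s)))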
Descriptive Modelling
Chris Williams, School of Informatics
University of Edinburgh
Descriptive models are a summary of the data
• Describing data by probability distributions
– Parametric models
– Mixture Models
– Non-parametric models
– Graphical models
• Clustering
– Partition-based Clustering Algorithms
– Hierarchical Clustering
– Probabilistic Clustering using Mixture Models
Reading: HMS, chapter 9
Describing data by probability distributions
• Parametric models, e.g. single multivariate Gaussian
• Mixture models, e.g. mixture of Gaussians, mixture of Bernoullis
• Non-parametric models, e.g. kernel density estimation
  f̂(x) = (1/n) Σ_{i=1}^{n} K_h(x − x_i)
Does not provide a good summary of the data, and is expensive to compute on large datasets
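A minimal NumPy sketch of a one-dimensional kernel density estimate with a Gaussian kernel of bandwidth h, matching the formula above (the code and names are my own):

import numpy as np

def kde(x_grid, data, h=0.3):
    # f_hat(x) = (1/n) * sum_i K_h(x - x_i), with a Gaussian kernel of bandwidth h
    diffs = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

data = np.concatenate([np.random.normal(-2, 0.5, 200), np.random.normal(1, 1.0, 300)])
x_grid = np.linspace(-5, 5, 11)
print(kde(x_grid, data))   # density estimate at each grid point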
Probability Distributions: Graphical Models
• Mixture of Independence Models
[Figure: a component/class node C with arrows to attribute nodes X1, ..., X6
(also the Naive Bayes model)]
• Fitting a given graphical model to data
• Search over graphical structures
Clustering
Clustering is the partitioning of a data set into groups so that points in one group are similar
to each other and are as different as possible from points in other groups
• Partition-based Clustering Algorithms
• Hierarchical Clustering
• Probabilistic Clustering using Mixture Models
Examples
• Split credit card owners into groups depending on what kinds of purchases they make
• In biology, can be used to derive plant and animal taxonomies
• Group documents on the web for information discovery
Defining a partition
k-means algorithm
• Clustering algorithm with k groups
• Mapping c from input example number to the group to which it belongs
• In R^d, assign to group j a cluster centre mj. Choose both c and the mj's so as to
minimize
  Σ_{i=1}^{n} |xi − m_{c(i)}|²
• Given c, optimization of the mj's is easy; mj is just the mean of the data vectors
assigned to class j
• Optimization over c: we cannot consider all possible groupings, so use the k-means
algorithm to find a local optimum
initialize centres m1, ..., mk
while (not terminated)
    for i = 1, ..., n
        calculate |xi − mj|² for all centres
        assign datapoint i to the closest centre
    end for
    recompute each mj as the mean of the datapoints assigned to it
end while
• This is a batch algorithm. There is also an on-line version, where the centres are
updated after each datapoint is seen
• Also k-medoids: find a representative object for each cluster centre
• Choice of k?
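A compact NumPy implementation of the batch loop above; random initialization from the data and the convergence test are my own choices for this sketch:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Batch k-means: alternate assignment to the closest centre and recomputation
    # of each centre as the mean of its assigned datapoints
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]    # initialize centres
    for _ in range(n_iters):
        # squared distances |x_i - m_j|^2 for all points and all centres
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)                                  # assign to closest centre
        new_centres = np.array([X[c == j].mean(axis=0) if np.any(c == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):                  # terminate when converged
            break
        centres = new_centres
    return centres, c

X = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(5, 1, (50, 2))])
centres, labels = kmeans(X, k=2)
print(centres)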
Hierarchical clustering
for i = 1, ..., n let Ci = {xi}
while there is more than one cluster left do
    let Ci and Cj be the clusters minimizing the distance D(Ci, Cj) between any two clusters
    Ci = Ci ∪ Cj
    remove cluster Cj
end
[Figure: scatter plot of 17 data points, labelled 1-17, used to illustrate hierarchical
clustering]
[Figure: dendrogram over the 17 points (p01-p17) produced by hierarchical clustering]
• Results can be displayed as a dendrogram
• This is agglomerative clustering; divisive techniques are also possible
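SciPy's hierarchical-clustering routines implement this kind of agglomerative scheme; a brief sketch, assuming SciPy and Matplotlib are available (the data here are random stand-ins for the 17 points in the figure):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.normal(size=(17, 2))           # 17 points, as in the figure above
Z = linkage(X, method='single')              # single-link agglomerative clustering
dendrogram(Z, labels=[f"p{i+1:02d}" for i in range(17)])
plt.show()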
Distance functions for hierarchical clustering
• Single link (nearest neighbour)
  D_sl(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
The distance between the two closest points, one from each cluster. Can lead to
“chaining”.
• Complete link (furthest neighbour)
  D_cl(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }
• Centroid measure: distance between clusters is the distance between their centroids
• Others possible
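A small NumPy sketch of the single-link and complete-link distances between two clusters (function names are mine):

import numpy as np

def pairwise_distances(A, B):
    # Euclidean distances d(x, y) for all x in cluster A and y in cluster B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):
    return pairwise_distances(A, B).min()    # distance between the two closest points

def complete_link(A, B):
    return pairwise_distances(A, B).max()    # distance between the two furthest points

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B))   # 3.0 6.0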
Probabilistic Clustering
• Using finite mixture models, trained with EM
• Can be extended to deal with outliers by using an extra, broad distribution to “mop up”
outliers
• Can be used to cluster non-vectorial data, e.g. mixtures of Markov models for
sequences
• Methods for comparing choice of k
• Disadvantage: parametric assumption for each component
• Disadvantage: complexity of EM relative to e.g. k-means
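Scikit-learn's GaussianMixture is one readily available EM-trained finite mixture of Gaussians; a brief illustrative sketch (two components is an arbitrary choice for this toy data):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.normal(0, 1, (100, 2)), np.random.normal(4, 1, (100, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM training
labels = gmm.predict(X)         # hard cluster assignments
resp = gmm.predict_proba(X)     # soft (probabilistic) responsibilities
print(gmm.means_)
print(gmm.bic(X))               # one way of comparing different choices of k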
Graphical Models: Causality
• J. Pearl, Causality, Cambridge UP (2000)
• To really understand causal structure, we need to predict effect of
interventions
• Semantics of do(X = 1) in a causal belief network, as opposed to
conditioning on X = 1
Causal Bayesian Networks
A causal Bayesian network is a Bayesian network in which each arc is interpreted as a
direct causal influence between a parent node and a child node, relative to the other
nodes in the network.
(Gregory Cooper, 1999, section 4)
Causation = behaviour under interventions
[Figure: the sprinkler network, with nodes Season (X1), Sprinkler (X2), Rain (X3),
Wet (X4) and Slippery (X5)]
• Example: smoking and lung cancer
An Algebra of Doing
• Available: algebra of seeing (observation)
e.g. what is the chance it rained if we see that the grass is wet?
  P(rain|wet) = P(wet|rain)P(rain)/P(wet)
• Needed: algebra of doing
e.g. what is the chance it rained if we make the grass wet?
  P(rain|do(wet)) = P(rain)
Truncated factorization formula
  P(x1, ..., xn | x̂i′) = ∏_{j≠i} P(xj | paj)   if xi = xi′, and 0 if xi ≠ xi′
Equivalently,
  P(x1, ..., xn | x̂i′) = P(x1, ..., xn) / P(xi′ | pai)   if xi = xi′, and 0 if xi ≠ xi′
Compare with conditioning:
  P(x1, ..., xn | xi′) = P(x1, ..., xn) / P(xi′)   if xi = xi′, and 0 if xi ≠ xi′
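To make the contrast concrete, here is a small enumeration over a hypothetical sprinkler-style network (the structure follows the sprinkler example above, but every probability value is invented purely for illustration): ordinary conditioning on wet grass changes the probability of rain, while the truncated factorization for do(wet) leaves it at its marginal value.

import itertools

# Hypothetical CPTs for Season -> Sprinkler, Season -> Rain, Sprinkler -> Wet, Rain -> Wet
P_season = {1: 0.5, 0: 0.5}                               # 1 = dry season
P_sprinkler = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.2, 0: 0.8}}  # P(sprinkler | season)
P_rain = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.7, 0: 0.3}}       # P(rain | season)

def P_wet(w, sprinkler, rain):                            # P(wet | sprinkler, rain)
    p = 0.95 if (sprinkler or rain) else 0.05
    return p if w == 1 else 1.0 - p

def joint(s, x, r, w):                                    # the usual factorization
    return P_season[s] * P_sprinkler[s][x] * P_rain[s][r] * P_wet(w, x, r)

def rain_given_see_wet():
    # P(rain = 1 | wet = 1): ordinary conditioning, by enumeration
    num = den = 0.0
    for s, x, r in itertools.product([0, 1], repeat=3):
        p = joint(s, x, r, 1)
        den += p
        num += p if r == 1 else 0.0
    return num / den

def rain_given_do_wet():
    # P(rain = 1 | do(wet = 1)): the truncated factorization drops the factor for the
    # intervened variable, so rain keeps its marginal distribution
    num = den = 0.0
    for s, x, r in itertools.product([0, 1], repeat=3):
        p = P_season[s] * P_sprinkler[s][x] * P_rain[s][r]
        den += p
        num += p if r == 1 else 0.0
    return num / den

print(rain_given_see_wet())   # about 0.50: seeing wet grass raises the chance of rain
print(rain_given_do_wet())    # 0.40: making the grass wet says nothing about rain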
Intervention as surgery on graphs
[Figure: the sprinkler network after the intervention do(Sprinkler = On): the arc from
Season (X1) into Sprinkler (X2) is cut, while Rain (X3), Wet (X4) and Slippery (X5)
keep their incoming arcs]
Controlling confounding bias
We wish to evaluate the effect of X on Y ; what other factors Z (known as
covariates or confounders) do we need to adjust for?
Simpson’s “paradox”: an event C increases the probability of E in a
population p, but decreases the probability of E in every subpopulation.
E.g. UC-Berkeley was investigated for sex-bias (1975). Overall there was a higher rate of
admission of males, but for every department there was a slight bias in favour
of admitting females.
[Explanation: females applied to more competitive departments where
admission rate was low]
• Another example: administering a drug gives rise to lower rates of
recovery than giving a placebo for both males and females, but overall it
can appear better
• What treatment would you give to a patient coming into your office? The
apparent answer is “if you know that the patient is male or female, don't give the
drug, but if gender is unknown, do!”. This answer is ridiculous!
• The correct answer to the question will depend not only on the observed
probabilities, but also on the assumed causal model. The diagrams below can
have the same P(C, E, F), but whether to use the combined or the gender-specific
tables depends on the diagram
[Figure: two causal diagrams with the same P(C, E, F). Left: Gender (F) influences both
Treatment (C) and Recovery (E), and C influences E; here use the gender-specific tables.
Right: Treatment (C) influences Blood Pressure (F), and both influence Recovery (E);
here use the combined table]
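The drug example can be reproduced with a few invented counts (illustrative numbers only): the drug looks worse within each gender but better in the combined table, because gender influences both who takes the drug and the recovery rate.

# Hypothetical counts (recovered, total), chosen only to illustrate Simpson's paradox
counts = {
    ("male",   "drug"):    (180, 200),
    ("male",   "placebo"): (19,  20),
    ("female", "drug"):    (5,   20),
    ("female", "placebo"): (60,  200),
}

def rate(pairs):
    recovered = sum(r for r, n in pairs)
    total = sum(n for r, n in pairs)
    return recovered / total

for gender in ("male", "female"):
    for treatment in ("drug", "placebo"):
        print(gender, treatment, round(rate([counts[(gender, treatment)]]), 2))
# male drug 0.9, male placebo 0.95, female drug 0.25, female placebo 0.3:
# the drug is worse for males and worse for females ...

for treatment in ("drug", "placebo"):
    pairs = [counts[(g, treatment)] for g in ("male", "female")]
    print(treatment, round(rate(pairs), 2))
# drug 0.84, placebo 0.36: ... yet the drug looks far better in the combined table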