ECT7110
Data Preprocessing
Prof. Wai Lam
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
• min-max normalization

    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

• z-score normalization

    v' = (v − mean_A) / stand_dev_A
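The two normalization formulas above translate directly into code. A minimal sketch in Python (the function names are my own, not from the slides):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max] (min-max normalization)."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    """Express v as the number of standard deviations from the mean (z-score)."""
    return (v - mean_a) / std_a
```

For example, `min_max_normalize(5, 0, 10)` gives 0.5. Note that min-max normalization requires the minimum and maximum of the attribute to be known in advance, while z-score normalization only needs the mean and standard deviation, which makes it useful when the actual minimum and maximum are unknown or dominated by outliers.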
Normalization Examples
• Suppose that the minimum and maximum values for
attribute income are 12,000 and 98,000 respectively.
How to map an income value of 73,600 to the range
of [0.0,1.0]?
• Suppose that the mean and standard deviation for the
attribute income are 54,000 and 15,000. How to map
an income value of 73,600 using z-score
normalization?
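Working the two questions through with the values given above:

```python
# Min-max normalization of income 73,600 from [12,000, 98,000] to [0.0, 1.0]:
v, min_a, max_a = 73_600, 12_000, 98_000
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))   # 0.716

# z-score normalization with mean 54,000 and standard deviation 15,000:
mean_a, std_a = 54_000, 15_000
v_z = (v - mean_a) / std_a
print(round(v_z, 3))        # 1.307
```

So an income of 73,600 maps to about 0.716 on [0.0, 1.0], and lies roughly 1.31 standard deviations above the mean.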
Data Reduction Strategies
• Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Dimensionality reduction
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
– Select a minimum set of features useful for data
mining
– Reduces the number of attributes appearing in the
discovered patterns, making them easier to understand
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (sum) for
each bucket
• Related to quantization problems.
(Figure: a histogram with counts 5–40 on the vertical axis over value buckets of width 10,000 spanning 0–100,000.)
Histograms
(Figure: two histograms of counts over price ($). Left: singleton buckets, one per distinct price from 5 to 30. Right: buckets denoting a continuous range of values: 1–10, 11–20, 21–30.)
Histograms
• How are buckets determined and the attribute values
partitioned?
- Equiwidth: The width of each bucket range is uniform
- Equidepth: The buckets are created so that, roughly, the
frequency of each bucket is constant
Histogram Examples
• Suppose that the values for the attribute age are:
13, 15, 16, 16, 19, 20, 20, 21, 21, 22, 25, 25, 25, 25, 30, 30, 30,
30, 32, 33, 33, 37, 40, 40, 40, 42, 42

Equiwidth Histogram:            Equidepth Histogram:
Bucket range    Frequency       Bucket range    Frequency
13-22           10              13-21           9
23-32           9               22-30           9
33-42           8               32-42           9
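The two partitionings can be reproduced in a few lines of Python. A sketch (the width-10 equiwidth boundaries and the 3-way equidepth split follow the slide's example; real systems may choose boundaries differently):

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 21, 22, 25, 25, 25, 25,
        30, 30, 30, 30, 32, 33, 33, 37, 40, 40, 40, 42, 42]

# Equiwidth: every bucket range has the same width (10 here).
equiwidth = {}
for lo in (13, 23, 33):
    hi = lo + 9
    equiwidth[(lo, hi)] = sum(lo <= a <= hi for a in ages)

# Equidepth: split the sorted data into runs of (roughly) equal frequency.
n_buckets = 3
depth = len(ages) // n_buckets          # 27 // 3 = 9 values per bucket
srt = sorted(ages)
equidepth = {}
for i in range(n_buckets):
    chunk = srt[i * depth:(i + 1) * depth]
    equidepth[(chunk[0], chunk[-1])] = len(chunk)

print(equiwidth)   # {(13, 22): 10, (23, 32): 9, (33, 42): 8}
print(equidepth)   # {(13, 21): 9, (22, 30): 9, (32, 42): 9}
```

This reproduces the table above: equiwidth buckets have uniform ranges but uneven counts (10, 9, 8), while equidepth buckets have uneven ranges but a constant frequency of 9.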
Sampling
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance
in the presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or
subpopulation of interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a
time).
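The sampling methods above can be sketched with Python's standard library (the tuple IDs and age strata here are illustrative, loosely following the figure on the next slide):

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed only so this sketch is reproducible

data = [f"T{i}" for i in range(1, 9)]            # tuples T1..T8

# Simple random sample WITHOUT replacement (SRSWOR), n = 4:
srswor = random.sample(data, 4)

# Simple random sample WITH replacement (SRSWR), n = 4:
srswr = [random.choice(data) for _ in range(4)]

# Stratified sampling: draw from each stratum separately so that
# skewed subpopulations (e.g. age groups) stay represented.
strata = {"T38": "young", "T117": "young", "T290": "middle-aged",
          "T326": "middle-aged", "T69": "senior", "T284": "senior"}
by_stratum = defaultdict(list)
for tid, group in strata.items():
    by_stratum[group].append(tid)
stratified = {g: random.sample(members, 1)
              for g, members in by_stratum.items()}
```

With simple random sampling, a rare stratum may not appear in the sample at all; stratified sampling guarantees each subpopulation of interest is represented in approximately its database proportion.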
Sampling
(Figure: sampling techniques applied to raw data. SRSWOR — simple random sample without replacement, n = 4: e.g. T5, T1, T8, T6 drawn from tuples T1–T8. SRSWR — simple random sample with replacement, n = 4: e.g. T4, T7, T4, T1. Cluster sample, m = 2: whole clusters of tuples, e.g. T1–T100 and T201–T300, are selected. Stratified sample according to age: tuples are drawn from each of the young, middle-aged, and senior strata.)