2. Data Preparation and Preprocessing

- Data and Its Forms
- Preparation
- Preprocessing and Data Reduction
Data Types and Forms

- Attribute-vector data (A1, A2, ..., An, C):
  - Data types: numeric, categorical (see the hierarchy for their relationship)
  - Static, dynamic (temporal)
- Other data forms:
  - distributed data
  - text, Web, meta data
  - images, audio/video
- You have seen most of them in the invited talks.
Data Preparation

- An important & time-consuming task in KDD
- Raw data issues:
  - High-dimensional data (20, 100, 1000 features)
  - Huge data size
  - Missing data
  - Outliers
  - Erroneous data (inconsistent, misrecorded, distorted)
Data Preparation Methods

- Data annotation, as in driving data analysis
  - Another example is image mining
- Data normalization
  - Different types
- Dealing with sequential or temporal data
  - Transform it to tabular form
- Removing outliers
Normalization

- Decimal scaling
  - v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  - For the range between -991 and 99, 10^k = 1000 (k = 3), so -991 -> -0.991
- Min-max normalization into a new max/min range:
  - v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - v = 73600 in [12000, 98000] -> v' = 0.716 in [0, 1] (new range)
- Zero-mean (z-score) normalization:
  - v' = (v - mean_A) / std_dev_A
  - (1, 2, 3), with mean 2 and std_dev 1, -> (-1, 0, 1)
  - If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 -> 1.225
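A minimal sketch of the three schemes above, using NumPy; the function names are mine, and the example values are the ones from the slide.

import numpy as np

def decimal_scaling(v):
    """Divide by 10^k for the smallest k with max(|v'|) < 1."""
    k = int(np.ceil(np.log10(np.max(np.abs(v)))))
    if np.max(np.abs(v)) / 10 ** k >= 1:   # exact powers of ten need one more digit
        k += 1
    return v / (10 ** k)

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale v linearly into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def zero_mean(v):
    """Z-score: subtract the mean, divide by the (sample) standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])
print(decimal_scaling(np.array([-991.0, 99.0])))     # [-0.991, 0.099]
print(min_max(income))                               # 73600 maps to ~0.716
print(zero_mean(np.array([1.0, 2.0, 3.0])))          # [-1., 0., 1.]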
Temporal Data

- The goal is to forecast t(n+1) from previous values
  - X = {t(1), t(2), ..., t(n)}
- An example with two features and window size 3
- How to determine the window size?

Original time series:

  Time  A   B
  1     7   215
  2     10  211
  3     6   214
  4     11  221
  5     12  210
  6     14  218

Transformed to tabular form (window size 3):

  Inst  A(n-2)  A(n-1)  A(n)  B(n-2)  B(n-1)  B(n)
  1     7       10      6     215     211     214
  2     10      6       11    211     214     221
  3     6       11      12    214     221     210
  4     11      12      14    221     210     218
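A minimal sketch of the sliding-window transformation shown above, turning two aligned series into the tabular form; the function name is mine, and the series values are the ones from the slide.

def window_transform(series, window):
    """Turn a dict of aligned time series into windowed instances.

    Each instance holds the last `window` values of every series,
    mirroring the A(n-2), A(n-1), A(n), B(n-2), ... layout above.
    """
    names = list(series)
    length = len(series[names[0]])
    rows = []
    for end in range(window, length + 1):
        row = []
        for name in names:
            row.extend(series[name][end - window:end])
        rows.append(row)
    return rows

A = [7, 10, 6, 11, 12, 14]
B = [215, 211, 214, 221, 210, 218]
for inst, row in enumerate(window_transform({"A": A, "B": B}, window=3), start=1):
    print(inst, row)
# 1 [7, 10, 6, 215, 211, 214]
# 2 [10, 6, 11, 211, 214, 221] ...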
Outlier Removal

- Data points inconsistent with the majority of the data
- Different kinds of outliers:
  - Valid: a CEO's salary
  - Noisy: one's age = 200, widely deviated points
- Removal methods:
  - Clustering
  - Curve-fitting
  - Hypothesis-testing with a given model
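As one concrete check (a common deviation-based rule, not one of the slide's three listed methods), a minimal sketch that flags points far from the mean; the function name and sample values are mine, apart from the age 200 example.

import numpy as np

def deviation_outliers(values, threshold=3.0):
    """Return a boolean mask marking points whose |z-score| exceeds the threshold."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.abs(z) > threshold

ages = np.array([23, 31, 45, 38, 29, 200])   # 200 is the noisy age from the slide
print(deviation_outliers(ages, threshold=2.0))   # only the 200 is flagged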
Data Preprocessing

- Data cleaning
  - missing data
  - noisy data
  - inconsistent data
- Data reduction
  - Dimensionality reduction
  - Instance selection
  - Value discretization
Missing Data

- Many types of missing data
  - not measured
  - truly missed
  - wrongly placed, and ?
- Some methods
  - leave as is
  - ignore/remove the instance with the missing value
  - manual fix (assign a value for implicit meaning)
  - statistical methods (majority, most likely, mean, nearest neighbor, ...)
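A minimal sketch of two of the statistical fixes listed above (mean for numeric columns, majority value for categorical ones); the records and column names are made up for illustration.

from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "dept": "sales"},
    {"age": None, "dept": "sales"},
    {"age": 40, "dept": None},
    {"age": 31, "dept": "hr"},
]

def impute(rows, numeric, categorical):
    """Fill numeric gaps with the column mean, categorical gaps with the majority value."""
    for col in numeric:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = mean(observed)
        for r in rows:
            if r[col] is None:
                r[col] = fill
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = mode(observed)          # majority value
        for r in rows:
            if r[col] is None:
                r[col] = fill
    return rows

print(impute(records, numeric=["age"], categorical=["dept"]))
# age gap becomes 32 (the mean), dept gap becomes "sales" (the majority)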
Noisy Data

- Random error or variance in a measured variable
  - inconsistent values for features or classes (process)
  - measuring errors (source)
- Noise is normally a minority in the data set
  - Why?
- Removing noise
  - Clustering/merging
  - Smoothing (rounding, averaging within a window)
  - Outlier detection (deviation-based or distance-based)
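A minimal sketch of one smoothing option named above, averaging within a sliding window (a simple moving average); the function name and the sample sequence are mine.

def moving_average(values, window=3):
    """Smooth a sequence by averaging each point with up to `window` preceding points."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        smoothed.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return smoothed

noisy = [10, 11, 35, 12, 13, 11]      # 35 is a spurious spike
print(moving_average(noisy, window=3))
# [10.0, 10.5, 18.67, 19.33, 20.0, 12.0] -- the spike is damped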
Inconsistent Data

- Inconsistent with our models or common sense
- Examples
  - The same name occurs differently in an application
  - Different names appear the same (Dennis vs. Denis)
  - Inappropriate values (Male-Pregnant, negative age)
  - One bank's database shows that 5% of its customers were born on 11/11/11
  - ...
Dimensionality Reduction

- Feature selection
  - select m from n features, m <= n
  - remove irrelevant, redundant features
  - the saving in search space
- Feature transformation (PCA)
  - form new features (a) in a new domain from the original features (f)
  - many uses, but it does not reduce the original dimensionality
  - often used in visualization of data
Feature Selection

- Problem illustration
  - Full set
  - Empty set
  - Enumeration
- Search
  - Exhaustive/Complete (Enumeration/B&B)
  - Heuristic (Sequential forward/backward)
  - Stochastic (generate/evaluate)
  - Individual features or subsets: generation/evaluation
Feature Selection (2)

- Goodness metrics
  - Dependency: depending on classes
  - Distance: separating classes
  - Information: entropy
  - Consistency: 1 - #inconsistencies/N
    - Example: (F1, F2, F3) and (F1, F3)
    - Both sets have a 2/6 inconsistency rate
  - Accuracy (classifier based): 1 - errorRate
- Their comparisons
  - Time complexity, number of features, removing redundancy

Example data:

  F1  F2  F3  C
  0   0   1   1
  0   0   1   0
  0   0   1   1
  1   0   0   1
  1   0   0   0
  1   0   0   0
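A minimal sketch of the consistency measure above: group instances by their values on the selected features, count everything outside the majority class in each group, and divide by N. On the slide's six instances it returns 2/6 for both (F1, F2, F3) and (F1, F3). The function name is mine.

from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, features):
    """1 - consistency: the fraction of instances outside the majority class
    among instances that agree on the selected features."""
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        key = tuple(inst[f] for f in features)
        groups[key].append(label)
    inconsistencies = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
    return inconsistencies / len(instances)

X = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
y = [1, 0, 1, 1, 0, 0]
print(inconsistency_rate(X, y, features=[0, 1, 2]))   # 2/6
print(inconsistency_rate(X, y, features=[0, 2]))      # 2/6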
Feature Selection (3)

- Filter vs. Wrapper model
  - Pros and cons
    - time
    - generality
    - performance such as accuracy
- Stopping criteria
  - thresholding (number of iterations, some accuracy, ...)
  - anytime algorithms
    - providing approximate solutions
    - solutions improve over time
Feature Selection (Examples)

- SFS using consistency (cRate)
  - select 1 from n, then 1 from n-1, n-2, ... features
  - increase the number of selected features until the prespecified cRate is reached
- LVF using consistency (cRate)
  1. randomly generate a subset S from the full set
  2. if it satisfies the prespecified cRate, keep S with minimum #S
  3. go back to 1 until a stopping criterion is met
  - LVF is an anytime algorithm
- Many other algorithms: SBS, B&B, ...
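A minimal sketch of the LVF loop above, reusing the inconsistency_rate helper and the X, y toy data from the sketch after Feature Selection (2); the fixed trial count and seed are arbitrary stand-ins for the stopping criterion.

import random

def lvf(instances, labels, n_features, max_crate, trials=100, seed=0):
    """Las Vegas Filter sketch: keep the smallest random subset whose
    inconsistency rate stays within max_crate."""
    rng = random.Random(seed)
    best = list(range(n_features))            # start from the full feature set
    for _ in range(trials):
        size = rng.randint(1, len(best))      # never look at subsets larger than the best
        subset = rng.sample(range(n_features), size)
        if inconsistency_rate(instances, labels, subset) <= max_crate and size < len(best):
            best = sorted(subset)
    return best

print(lvf(X, y, n_features=3, max_crate=2/6))
# a single-feature subset, e.g. [0]: F1 alone already reaches the 2/6 rate on this toy data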
Transformation: PCA

- D' = DA, where D is mean-centered (N x n)
- Calculate and rank the eigenvalues of the covariance matrix
- r = (lambda_1 + ... + lambda_m) / (lambda_1 + ... + lambda_n)
- Select the largest eigenvalues such that r > threshold (e.g., 0.95);
  the corresponding eigenvectors form A (n x m)
- Example of Iris data

Eigenvalues:

  m   E-value   Diff      Prop      Cumu
  1   2.91082   1.98960   0.72771   0.72770
  2   0.92122   0.77387   0.23031   0.95801
  3   0.14735   0.12675   0.03684   0.99485
  4   0.02061             0.00515   1.00000

Eigenvectors:

        V1         V2         V3         V4
  F1    0.522372   0.372318  -0.721017  -0.261996
  F2   -0.263355   0.925556   0.242033   0.124135
  F3    0.581254   0.021095   0.140892   0.801154
  F4    0.565611   0.065416   0.633801  -0.523546
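A minimal sketch of the eigen-decomposition route above, assuming scikit-learn is available for its bundled Iris data. The slide's eigenvalues sum to 4, which suggests they come from the correlation matrix (standardized features), so the sketch standardizes first; exact numbers may differ slightly depending on the N vs. N-1 convention.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                       # 150 x 4
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, so cov(Z) is the correlation matrix

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]          # rank eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

cumu = np.cumsum(eigvals) / eigvals.sum()
m = np.searchsorted(cumu, 0.95) + 1        # smallest m whose cumulative ratio reaches 0.95
A = eigvecs[:, :m]                         # n x m projection matrix
D_prime = Z @ A                            # N x m reduced data

print(eigvals)   # roughly 2.92, 0.91, 0.15, 0.02 -- close to the table
print(cumu, m)   # m = 2 for a 0.95 threshold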
Instance Selection

- Sampling methods
  - random sampling
  - stratified sampling
- Search-based methods
  - Representatives
  - Prototypes
  - Sufficient statistics (N, mean, stdDev)
  - Support vectors
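A minimal sketch of stratified sampling as listed above: sample from each class in proportion to its size so that rare classes are not lost; the function name and toy data are mine.

import random
from collections import defaultdict

def stratified_sample(instances, labels, fraction, seed=0):
    """Draw roughly `fraction` of the data from every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst, label in zip(instances, labels):
        by_class[label].append(inst)
    sample = []
    for label, group in by_class.items():
        k = max(1, round(fraction * len(group)))   # keep at least one instance per class
        sample.extend((x, label) for x in rng.sample(group, k))
    return sample

data = list(range(100))
labels = ["rare" if i < 10 else "common" for i in data]
print(len(stratified_sample(data, labels, fraction=0.2)))   # ~20, with both classes represented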
Value Discretization

- Binning methods
  - Equal-width
  - Equal-frequency
  - Class information is not used
- Entropy-based
- ChiMerge
  - Chi2
Binning

- Attribute values (for one attribute, e.g., age):
  - 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1: 0, 4              [-, 10) bin
  - Bin 2: 12, 16, 16, 18    [10, 20) bin
  - Bin 3: 24, 26, 28        [20, +) bin
  - We use - to denote negative infinity, + for positive infinity
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1: 0, 4, 12          [-, 14) bin
  - Bin 2: 16, 16, 18        [14, 21) bin
  - Bin 3: 24, 26, 28        [21, +) bin
- Any problems with the above methods?
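A minimal sketch of both binning schemes on the age values above; the function names are mine, and the equal-width version simply starts bins at multiples of the width (the slide's outermost bins are open-ended, which this simplification ignores).

def equal_width_bins(values, width):
    """Group values into bins [k*width, (k+1)*width)."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return list(bins.values())

def equal_frequency_bins(values, density):
    """Group sorted values into bins holding `density` values each."""
    ordered = sorted(values)
    return [ordered[i:i + density] for i in range(0, len(ordered), density)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, width=10))       # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(ages, density=3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]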
Entropy-based

- Given attribute-value/class pairs:
  - (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization:
  - Intuitively, find the best split so that the bins are as pure as possible
  - Formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 be the fraction of P pairs, and n = 5/9 be the fraction of N pairs
- Entropy(S) = - p log p - n log n
  - Smaller entropy: the set is relatively pure; the smallest is 0
  - Larger entropy: the set is mixed; the largest is 1
Entropy-based (2)

- Let v be a possible split. Then S is divided into two sets:
  - S1: value <= v and S2: value > v
- Information of the split:
  - I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split:
  - Gain(v,S) = Entropy(S) - I(S1,S2)
- Goal: the split with maximal information gain
  - Possible splits: midpoints between any two consecutive values
  - For v = 14, I(S1,S2) = 0 + 6/9 * Entropy(S2) = 6/9 * 0.65 = 0.433
  - Gain(14,S) = Entropy(S) - 0.433
  - Maximum Gain means minimum I
- The best split is found after examining all possible split points
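A minimal sketch of the binarization step above: compute Entropy(S), evaluate every midpoint split, and return the one with the largest information gain; the function names are mine, and the nine (value, class) pairs are the slide's.

from math import log2

def entropy(labels):
    """H = -sum p_c log2 p_c over the classes present in `labels`."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split(pairs):
    """Return (split_value, information_gain) maximizing the gain."""
    pairs = sorted(pairs)
    labels = [c for _, c in pairs]
    base = entropy(labels)
    candidates = {(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:]) if a != b}
    best = None
    for v in candidates:
        left = [c for x, c in pairs if x <= v]
        right = [c for x, c in pairs if x > v]
        info = len(left) / len(pairs) * entropy(left) + len(right) / len(pairs) * entropy(right)
        gain = base - info
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

S = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"), (18, "P"),
     (24, "N"), (26, "N"), (28, "N")]
print(best_split(S))   # v = 14.0 gives the largest gain on this data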
ChiMerge and Chi2

- Given attribute-value/class pairs
- Build a contingency table for every pair of adjacent intervals (I)
- Chi-squared test (goodness of fit):
  - chi^2 = sum_{i=1..2} sum_{j=1..k} (Aij - Eij)^2 / Eij
  - Aij is the observed count of class j in interval i; Eij is the expected count
- Parameters: df = k-1 and a p% level of significance
- The Chi2 algorithm provides an automatic way to adjust p

Contingency table for two adjacent intervals (k classes; here k = 2):

         C1    C2    total
  I-1    A11   A12   R1
  I-2    A21   A22   R2
  total  C1    C2    N

Example attribute-value/class pairs:

  F    C
  12   P
  12   N
  12   P
  16   N
  16   N
  16   P
  24   N
  24   N
  24   N
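A minimal sketch of the chi-squared statistic for one pair of adjacent intervals, with Eij taken as Ri * Cj / N; the function name is mine, the class lists come from the value-12 and value-16 rows above, and the merging loop of ChiMerge itself is left out.

def chi_squared(interval1, interval2, classes):
    """Chi-squared statistic for two adjacent intervals, given their class labels."""
    rows = [interval1, interval2]
    n = len(interval1) + len(interval2)
    col_totals = {c: interval1.count(c) + interval2.count(c) for c in classes}
    stat = 0.0
    for row in rows:
        for c in classes:
            expected = len(row) * col_totals[c] / n   # Eij = Ri * Cj / N
            observed = row.count(c)                   # Aij
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

i1 = ["P", "N", "P"]     # classes of the pairs with value 12
i2 = ["N", "N", "P"]     # classes of the pairs with value 16
print(chi_squared(i1, i2, classes=["P", "N"]))   # ~0.667; a small value favors merging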
Summary

- Data have many forms
  - The attribute-vector is the most common form
- Raw data need to be prepared and preprocessed for data mining
  - Data miners have to work on the data provided
  - Domain expertise is important in DPP
  - Data preparation: Normalization, Transformation
  - Data preprocessing: Cleaning and Reduction
- DPP is a critical and time-consuming task
  - Why?
Bibliography

- H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley InterScience.
- H. Liu & H. Motoda, editors, 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.