User Guide
Numerical Data Preprocessing
Analytic Description
Data preprocessing means manipulating data into a form suitable for further analysis and
processing. It is essential for successful data mining: poor-quality data typically results in
incorrect and unreliable data mining results.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends,
and is likely to contain many errors. Data preprocessing is a proven method of resolving such
issues.
Data preprocessing includes data cleaning, transformations, standardizations, normalizations,
data binning, feature extraction, handling outliers and/or null values, imputing missing data, etc.
The product of data preprocessing is the final training set.
In this catalog item we focus on the following:
- Standardization
- Normalization
- Transformation
- Binarization
- Binning
For more information, see Wikipedia or Scikit-learn.
Standardization
Standardization of datasets is a common requirement for many machine learning estimators:
they might behave badly if the individual features do not look more or less like standard
normally distributed data.
For more information, see Scikit-learn Preprocessing.
Z-score
In practice we often ignore the shape of the distribution and just transform the data to center it by
removing the mean value of each feature, then scale it by dividing non-constant features by their
standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the
RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that
all features are centered around zero and have variance in the same order. If a feature has a
variance that is orders of magnitude larger than others, it might dominate the objective function
and make the estimator unable to learn from other features correctly as expected.
In Z-score standardization, we subtract the mean of the train data from all samples and divide
the result by the standard deviation of the train data.
For more information, see Scikit-learn StandardScaler.
Z-score equation:
z = (x − μ) / σ
where μ and σ are the mean and standard deviation of the train data.
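A minimal NumPy sketch of this equation (illustrative only, not the catalog's implementation; the sample values are made up):
# python sketch of the z-score equation
import numpy as np

train   = np.array([0.0, 10.0, 20.0, 30.0, 90.0, 900.0])
zscored = (train - train.mean()) / train.std()  # center on the mean, scale by the std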
Min-Max
An alternative standardization is scaling features to lie between zero and one. This can be
achieved using Min-Max standardization.
The motivations for this scaling include robustness to very small standard deviations of features
and preserving zero entries in sparse data.
In Min-Max standardization, we subtract the minimum of the train data from all samples and
divide the result by the min-max distance (the range) of the train data.
For more information, see Scikit-learn MinMaxScaler.
Min-Max equation:
x′ = (x − min) / (max − min)
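A minimal NumPy sketch of this equation (illustrative only; the sample values are made up):
# python sketch of the min-max equation
import numpy as np

train  = np.array([0.0, 10.0, 20.0, 30.0, 90.0, 900.0])
scaled = (train - train.min()) / (train.max() - train.min())  # result lies in [0, 1]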
MaxAbs
MaxAbs standardization works in a very similar fashion, but scales the training data so that it
lies within the range [-1, 1] by dividing each feature by its maximum absolute value. It is
meant for data that is already centered at zero or for sparse data.
In MaxAbs standardization, we divide all samples by the maximum absolute value of the
train data.
For more information, see Scikit-learn MaxAbsScaler.
MaxAbs equation:
x′ = x / max(|x|)
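A minimal NumPy sketch of this equation (illustrative only; the sample values are made up):
# python sketch of the maxabs equation
import numpy as np

train  = np.array([-20.0, 10.0, 30.0, 90.0])
scaled = train / np.abs(train).max()  # result lies in [-1, 1]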
Robust
If your data contains many outliers, scaling using the mean and variance of the data is likely to not
work very well. In these cases, you can use Robust standardization. It uses more robust estimates
for the center and range of your data.
In Robust standardization, we subtract the median of the train data from all samples and
divide the result by the IQR of the train data. The IQR (interquartile range) is the
distance between the 75th and 25th percentiles.
For more information, see Scikit-learn RobustScaler.
Robust equation:
x′ = (x − median) / IQR
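A minimal NumPy sketch of this equation (illustrative only; the sample values are made up):
# python sketch of the robust equation
import numpy as np

train    = np.array([0.0, 10.0, 20.0, 30.0, 90.0, 900.0])
q25, q75 = np.percentile(train, [25, 75])
scaled   = (train - np.median(train)) / (q75 - q25)  # center on the median, scale by the IQR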
Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be
useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify
the similarity of any pair of samples.
This assumption is the base of the Vector Space Model often used in text classification and
clustering contexts.
For more information, see Scikit-learn Preprocessing or Scikit-learn Normalization.
L1
In L1 normalization, we divide the sample data by the sum of the absolute values of the train
data.
L1 equation:
x′ = x / Σ|xᵢ|
L2
In L2 normalization, we divide the sample data by the square root of the sum of squares of the
train data.
L2 equation:
x′ = x / √(Σ xᵢ²)
Max
In Max normalization, we simply divide the sample data by the maximum value of the
train data.
Max equation:
x′ = x / max(x)
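A minimal NumPy sketch of the three norms as defined above (illustrative only; note that scikit-learn's "max" norm divides by the maximum absolute value, while the definition above uses the maximum value):
# python sketch of the l1, l2 and max norms
import numpy as np

x       = np.array([1.0, -2.0, 4.0])
l1_out  = x / np.abs(x).sum()          # divide by the sum of absolute values
l2_out  = x / np.sqrt((x ** 2).sum())  # divide by the root of the sum of squares
max_out = x / x.max()                  # divide by the maximum value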
Binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This
can be useful for downstream probabilistic estimators that assume the input data is
distributed according to a multivariate Bernoulli distribution.
It is also common among the text processing community to use binary feature values (probably to
simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF
valued features often perform slightly better in practice.
In binarization, we assign 0 or 1 to each sample value by comparing it with the mean of the
sample data or with a given threshold.
For more information, see Scikit-learn Preprocessing or Scikit-learn Binarization.
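A minimal NumPy sketch of thresholding at the mean (illustrative only; the sample values are made up):
# python sketch of binarization with a mean threshold
import numpy as np

x      = np.array([2.0, 8.0, 3.0, 9.0])
binary = (x > x.mean()).astype(float)  # 1.0 above the threshold, 0.0 otherwise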
Transformation
A transformation forwards the sample data to a user-defined function and returns the result of
this function. This is useful for stateless transformations such as taking the log of frequencies,
doing custom scaling, etc.
When using transformations, we can get the LOG, LOG1P or SQRT values of the sample data.
For more information, see Scikit-learn Transformations.
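A minimal NumPy sketch of the three stateless transforms named above (illustrative only; the sample values are made up):
# python sketch of the log, log1p and sqrt transforms
import numpy as np

x     = np.array([1.0, 10.0, 100.0])
log   = np.log(x)    # natural log; undefined at zero
log1p = np.log1p(x)  # log(1 + x); safe at zero
sqrt  = np.sqrt(x)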
Binning
Data binning or bucketing is a data preprocessing technique used to reduce the effects of minor
observation errors. The original data values which fall in a given small interval, a bin, are replaced
by a value representative of that interval.
Statistical data binning is a way to group a number of more or less continuous values into a
smaller number of "bins". For example, if you have data about a group of people, you might want
to arrange their ages into a smaller number of age intervals.
For more information, see Wikipedia.
Equal-width
In equal-width binning, we divide the data into k bins of equal width: the value range of the
data is split into k equal-width parts, as in a histogram.
Equal-width binning equations:
width = (max − min) / k
lower bound of bin i = min + (i − 1) · width, for i = 1, …, k
Example of equal-width binning with 3 bins:
Data:      0 1 2 2 5 5 6 9 9
BinWidth:  (9 − 0) / 3 = 3
BinBounds: Bin1 = 0, Bin2 = 0 + 3 = 3, Bin3 = 0 + 2·3 = 6
Results:   1 1 1 1 2 2 3 3 3
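The worked example can be reproduced with NumPy as follows (illustrative only, not the catalog's implementation):
# python sketch of equal-width binning, reproducing the example above
import numpy as np

data   = np.array([0, 1, 2, 2, 5, 5, 6, 9, 9])
k      = 3
width  = (data.max() - data.min()) / k         # (9 - 0) / 3 = 3
bounds = data.min() + width * np.arange(1, k)  # inner bounds: 3, 6
bins   = np.digitize(data, bounds) + 1         # -> 1 1 1 1 2 2 3 3 3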
Equal-frequency
In equal-frequency (or equal-freq) binning, we divide the data into k bins with approximately
equal value counts. The bin size equals the total number of values divided by k.
The algorithm sorts the data and assigns bin values in order; when a bin reaches the bin
size, it moves on to the next bin value. Identical values always go into the same bin.
Equal-freq binning equation:
size = n / k
where n is the number of samples.
Example of equal-freq binning with 3 bins:
Data:      0 1 2 3 5 5 6 9 9
BinSize:   9 / 3 = 3
Results:   1 1 1 2 2 2 3 3 3
BinCounts: Bin1 = 3, Bin2 = 3, Bin3 = 3

Example of equal-freq binning with 3 bins and duplicate values:
Data:      0 1 2 2 5 5 6 9 9
BinSize:   9 / 3 = 3
Results:   1 1 1 1 2 2 2 3 3
BinCounts: Bin1 = 4 (because of duplicates), Bin2 = 3, Bin3 = 2 (because of Bin1)
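Both worked examples can be reproduced with the following sketch (illustrative only, not the catalog's implementation); replacing the 3 in the data with a duplicate 2 yields the second case:
# python sketch of equal-freq binning, reproducing the examples above
import numpy as np

data = np.sort(np.array([0, 1, 2, 3, 5, 5, 6, 9, 9]))
k    = 3
size = len(data) / k  # 9 / 3 = 3
bins, b, count = [], 1, 0
for i, value in enumerate(data):
    # open the next bin once the current one is full, unless the value
    # duplicates its predecessor (identical values stay in the same bin)
    if count >= size and not (i > 0 and value == data[i - 1]):
        b, count = b + 1, 0
    bins.append(b)
    count += 1
# -> 1 1 1 2 2 2 3 3 3; with a duplicate 2 instead of the 3: 1 1 1 1 2 2 2 3 3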
Input Description
The input has two main components stored in JSON format.
{"data": data, "config": config}
data
The input contains the raw data, stored as a list, that we want to preprocess. The input can
contain either just one data set or two data sets (train and test samples). If there are two data
sets, the transformations/statistics derived from the train data set are applied to the test data set.
One input
Example Input
{"data": [0.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0, 90.0, 90.0,
900.0]}
Two inputs
Example Input
{"data": {"train": [0.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0,
90.0, 90.0, 900.0], "test": [2.0, 3.0, 5.0, -10.0, 11.0, 11.0, 50.0, 50.0,
110.0, 120.0, 600.0, 700.0]}}
config
Standardization
Example Z-score:
{"input": "one", "method": "standardization", "type": "zscore"}
Example Min-Max:
{"input": "one", "method": "standardization", "type": "minmax"}
Example MaxAbs:
{"input": "one", "method": "standardization", "type": "maxabs"}
Example Robust:
{"input": "one", "method": "standardization", "type": "robust"}
Normalization
Example L1:
{"input": "one", "method": "normalization", "type": "l1"}
Example L2:
{"input": "one", "method": "normalization", "type": "l2"}
Example Max:
{"input": "one", "method": "normalization", "type": "max"}
Binarization
Example Mean threshold:
{"input": "one", "method": "binarization", "threshold": "mean"}
Example Custom threshold:
{"input": "one", "method": "binarization", "threshold": "1.5"}
Transformation
Example Log1p:
{"input": "one", "method": "transformation", "type": "log1p"}
Example Log:
{"input": "one", "method": "transformation", "type": "log"}
Example Sqrt:
{"input": "one", "method": "transformation", "type": "sqrt"}
Binning
Example Equal-width:
{"input": "one", "method": "binning", "type": "width", "bins": 3}
Example Equal-freq:
{"input": "one", "method": "binning", "type": "freq", "bins": 3}
Output Description
The algorithm returns the transformed data set(s) as a list of floats in JSON format. There are as
many outputs as inputs.
One output
Example Output
{"result": [-0.41939506918954805, -0.378143095170904, -0.378143095170904,
-0.378143095170904, -0.378143095170904, -0.3368911211522599,
-0.3368911211522599, -0.29563914713361583, -0.29563914713361583,
-0.04812730302175143, -0.04812730302175143, 3.2932825924884184]}
Two outputs
Example Output
{"result": {"test": [-0.41114467438581925, -0.4070194769839548,
-0.398769082180226, -0.4606470432081921, -0.37401789776903954,
-0.37401789776903954, -0.2131351990963277, -0.2131351990963277,
0.0343766450155367, 0.07562861903418078, 2.055723371929096,
2.468243112115537], "train": [-0.41939506918954805, -0.378143095170904,
-0.378143095170904, -0.378143095170904, -0.378143095170904,
-0.3368911211522599, -0.3368911211522599, -0.29563914713361583,
-0.29563914713361583, -0.04812730302175143, -0.04812730302175143,
3.2932825924884184]}}
Usage samples
Z-score standardization
We created a sample data set and want to standardize it with the Z-score method.
# python snippet to standardize a sample data set with the z-score method
import numpy as np

input_data   = np.random.normal(50, 10, 1000)
input_param  = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'standardization', 'type': 'zscore'}
response     = preproc(input_param, config_param)
The following figure shows the data distribution before and after the Z-score standardization.
Figure 1. Data distribution before and after Z-score standardization
Min-Max standardization
We created a sample data set and want to standardize it with the Min-Max method.
# python snippet to standardize a sample data set with the min-max method
import numpy as np

input_data   = np.random.normal(50, 10, 1000)
input_param  = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'standardization', 'type': 'minmax'}
response     = preproc(input_param, config_param)
The following figure shows the data distribution before and after the Min-Max standardization.
Figure 2. Data distribution before and after Min-Max standardization
Log1p transformation
We created a sample data set and want to transform it with the Log1p method.
# python snippet to transform a sample data set with the log1p method
import numpy as np

input_data   = np.random.uniform(10, 50, 1000)
input_param  = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'transformation', 'type': 'log1p'}
response     = preproc(input_param, config_param)
The following figure shows the data distribution before and after the Log1p transformation.
Figure 3. Data distribution before and after Log1p transformation
Equal-width binning
We created two sample data sets (one with a normal and one with a uniform distribution) and
want to bin them with the equal-width method.
# python snippet to bin sample data sets with the equal-width method
import numpy as np

input_data    = np.random.normal(50, 10, 1000)
input_param   = {'data': input_data.tolist()}
config_param  = {'input': 'one', 'method': 'binning', 'type': 'width', 'bins': 3}
response_norm = preproc(input_param, config_param)

input_data    = np.random.uniform(10, 50, 1000)
input_param   = {'data': input_data.tolist()}
response_uni  = preproc(input_param, config_param)
The following figure shows the normal data distribution before and after the equal-width binning.
Figure 4. Normal data distribution before and after Equal-width binning
The following figure shows the uniform data distribution before and after the equal-width binning.
Figure 5. Uniform data distribution before and after Equal-width binning
Equal-freq binning
We created two sample data sets (one with a normal and one with a uniform distribution) and
want to bin them with the equal-freq method.
# python snippet to bin sample data sets with the equal-freq method
import numpy as np

input_data    = np.random.normal(50, 10, 1000)
input_param   = {'data': input_data.tolist()}
config_param  = {'input': 'one', 'method': 'binning', 'type': 'freq', 'bins': 3}
response_norm = preproc(input_param, config_param)

input_data    = np.random.uniform(10, 50, 1000)
input_param   = {'data': input_data.tolist()}
response_uni  = preproc(input_param, config_param)
The following figure shows the normal data distribution before and after the equal-freq binning.
Figure 6. Normal data distribution before and after Equal-freq binning
The following figure shows the uniform data distribution before and after the equal-freq binning.
Figure 7. Uniform data distribution before and after Equal-freq binning
Two inputs standardization
We have two sample data sets (one train and one test) and want to standardize them with the
Min-Max method before giving the data to the modeling step.
# python snippet to standardize two sample data sets with the min-max method
import numpy as np

train_data   = np.random.uniform(10, 50, 1000)
test_data    = np.random.uniform(10, 50, 1000)
input_param  = {'data': {'train': train_data.tolist(), 'test': test_data.tolist()}}
config_param = {'input': 'two', 'method': 'standardization', 'type': 'minmax'}
response     = preproc(input_param, config_param)
The following figure shows the data distributions before and after the Min-Max standardizations.
Figure 8. Train and test data distributions before and after Min-Max standardization
©2016 General Electric Company – All rights reserved.
GE, the GE Monogram and Predix are trademarks of General Electric Company.
No part of this document may be distributed, reproduced or posted without the express written
permission of General Electric Company.
THIS DOCUMENT AND ITS CONTENTS ARE PROVIDED "AS IS," WITH NO REPRESENTATION OR
WARRANTIES OF ANY KIND, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF DESIGN, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. ALL
OTHER LIABILITY ARISING FROM RELIANCE UPON ANY INFORMATION CONTAINED HEREIN IS
EXPRESSLY DISCLAIMED.