User Guide

Numerical Data Preprocessing

Analytic Description

Data preprocessing means manipulating data into a form suitable for further analysis and processing. It is essential for successful data mining: poor-quality data typically results in incorrect and unreliable data mining results. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. It includes data cleaning, transformations, standardizations, normalizations, data binning, feature extraction, handling outliers and/or null values, imputing missing data, etc. The product of data preprocessing is the final training set.

In this catalog item we focus on the following:

- Standardizations
- Normalizations
- Transformations
- Binarization
- Binnings

For more information, see Wiki or Scikit-learn.

Standardization

Standardization of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features do not more or less look like standard normally distributed data. For more information, see Scikit-learn Preprocessing.

Z-score

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation. Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly.
When applying Z-score standardization, we subtract from all samples the mean value of the train data and divide the result by the standard deviation of the train data. For more information, see Scikit-learn StandardScaler.

Z-score equation: z = (x - mean(train)) / std(train)

Min-Max

An alternative standardization is scaling features to lie between zero and one, which can be achieved with Min-Max standardization. The motivations for this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data. When applying Min-Max standardization, we subtract from all samples the minimum of the train data and divide the result by the min-max distance of the train data. For more information, see Scikit-learn MinMaxScaler.

Min-Max equation: x' = (x - min(train)) / (max(train) - min(train))

MaxAbs

MaxAbs standardization works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1] by dividing each feature by its largest absolute value. It is meant for data that is already centered at zero or for sparse data. When applying MaxAbs standardization, we divide all samples by the maximum absolute value of the train data. For more information, see Scikit-learn MaxAbsScaler.

MaxAbs equation: x' = x / max(|train|)

Robust

If your data contains many outliers, scaling with the mean and variance of the data is likely to not work very well. In these cases, you can use Robust standardization, which uses more robust estimates for the center and range of your data. When applying Robust standardization, we subtract from all samples the median of the train data and divide the result by the IQR of the train data. The IQR (interquartile range) is the distance between the 75th and 25th percentiles. For more information, see Scikit-learn RobustScaler.

Robust equation: x' = (x - median(train)) / IQR(train)

Normalization

Normalization is the process of scaling individual samples to have unit norm.
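Before moving on to normalization: the four standardization schemes described above can be sketched in plain NumPy on the example train data used later in this guide. This is a sketch only, not the service's implementation; the catalog item is backed by the Scikit-learn scalers, which may differ in details.

```python
import numpy as np

# Train data from the input example later in this guide
x = np.array([0.0, 10.0, 10.0, 10.0, 10.0, 20.0,
              20.0, 30.0, 30.0, 90.0, 90.0, 900.0])

# Z-score: center on the mean, scale by the standard deviation
zscore = (x - x.mean()) / x.std()

# Min-Max: shift by the minimum, scale by the min-max distance -> [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# MaxAbs: divide by the largest absolute value -> [-1, 1]
maxabs = x / np.abs(x).max()

# Robust: center on the median, scale by the IQR (75th - 25th percentiles)
q75, q25 = np.percentile(x, [75, 25])
robust = (x - np.median(x)) / (q75 - q25)
```

With this data set the first Z-score value is about -0.4194, matching the output example later in this guide.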
This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the Vector Space Model often used in text classification and clustering contexts. For more information, see Scikit-learn Preprocessing or Scikit-learn Normalization.

L1

When applying L1 normalization, we divide the sample data by the sum of the absolute values of the train data.

L1 equation: x' = x / sum(|train|)

L2

When applying L2 normalization, we divide the sample data by the root of the sum of squares of the train data.

L2 equation: x' = x / sqrt(sum(train^2))

Max

When applying Max normalization, we simply divide the sample data by the maximum value of the train data.

Max equation: x' = x / max(train)

Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution. It is also common in the text processing community to use binary feature values (probably to simplify the probabilistic reasoning), even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice. When applying binarization, we assign 0 or 1 to the sample data according to the mean of the sample data or a given threshold. For more information, see Scikit-learn Preprocessing or Scikit-learn Binarization.

Transformation

A transformation forwards the sample data to a user-defined function and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc. When using transformations, we can get the LOG, LOG1P or SQRT values of the sample data. For more information, see Scikit-learn Transformations.

Binning

Data binning or bucketing is a data preprocessing technique used to reduce the effects of minor observation errors.
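Before turning to binning in detail, the normalization and binarization rules above can be sketched in plain NumPy (a sketch under the descriptions in this guide, not the service itself):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

l1 = x / np.abs(x).sum()          # L1: sum of absolute values becomes 1
l2 = x / np.sqrt((x ** 2).sum())  # L2: Euclidean norm becomes 1
mx = x / x.max()                  # Max: largest value becomes 1

# Binarization with a mean threshold: 1 above the mean, 0 otherwise
binary = (x > x.mean()).astype(int)
```

Here the mean of x is 2.5, so binarization yields [0, 0, 1, 1].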
The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval. Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals. For more information, see Wiki.

Equal-width

When applying equal-width binning, we divide the data into k bins of equal width. The width of a bin equals the range of the data divided into k equal-width parts.

Equal-width binning equation: BinWidth = (max - min) / k

Example of equal-width binning with 3 bins:

Data: 0 1 2 2 5 5 6 9 9
BinWidth: (9 - 0) / 3 = 3
BinBounds: Bin1 = 0, Bin2 = 0 + 3 = 3, Bin3 = 0 + 2*3 = 6
Results: 1 1 1 1 2 2 3 3 3

Equal-frequency

When applying equal-frequency (or equal-freq) binning, we divide the data into k bins of approximately equal value counts. The size of a bin equals the total count divided into k equal-size parts. The algorithm sorts the data and starts to assign the bin values; when a bin reaches the bin size, it assigns the next bin value. Identical values go into the same bin.

Equal-freq binning equation: BinSize = n / k

Example of equal-freq binning with 3 bins:

Data: 0 1 2 3 5 5 6 9 9
BinSize: 9 / 3 = 3
Results: 1 1 1 2 2 2 3 3 3
BinCounts: Bin1 = 3, Bin2 = 3, Bin3 = 3

Data: 0 1 2 2 5 5 6 9 9
BinSize: 9 / 3 = 3
Results: 1 1 1 1 2 2 2 3 3
BinCounts: Bin1 = 4 (because of duplicates), Bin2 = 3, Bin3 = 2 (because of Bin1)

Input Description

The input has two main components stored in JSON format.

{"data": data, "config": config}

data

The input contains the raw data, stored as a list, which we want to preprocess. The input can contain either one data set or two data sets (train and test samples). If there are two data sets, the transformations/statistics derived from the train data set are applied to the test data set.
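The binning walkthroughs above can be reproduced with a short plain-Python sketch. This follows the bin-assignment rules as described in this guide (duplicates share a bin in equal-freq binning); the service's actual implementation may differ in edge cases.

```python
data = [0, 1, 2, 2, 5, 5, 6, 9, 9]
k = 3

# Equal-width: width = (max - min) / k; the maximum value lands in the last bin
width = (max(data) - min(data)) / k          # (9 - 0) / 3 = 3
equal_width = [min(int((v - min(data)) // width) + 1, k) for v in data]

# Equal-frequency: roughly equal counts per bin; identical values share a bin
size = len(data) / k                         # 9 / 3 = 3
srt = sorted(data)
equal_freq, bin_no, count = [], 1, 0
for i, v in enumerate(srt):
    if count >= size and v != srt[i - 1]:
        bin_no, count = bin_no + 1, 0        # start the next bin
    equal_freq.append(bin_no)
    count += 1
```

For this data, equal_width gives [1, 1, 1, 1, 2, 2, 3, 3, 3] and equal_freq gives [1, 1, 1, 1, 2, 2, 2, 3, 3]; Bin1 receives four values because of the duplicate 2.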
One input

Example Input:

{"data": [0.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0, 90.0, 90.0, 900.0]}

Two inputs

Example Input:

{"data": {"train": [0.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0, 90.0, 90.0, 900.0], "test": [2.0, 3.0, 5.0, -10.0, 11.0, 11.0, 50.0, 50.0, 110.0, 120.0, 600.0, 700.0]}}

config

Standardization

Example Z-score: {"input": "one", "method": "standardization", "type": "zscore"}
Example Min-Max: {"input": "one", "method": "standardization", "type": "minmax"}
Example MaxAbs: {"input": "one", "method": "standardization", "type": "maxabs"}
Example Robust: {"input": "one", "method": "standardization", "type": "robust"}

Normalization

Example L1: {"input": "one", "method": "normalization", "type": "l1"}
Example L2: {"input": "one", "method": "normalization", "type": "l2"}
Example Max: {"input": "one", "method": "normalization", "type": "max"}

Binarization

Example Mean threshold: {"input": "one", "method": "binarization", "threshold": "mean"}
Example Custom threshold: {"input": "one", "method": "binarization", "threshold": "1.5"}

Transformation

Example Log1p: {"input": "one", "method": "transformation", "type": "log1p"}
Example Log: {"input": "one", "method": "transformation", "type": "log"}
Example Sqrt: {"input": "one", "method": "transformation", "type": "sqrt"}

Binning

Example Equal-width: {"input": "one", "method": "binning", "type": "width", "bins": 3}
Example Equal-freq: {"input": "one", "method": "binning", "type": "freq", "bins": 3}

Output Description

The algorithm returns the transformed data set(s) as a list of floats in JSON format. There will be as many outputs as inputs.
One output

Example Output:

{"result": [-0.41939506918954805, -0.378143095170904, -0.378143095170904, -0.378143095170904, -0.378143095170904, -0.3368911211522599, -0.3368911211522599, -0.29563914713361583, -0.29563914713361583, -0.04812730302175143, -0.04812730302175143, 3.2932825924884184]}

Two outputs

Example Output:

{"result": {"test": [-0.41114467438581925, -0.4070194769839548, -0.398769082180226, -0.4606470432081921, -0.37401789776903954, -0.37401789776903954, -0.2131351990963277, -0.2131351990963277, 0.0343766450155367, 0.07562861903418078, 2.055723371929096, 2.468243112115537], "train": [-0.41939506918954805, -0.378143095170904, -0.378143095170904, -0.378143095170904, -0.378143095170904, -0.3368911211522599, -0.3368911211522599, -0.29563914713361583, -0.29563914713361583, -0.04812730302175143, -0.04812730302175143, 3.2932825924884184]}}

Usage samples

Z-score standardization

We created a sample data set and want to standardize it with the Z-score method.

# python snippet to standardize a sample data set with the z-score method
import numpy as np
input_data = np.random.normal(50, 10, 1000)
input_param = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'standardization', 'type': 'zscore'}
response = preproc(input_param, config_param)

The following figure shows the data distribution before and after the Z-score standardization.

Figure 1. Data distribution before and after Z-score standardization

Min-Max standardization

We created a sample data set and want to standardize it with the Min-Max method.

# python snippet to standardize a sample data set with the min-max method
import numpy as np
input_data = np.random.normal(50, 10, 1000)
input_param = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'standardization', 'type': 'minmax'}
response = preproc(input_param, config_param)

The following figure shows the data distribution before and after the Min-Max standardization.
Figure 2. Data distribution before and after Min-Max standardization

Log1p transformation

We created a sample data set and want to transform it with the Log1p method.

# python snippet to transform a sample data set with the log1p method
import numpy as np
input_data = np.random.uniform(10, 50, 1000)
input_param = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'transformation', 'type': 'log1p'}
response = preproc(input_param, config_param)

The following figure shows the data distribution before and after the Log1p transformation.

Figure 3. Data distribution before and after Log1p transformation

Equal-width binning

We created two sample data sets (one with a normal and one with a uniform distribution) and want to bin them with the equal-width method.

# python snippet to bin a sample data set with the equal-width method
import numpy as np
input_data = np.random.normal(50, 10, 1000)
input_param = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'binning', 'type': 'width', 'bins': 3}
response_norm = preproc(input_param, config_param)

input_data = np.random.uniform(10, 50, 1000)
input_param = {'data': input_data.tolist()}
response_uni = preproc(input_param, config_param)

The following figure shows the normal data distribution before and after the equal-width binning.

Figure 4. Normal data distribution before and after Equal-width binning

The following figure shows the uniform data distribution before and after the equal-width binning.

Figure 5. Uniform data distribution before and after Equal-width binning

Equal-freq binning

We created two sample data sets (one with a normal and one with a uniform distribution) and want to bin them with the equal-freq method.
# python snippet to bin a sample data set with the equal-freq method
import numpy as np
input_data = np.random.normal(50, 10, 1000)
input_param = {'data': input_data.tolist()}
config_param = {'input': 'one', 'method': 'binning', 'type': 'freq', 'bins': 3}
response_norm = preproc(input_param, config_param)

input_data = np.random.uniform(10, 50, 1000)
input_param = {'data': input_data.tolist()}
response_uni = preproc(input_param, config_param)

The following figure shows the normal data distribution before and after the equal-freq binning.

Figure 6. Normal data distribution before and after Equal-freq binning

The following figure shows the uniform data distribution before and after the equal-freq binning.

Figure 7. Uniform data distribution before and after Equal-freq binning

Two inputs standardization

We have two sample data sets (one train and one test) and want to standardize them with the Min-Max method before giving the data to the modeling step.

# python snippet to standardize two sample data sets with the min-max method
import numpy as np
train_data = np.random.uniform(10, 50, 1000)
test_data = np.random.uniform(10, 50, 1000)
input_param = {'data': {'train': train_data.tolist(), 'test': test_data.tolist()}}
config_param = {'input': 'two', 'method': 'standardization', 'type': 'minmax'}
response = preproc(input_param, config_param)

The following figure shows the data distributions before and after the Min-Max standardizations.

Figure 8. Train and test data distributions before and after Min-Max standardizations

©2016 General Electric Company – All rights reserved. GE, the GE Monogram and Predix are trademarks of General Electric Company. No part of this document may be distributed, reproduced or posted without the express written permission of General Electric Company.

11/02/16

THIS DOCUMENT AND ITS CONTENTS ARE PROVIDED "AS IS," WITH NO REPRESENTATION OR WARRANTIES OF ANY KIND, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF DESIGN, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
ALL OTHER LIABILITY ARISING FROM RELIANCE UPON ANY INFORMATION CONTAINED HEREIN IS EXPRESSLY DISCLAIMED.