1. Data Preprocessing in Python
Ahmedul Kabir, TA, CS 548, Spring 2015

2. Preprocessing Techniques Covered
- Standardization and normalization
- Missing value replacement
- Resampling
- Discretization
- Feature selection
- Dimensionality reduction: PCA

3. Python Packages/Tools for Data Mining
Scikit-learn, Orange, Pandas, MLPy, MDP, PyBrain ... and many more.

4. Some Other Basic Packages
- NumPy and SciPy: fundamental packages for scientific computing with Python. They provide powerful n-dimensional array objects, along with linear algebra, random number generation, and other capabilities.
- Pandas: contains useful data structures and algorithms.
- Matplotlib: contains functions for plotting/visualizing data.

5. Standardization and Normalization
Standardization: transform data so that it has zero mean and unit variance. Also called scaling.
- Use the function sklearn.preprocessing.scale()
- Parameters:
  - X: data to be scaled
  - with_mean: Boolean; whether to center the data (make it zero mean)
  - with_std: Boolean; whether to scale to unit standard deviation
Normalization: rescale each individual sample to unit norm. (Note: scaling features to the [0, 1] range is done with sklearn.preprocessing.MinMaxScaler instead.)
- Use the function sklearn.preprocessing.normalize()
- Parameters:
  - X: data to be normalized
  - norm: which norm to use: 'l1' or 'l2'
  - axis: whether to normalize by row or by column

6. Example Code for Standardization/Scaling
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

7. Missing Value Replacement
In scikit-learn, this is referred to as "imputation". The class to use is sklearn.preprocessing.Imputer.
- Important parameters:
  - strategy: what to replace the missing value with: mean / median / most_frequent
  - axis: whether to impute along columns (axis=0) or rows (axis=1)
- Important attribute:
  - statistics_: the fill value computed for each feature
- Important methods:
  - fit(X[, y]): fit the imputer on X
  - transform(X): replace all the missing values in X

8. Example Code for Replacing Missing Values
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.     2.   ]
 [ 6.     3.666...]
 [ 7.     6.   ]]

9. Resampling
Use the function sklearn.utils.resample.
- Important parameters:
  - n_samples: number of samples to keep
  - replace: Boolean; whether to resample with or without replacement
- Returns a sequence of resampled views of the collections; the original arrays are not modified.
Another useful function is sklearn.utils.shuffle.

10. Discretization
Scikit-learn does not have a direct class that performs discretization. It can be performed with the cut and qcut functions available in pandas. Orange has discretization functions in Orange.feature.discretization.

11. Feature Selection
The sklearn.feature_selection module implements feature selection algorithms. Some classes in this module are:
- GenericUnivariateSelect: univariate feature selector based on statistical tests
- SelectKBest: select features according to the k highest scores
- RFE: feature ranking with recursive feature elimination
- VarianceThreshold: feature selector that removes all low-variance features
Scikit-learn does not have a CFS (correlation-based feature selection) implementation, but RFE works in a somewhat similar fashion.

12. Dimensionality Reduction: PCA
The sklearn.decomposition module includes matrix decomposition algorithms, including PCA (the sklearn.decomposition.PCA class).
- Important parameters:
  - n_components: number of components to keep
- Important attributes:
  - components_: the principal components (directions of maximum variance)
  - explained_variance_ratio_: percentage of variance explained by each of the selected components
- Important methods:
  - fit(X[, y]): fit the model with X
  - score_samples(X): return the log-likelihood of each sample
  - transform(X): apply the dimensionality reduction to X

13. Other Useful Information
- Generate a random permutation of the numbers 0, ..., n-1 with numpy.random.permutation(n).
- You can randomly generate toy datasets using the sample generators in sklearn.datasets.
- Scikit-learn does not directly handle categorical/nominal attributes well. To use them in a dataset, some sort of encoding needs to be performed. One good way to encode a categorical attribute: if there are n categories, create n dummy binary variables, one per category. This can be done easily with the sklearn.preprocessing.OneHotEncoder class.

14. References
- Preprocessing modules: http://scikit-learn.org/stable/modules/preprocessing.html
- Video tutorial: http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107
- Quick start tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
- User guide: http://scikit-learn.org/stable/user_guide.html
- API reference: http://scikit-learn.org/stable/modules/classes.html
- Example gallery: http://scikit-learn.org/stable/auto_examples/index.html
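Slide 5 describes normalization but, unlike standardization, gives no example. Here is a minimal sketch (assuming scikit-learn and NumPy are installed; the data is made up for illustration) of both sklearn.preprocessing.normalize, which rescales each row to unit norm, and MinMaxScaler, which scales each feature to [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler

X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])

# normalize() works per sample: each ROW is rescaled to unit l2 norm
X_l2 = normalize(X, norm='l2')
row_norms = np.linalg.norm(X_l2, axis=1)  # every entry is 1.0

# MinMaxScaler works per feature: each COLUMN is rescaled to [0, 1]
X_01 = MinMaxScaler().fit_transform(X)
```

Choosing between the two matters: normalize() is the right tool when only the direction of each sample is meaningful (e.g. cosine similarity), while MinMaxScaler is what the slide's "[0, 1] range" description corresponds to.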
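The resampling utilities from slide 9 can be sketched as follows (a minimal example with made-up data; assumes scikit-learn is installed). Passing X and y together keeps their rows aligned:

```python
import numpy as np
from sklearn.utils import resample, shuffle

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])

# Bootstrap sample: 5 rows drawn WITH replacement, X and y kept in sync
X_boot, y_boot = resample(X, y, replace=True, n_samples=5, random_state=0)

# shuffle() permutes the rows of X and y consistently
X_shuf, y_shuf = shuffle(X, y, random_state=0)
```

The original arrays X and y are left untouched, as the slide notes.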
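Slide 10 mentions pandas cut and qcut for discretization without showing them. A minimal sketch (assuming pandas is installed; the age values are invented for illustration):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 35, 48, 62, 71])

# Equal-WIDTH binning: the range [3, 71] is split into 3 equal intervals
width_bins = pd.cut(ages, bins=3, labels=['young', 'mid', 'old'])

# Equal-FREQUENCY binning: bin edges are placed at quantiles instead
freq_bins = pd.qcut(ages, q=3, labels=['low', 'mid', 'high'])
```

cut is appropriate when the interval width is meaningful; qcut when each bin should hold roughly the same number of samples.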
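Two of the slide 11 feature selectors can be combined in a short sketch (toy data invented for illustration; assumes scikit-learn is installed). VarianceThreshold drops the constant column, then SelectKBest keeps the feature that best discriminates the classes:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 2.0],
              [4.0, 0.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Remove the middle column, which has zero variance
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the single feature with the highest ANOVA F-score against y
X_best = SelectKBest(score_func=f_classif, k=1).fit_transform(X_var, y)
```

Here the first column wins: it separates the two classes perfectly, so its between-class variance dwarfs its within-class variance.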
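The PCA class from slide 12 can be exercised on synthetic data where one direction dominates the variance (a minimal sketch; assumes scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 100 points lying close to a line in 3-D, plus small isotropic noise,
# so the first principal component should capture almost all variance
X = rng.normal(size=(100, 1)) * [3.0, 2.0, 1.0] \
    + rng.normal(scale=0.1, size=(100, 3))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # project 3-D data onto 2 components
```

Inspecting pca.explained_variance_ratio_ after fitting is the usual way to decide how many components are worth keeping.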
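The dummy-variable encoding described in slide 13 can be sketched with OneHotEncoder (a minimal example; assumes scikit-learn is installed, and uses integer-coded categories, which work across old and new scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with 3 categories, coded as integers 0, 1, 2
colors = np.array([[0], [1], [2], [1]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify to inspect it
onehot = enc.fit_transform(colors).toarray()
```

Each input row becomes a row with exactly one 1: category 0 maps to [1, 0, 0], category 1 to [0, 1, 0], and so on.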