Data Preprocessing

DATA PREPARATION: Preprocessing
 Data cleaning
 Data discretization
 Binning
 Clustering
 Binarization
 Data integration
 Aggregation
 Smoothing
DATA PREPARATION: Preprocessing
 Data reduction
 Sampling
 Dimensionality reduction
 Feature subset selection
 Feature creation
 Data transformation
 Variable transformation
 Scaling
 Sorting
DATA PREPARATION: Preprocessing
 Mathematical computation
 Normalization
 Stationarity
 Statistical computation
 Mean
 Median
 Mode
 Midrange
 Variance, standard deviation, range
 Weighted mean
DATA PREPARATION: Data cleaning
 Missing data
 Improper data
 Detection and handling of outliers
 Handling noise
DATA PREPARATION: Data cleaning
 Need to determine how to handle missing values
 Why are the values missing?
 Is there significance in the fact that particular values are missing?
 Need to determine how to handle inaccurate values
 How to identify and handle outliers
 Typographic errors (e.g. transposition errors)
 Measurement errors
 Duplicate values
 Noise
DATA PREPARATION: Data cleaning
 Need to determine how to handle irrelevant data
 Why is the data deemed irrelevant? Is the data truly irrelevant?
 Need to determine how to handle data timeliness
 How important is the age of the data?
DATA PREPARATION: Data Cleaning
 Possible handling methods
 Eliminate data instances
 Eliminate data attributes
 Estimate missing values – interpolation
 Ignore missing values during analysis
 Identifying inconsistent values during collection
 Check digits
 Smoothing data
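One of the handling methods listed above is estimating missing values by interpolation. A minimal sketch, assuming a one-dimensional, evenly spaced series in which `None` marks a missing value; the function name `interpolate_missing` is hypothetical, and gaps at either end are left unfilled because they have no neighbor on one side.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Leading and trailing gaps are left as None (nothing to interpolate from).
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of consecutive known positions and fill between them.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


readings = [10.0, None, None, 16.0, None, 20.0]
print(interpolate_missing(readings))
```

Whether interpolation is appropriate depends on why the values are missing; if the absence itself carries meaning, filling the gap hides that information.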
DATA PREPARATION: Data Cleaning
 Possible handling methods
 Remove duplicate data
 Careful: Are duplicate instances errors or are they separate instances with identical values? Machine learning tools will give different results for repeated data.
 Remove irrelevant data
 Remove dated data
 Weight data by data age
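Two of the methods above, removing duplicate data and weighting data by age, can be sketched as follows. This assumes records are hashable tuples and that exact equality defines a duplicate. The exponential decay with a `half_life` parameter is only one possible weighting scheme, chosen here for illustration; both function names are hypothetical.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def age_weights(ages, half_life):
    """Assumed scheme: a record that is `half_life` units old gets half the weight."""
    return [0.5 ** (age / half_life) for age in ages]


records = [(1, "a"), (2, "b"), (1, "a")]
print(drop_exact_duplicates(records))
print(age_weights([0, 2, 4], half_life=2))
```

Before dropping duplicates, check whether identical rows are genuine repeated observations; if they are, removing them changes the data distribution.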
DATA PREPARATION: Discretization
 Binning
 Equal-frequency interval binning
 Equal-width interval binning
 Clustering
 K-means clustering
 Hierarchical methods
 Binarization
 Entropy-based discretization
 Discretization of multiple variables
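The two binning schemes named above differ in what they hold constant: equal-width binning fixes the interval size, while equal-frequency binning fixes the number of values per bin. A minimal sketch, assuming numeric data with at least two distinct values; the function names are hypothetical, and the equal-frequency version assigns bins by rank, so tied values may land in different bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would compute to bin index k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins by rank so each bin holds roughly len(values) / k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(data, 3))      # the outlier pushes everything else into bin 0
print(equal_frequency_bins(data, 3))  # three values per bin
```

The example data includes one outlier to show why equal-width binning can leave most bins empty on skewed data.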
DATA PREPARATION: Data integration
 Aggregation
 Smoothing data
 Averaging data
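Aggregation and smoothing by averaging can be illustrated with a grouped mean and a moving average. This is a sketch under assumptions: the input is a list of `(key, value)` pairs for aggregation and a plain numeric list for smoothing, and the names `aggregate_mean` and `moving_average` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_mean(pairs):
    """Collapse (key, value) pairs into one mean value per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: mean(vals) for key, vals in groups.items()}


def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


sales = [("mon", 2), ("mon", 4), ("tue", 10)]
print(aggregate_mean(sales))
print(moving_average([1, 2, 3, 4, 5], window=3))
```

Both operations trade detail for stability: the output has fewer data points than the input, and short-lived fluctuations are averaged away.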
DATA PREPARATION: Data reduction
 Sampling
 Why sample
 Sampling techniques
 Simple random sampling
 With replacement
 Without replacement
 Stratified sampling
 Sample size
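The sampling techniques above map directly onto Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The stratified helper below is a hypothetical sketch that samples the same fraction from each stratum and assumes every stratum should contribute at least one record.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the draw is repeatable
population = list(range(100))

without = rng.sample(population, k=10)     # simple random, without replacement
with_repl = rng.choices(population, k=10)  # simple random, with replacement


def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from every stratum defined by key(record)."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


records = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(len(picked))
```

Stratifying preserves the 80/20 class proportions in the sample, which a small simple random sample does not guarantee.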
DATA PREPARATION: Data reduction
 Dimensionality reduction
 Curse of dimensionality
 Projection into lower dimensions
 Principal components analysis
 Singular value decomposition
 Feature subset selection
 Remove redundant features
 Remove irrelevant features
 Feature creation
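Removing redundant features can be approximated with a pairwise correlation filter. This is only one heuristic among many and is not stated in the slides: it assumes numeric, non-constant feature columns and treats a feature as redundant when its absolute Pearson correlation with an already kept feature reaches a threshold. The names `pearson`, `drop_redundant`, and the 0.95 threshold are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # fails on constant columns (zero spread)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return list(kept)


features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # same information, different unit
    "age": [30, 22, 41, 35],
}
print(drop_redundant(features))
```

The filter only detects linear redundancy between pairs of features; identifying irrelevant features requires comparing against the target variable instead.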
DATA PREPARATION: Data transformation
 Variable transformation
 Scaling
 Sorting
 Normalization
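Two common ways to scale or normalize a variable are min-max scaling to the interval [0, 1] and z-score standardization. The slides do not say which variant is intended, so both are sketched here; the function names are hypothetical, and both assume the values are not all identical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(min_max_scale([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; z-scores are less so but do not bound the result.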
DATA PREPARATION: Mathematical Computation
 Normalization
 Stationarity
 Time series – mean and variance are constant
 Statistical computation
 Mean
 Median
 Mode
 Midrange
 Variance, standard deviation, range
 Weighted mean
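The statistics listed above are available in, or easily derived from, Python's standard `statistics` module. The sketch below uses the population forms of variance and standard deviation (`pvariance`, `pstdev`); the sample forms (`variance`, `stdev`) divide by n - 1 instead. The helper `weighted_mean` and the sample data are illustrative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_mean = statistics.mean(data)        # 5
data_median = statistics.median(data)    # 4.5
data_mode = statistics.mode(data)        # 4
midrange = (min(data) + max(data)) / 2   # 5.5
value_range = max(data) - min(data)      # 7
variance = statistics.pvariance(data)    # 4
std_dev = statistics.pstdev(data)        # 2.0


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# An exam weighted three times as heavily as a quiz.
print(weighted_mean([80, 90], weights=[1, 3]))
```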
DATA PREPARATION: Measures of Similarity and Dissimilarity
 Euclidean distance
 Direction cosines
 Simple matching and Jaccard coefficients
 Tanimoto measure (set similarity)
 Hamming distance
 Edit distance
 Probability-based distances
 Mahalanobis distance
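Several of the measures above are short enough to write out directly. The sketch below uses textbook definitions and assumes equal-length inputs, with 0/1 vectors for the matching coefficients. Cosine similarity stands in for the direction-cosine measure, the Jaccard function assumes at least one 1 in either vector, and Mahalanobis distance is omitted because it needs a covariance matrix inverse. For sets, the Tanimoto measure is commonly computed as intersection size over union size, which coincides with Jaccard.

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def simple_matching(x, y):
    """Fraction of positions that agree (0-0 and 1-1 matches both count)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """Like simple matching, but 0-0 agreements are ignored."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either


def hamming(x, y):
    """Number of positions at which the sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


x, y = [1, 0, 0, 1, 0], [1, 0, 1, 0, 0]
print(simple_matching(x, y), jaccard(x, y), hamming(x, y))
print(edit_distance("kitten", "sitting"))
```

The same pair of binary vectors gives a simple matching coefficient of 0.6 but a Jaccard coefficient of 1/3, which shows how much the treatment of shared zeros matters for sparse data.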
DATA PREPARATION: Choosing Similarity Measures
 Problem specific
 Data dependent
 Domain knowledge
 Purpose
 Metric properties?
 Positivity
 Symmetry
 Triangle Inequality
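The three metric properties can be spot-checked numerically for a candidate distance function. The sketch below tests them over a finite set of sample points, so a pass is evidence rather than proof, while a failure is a genuine counterexample. The helper name `check_metric_properties` and the tolerance are assumptions; `math.dist` (Python 3.8+) provides Euclidean distance.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def check_metric_properties(d, points, tol=1e-12):
    """Test positivity, symmetry, and the triangle inequality on sample points."""
    pairs = list(product(points, repeat=2))
    return {
        # d(p, q) >= 0, and d(p, q) == 0 exactly when p == q
        "positivity": all(d(p, q) >= 0 and ((d(p, q) <= tol) == (p == q))
                          for p, q in pairs),
        # d(p, q) == d(q, p)
        "symmetry": all(abs(d(p, q) - d(q, p)) <= tol for p, q in pairs),
        # d(p, r) <= d(p, q) + d(q, r)
        "triangle_inequality": all(d(p, r) <= d(p, q) + d(q, r) + tol
                                   for p, q, r in product(points, repeat=3)),
    }


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


points = [(0, 0), (1, 0), (2, 0)]
print(check_metric_properties(dist, points))
print(check_metric_properties(squared_euclidean, points))
```

Squared Euclidean distance fails the triangle inequality on these points (4 > 1 + 1), so it is a dissimilarity measure but not a metric.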