```
DATA PREPARATION: Preprocessing
- Data cleaning
- Data discretization
  - Binning
  - Clustering
  - Binarization
- Data integration
  - Aggregation
  - Smoothing
DATA PREPARATION: Preprocessing
- Data reduction
  - Sampling
  - Dimensionality reduction
  - Feature subset selection
  - Feature creation
- Data transformation
  - Variable transformation
  - Scaling
  - Sorting
DATA PREPARATION: Preprocessing
- Mathematical computation
  - Normalization
  - Stationarity
  - Statistical computation
    - Mean
    - Median
    - Mode
    - Midrange
    - Variance, standard deviation, range
    - Weighted mean
DATA PREPARATION: Data cleaning
- Missing data
- Improper data
- Detection and handling of outliers
- Handling noise
DATA PREPARATION: Data cleaning
- Need to determine how to handle missing values
  - Why are the values missing?
  - Is there significance in the fact that particular values are missing?
- Need to determine how to handle inaccurate values
  - How to identify and handle outliers
  - Typographic errors (e.g. transposition errors)
  - Measurement errors
  - Duplicate values
  - Noise
DATA PREPARATION: Data cleaning
- Need to determine how to handle irrelevant data
  - Why is the data deemed irrelevant? Is the data truly irrelevant?
- Need to determine how to handle data timeliness
  - How important is the age of the data?
DATA PREPARATION: Data cleaning
- Possible handling methods
  - Eliminate data instances
  - Eliminate data attributes
  - Estimate missing values – interpolation
  - Ignore missing values during analysis
  - Identify inconsistent values during collection
    - Check digits
  - Smooth data
DATA PREPARATION: Data cleaning
- Possible handling methods
  - Remove duplicate data
    - Careful: Are duplicate instances errors or are they separate instances with identical values? Machine learning tools will give different results for repeated data.
  - Remove irrelevant data
  - Remove dated data
  - Weight data by data age
DATA PREPARATION: Discretization
- Binning
  - Equal-frequency interval binning
  - Equal-width interval binning
- Clustering
  - K-means clustering
  - Hierarchical methods
- Binarization
- Entropy-based discretization
- Discretization of multiple variables
DATA PREPARATION: Data integration
- Aggregation
- Smoothing data
- Averaging data
DATA PREPARATION: Data reduction
- Sampling
  - Why sample
  - Sampling techniques
    - Simple random sampling
      - With replacement
      - Without replacement
    - Stratified sampling
  - Sample size
DATA PREPARATION: Data reduction
- Dimensionality reduction
  - Curse of dimensionality
  - Projection into lower dimensions
    - Principal components analysis
    - Singular value decomposition
- Feature subset selection
  - Remove redundant features
  - Remove irrelevant features
- Feature creation
DATA PREPARATION: Data transformation
- Variable transformation
- Scaling
- Sorting
- Normalization
DATA PREPARATION: Mathematical Computation
- Normalization
- Stationarity
  - Time series – mean and variance are constant
- Statistical computation
  - Mean
  - Median
  - Mode
  - Midrange
  - Variance, standard deviation, range
  - Weighted mean
DATA PREPARATION: Measures of Similarity and Dissimilarity
- Euclidean distance
- Direction cosines
- Simple matching and Jaccard coefficients
- Tanimoto measure (set similarity)
- Hamming distance
- Edit distance
- Probability-based distances
- Mahalanobis distance
DATA PREPARATION: Choosing Similarity Measures
- Problem specific
- Data dependent
- Domain knowledge
- Purpose
- Metric properties?
  - Positivity
  - Symmetry
  - Triangle inequality
```
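
The Discretization slide names equal-width and equal-frequency interval binning without showing how they differ. The following is a minimal sketch, assuming numeric values and a chosen number of bins `k`. The function names and the sample data are illustrative and do not come from the slides.

```python
from typing import List


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in the first bin.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would compute to bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins holding roughly equal counts.

    Assumption of this sketch: ties are broken by position, so equal
    values can end up in neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With an outlier in the data, equal-width binning leaves the middle bin empty, while equal-frequency binning still spreads the instances evenly. This is one reason the choice of binning method interacts with outlier handling.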
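
The sampling techniques listed under Data reduction can be sketched with Python's standard `random` module. This is an illustration under stated assumptions: the helper names, the `fraction` parameter, and the rule of keeping at least one instance per stratum are choices made for this example, not something the slides prescribe.

```python
import random
from collections import defaultdict


def sample_with_replacement(items, n, rng):
    """Simple random sampling; the same instance may be drawn repeatedly."""
    return rng.choices(items, k=n)


def sample_without_replacement(items, n, rng):
    """Simple random sampling; each instance is drawn at most once."""
    return rng.sample(items, n)


def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction from every stratum defined by key(item)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    chosen = []
    for group in strata.values():
        # Assumption: keep at least one instance so rare strata survive.
        n = max(1, round(len(group) * fraction))
        chosen.extend(rng.sample(group, n))
    return chosen


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical records: 90 of class "a" and 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(sum(1 for r in picked if r[0] == "a"), "of class a")
print(sum(1 for r in picked if r[0] == "b"), "of class b")
```

Stratified sampling keeps the 90/10 class proportions in the sample, which simple random sampling only achieves on average.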
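
The statistical computations named on the Mathematical Computation slide map closely onto Python's standard `statistics` module. The sketch below assumes population (not sample) variance and min-max scaling as the normalization. Both are choices made here for illustration, since the slides do not say which variants are meant.

```python
import statistics


def midrange(xs):
    """Midpoint between the smallest and largest value."""
    return (min(xs) + max(xs)) / 2


def weighted_mean(xs, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


def min_max_normalize(xs):
    """Rescale values linearly onto [0, 1]; assumes max(xs) > min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("mean     ", statistics.mean(data))       # 5
print("median   ", statistics.median(data))     # 4.5
print("mode     ", statistics.mode(data))       # 4
print("midrange ", midrange(data))              # 5.5
print("variance ", statistics.pvariance(data))  # population variance, 4
print("std dev  ", statistics.pstdev(data))     # 2.0
print("range    ", max(data) - min(data))       # 7
# Weights could, for example, express data age (newer data weighted higher).
print("weighted ", weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
print("scaled   ", min_max_normalize([2, 4, 6]))         # [0.0, 0.5, 1.0]
```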
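
Several of the measures listed under Similarity and Dissimilarity are short enough to write out directly. The sketch below covers Euclidean distance, the simple matching and Jaccard coefficients for binary vectors, Hamming distance, and edit (Levenshtein) distance. The function names and example inputs are illustrative, and returning 0.0 from `jaccard` when neither vector has a 1 is an assumed convention.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching(a, b):
    """Fraction of positions where two binary vectors agree (0-0 and 1-1)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Like simple matching, but 0-0 agreements are ignored."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0  # assumed convention for all-zero


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(euclidean((0, 0), (3, 4)))           # 5.0
print(simple_matching(p, q))               # 0.7
print(jaccard(p, q))                       # 0.0
print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3

# Spot check of the metric properties on three sample points.
x, y, z = (0, 0), (3, 4), (6, 0)
assert euclidean(x, y) >= 0 and euclidean(x, x) == 0            # positivity
assert euclidean(x, y) == euclidean(y, x)                       # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)     # triangle inequality
```

The pair `p`, `q` shows why the choice is data dependent: the vectors share no 1s, so the Jaccard coefficient is 0, yet simple matching reports 0.7 because it counts the many shared 0s as agreement.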