DATA PREPARATION: Preprocessing
  Data cleaning
  Data discretization
    Binning
    Clustering
    Binarization
  Data integration
    Aggregation
    Smoothing

DATA PREPARATION: Preprocessing
  Data reduction
    Sampling
    Dimensionality reduction
    Feature subset selection
    Feature creation
  Data transformation
    Variable transformation
    Scaling
    Sorting

DATA PREPARATION: Preprocessing
  Mathematical computation
    Normalization
    Stationarity
    Statistical computation
      Mean
      Median
      Mode
      Midrange
      Variance, standard deviation, range
      Weighted mean

DATA PREPARATION: Data cleaning
  Missing data
  Improper data
  Detection and handling of outliers
  Handling noise

DATA PREPARATION: Data cleaning
  Need to determine how to handle missing values
    Why are the values missing?
    Is there significance in the fact that particular values are missing?
  Need to determine how to handle inaccurate values
    How to identify and handle outliers
    Typographic errors (e.g. transposition errors)
    Measurement errors
    Duplicate values
    Noise

DATA PREPARATION: Data cleaning
  Need to determine how to handle irrelevant data
    Why is the data deemed irrelevant?
    Is the data truly irrelevant?
  Need to determine how to handle data timeliness
    How important is the age of the data?

DATA PREPARATION: Data cleaning
  Possible handling methods
    Eliminate data instances
    Eliminate data attributes
    Estimate missing values (interpolation)
    Ignore missing values during analysis
    Identify inconsistent values during collection
      Check digits
    Smooth the data

DATA PREPARATION: Data cleaning
  Possible handling methods
    Remove duplicate data
      Careful: are duplicate instances errors, or are they separate instances with identical values?
      Machine learning tools will give different results for repeated data.
    Remove irrelevant data
    Remove dated data
    Weight data by data age

DATA PREPARATION: Discretization
  Binning
    Equal-frequency interval binning
    Equal-width interval binning
  Clustering
    K-means clustering
    Hierarchical methods
  Binarization
  Entropy-based discretization
  Discretization of multiple variables

DATA PREPARATION: Data integration
  Aggregation
  Smoothing data
  Averaging data

DATA PREPARATION: Data reduction
  Sampling
    Why sample
    Sampling techniques
      Simple random sampling
        With replacement
        Without replacement
      Stratified sampling
    Sample size

DATA PREPARATION: Data reduction
  Dimensionality reduction
    Curse of dimensionality
    Projection into lower dimensions
      Principal components analysis
      Singular value decomposition
  Feature subset selection
    Remove redundant features
    Remove irrelevant features
  Feature creation

DATA PREPARATION: Data transformation
  Variable transformation
  Scaling
  Sorting
  Normalization

DATA PREPARATION: Mathematical computation
  Normalization
  Stationarity
    Time series: mean and variance are constant
  Statistical computation
    Mean
    Median
    Mode
    Midrange
    Variance, standard deviation, range
    Weighted mean

DATA PREPARATION: Measures of similarity and dissimilarity
  Euclidean distance
  Direction cosines
  Simple matching and Jaccard coefficients
  Tanimoto measure (set similarity)
  Hamming distance
  Edit distance
  Probability-based distances
  Mahalanobis distance

DATA PREPARATION: Choosing similarity measures
  Problem specific
  Data dependent
  Domain knowledge
  Purpose
  Metric properties?
    Positivity
    Symmetry
    Triangle inequality
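
To make a few of the similarity and dissimilarity measures listed above concrete, here is a minimal Python sketch using only the standard library. The function names are illustrative and do not come from the slides. The edit distance shown is the Levenshtein variant, which is one common choice among several. The helper `looks_like_metric` is a hypothetical utility that tests positivity, symmetry, and the triangle inequality on a finite sample of points, so it can refute the metric properties but cannot prove them in general.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def direction_cosine(x, y):
    """Cosine of the angle between two non-zero vectors (a similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def simple_matching(x, y):
    """Fraction of positions where two binary vectors agree (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """1-1 matches divided by positions where at least one vector is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def looks_like_metric(dist, points, tol=1e-9):
    """Check metric properties of `dist` on a finite sample of points."""
    for p in points:
        if abs(dist(p, p)) > tol:              # d(p, p) must be 0
            return False
        for q in points:
            d_pq = dist(p, q)
            if d_pq < -tol:                    # positivity: never negative
                return False
            if p != q and d_pq <= tol:         # positivity: 0 only if p == q
                return False
            if abs(d_pq - dist(q, p)) > tol:   # symmetry
                return False
            for r in points:                   # triangle inequality
                if d_pq > dist(p, r) + dist(r, q) + tol:
                    return False
    return True
```

For example, `looks_like_metric(euclidean, [(0, 0), (3, 4), (6, 8)])` passes on that sample, whereas the squared Euclidean distance fails the triangle inequality on the collinear points `(0,)`, `(1,)`, `(2,)`. This is one reason the "Metric properties?" question matters when choosing a measure.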