Data Preprocessing
Compiled By: Umair Yaqub, Lecturer, Govt. Murray College Sialkot

Data Reduction
- Databases and data warehouses may store terabytes of data; complex data analysis or mining may take a very long time to run on the complete data set.
- Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
- Strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Attribute subset selection
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube is the base cuboid.
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
- The highest level of a data cube is the apex cuboid.
- Reference appropriate levels: use the smallest representation that is enough to solve the task.

Attribute Subset Selection
- Not all attributes may be relevant to the mining task.
- Reducing the attributes should result in:
  - Less data, so faster learning
  - Higher accuracy
  - Simpler results
- If the behaviour of the data is not known, manual selection of useful attributes may be a time-consuming task.
- Careful selection is required:
  - Keep relevant attributes
  - Leave out irrelevant attributes

Attribute Subset Selection (contd…)
- Attribute subset selection is a search problem.
- Heuristic methods (due to the exponential number of choices), typically greedy:
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction

Dimensionality Reduction
- Data encoding or transformations are applied so as to obtain a reduced or ‘compressed’ representation of the original data.
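As one illustration of a lossless encoding of this kind, the sketch below builds a Huffman code for a short string and checks that decoding recovers the input exactly. It is a minimal example, not a production compressor: the function names (`build_huffman_codes`, `encode`, `decode`) and the sample string are assumptions made for illustration, and the bit stream is kept as a Python string of "0"/"1" characters for readability.

```python
import heapq
from collections import Counter


def build_huffman_codes(text):
    """Return a dict mapping each symbol in `text` to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code of every symbol in `text`."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Invert `encode`; works because no code is a prefix of another."""
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


# Hypothetical sample input, chosen only for illustration.
sample = "data reduction reduces data"
codes = build_huffman_codes(sample)
bits = encode(sample, codes)
restored = decode(bits, codes)

print(len(bits), "bits versus", 8 * len(sample), "bits at 8 bits per character")
print("lossless round trip:", restored == sample)
```

Frequent symbols receive shorter codes, so the encoded stream is smaller than a fixed-width encoding, while the round trip remains exact. A lossy method would instead return only an approximation of the original data.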
[Figure: Original Data, Compressed Data, Original Data Approximated; arrow label: lossless]

Dimensionality Reduction…
- Data encoding techniques:
  - Huffman algorithm
  - Wavelets
  - Principal components analysis

Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation.
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
- Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.
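The numerosity reduction ideas above can be sketched on a toy data set. The example below is illustrative only: the data are synthetic (a noisy straight line generated with a fixed seed), the linear model is just one possible parametric choice, and the helper names `fit_line` and `equal_width_histogram` are assumptions, not part of the original slides. Clustering is omitted to keep the sketch short.

```python
import random


def fit_line(xs, ys):
    """Parametric reduction: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def equal_width_histogram(values, num_buckets):
    """Non-parametric reduction: keep only the per-bucket counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts


# Synthetic data for illustration only: y = 2x + 5 plus Gaussian noise.
rng = random.Random(0)
xs = list(range(1000))
ys = [2.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

# Parametric: 1000 points are replaced by two stored parameters.
slope, intercept = fit_line(xs, ys)

# Non-parametric: 1000 values are replaced by 10 bucket counts.
lo, width, counts = equal_width_histogram(ys, 10)

# Non-parametric: simple random sample without replacement of 50 values.
sample = rng.sample(ys, 50)

print("model parameters:", slope, intercept)
print("histogram counts:", counts)
print("sample size:", len(sample))
```

Each approach stores far fewer numbers than the original 1000 values. The parametric route is the most compact but is only as good as the assumed model, whereas the histogram and the sample make no such assumption.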