Data Preprocessing
Compiled By: Umair Yaqub, Lecturer, Govt. Murray College Sialkot

Data Reduction
• Databases/data warehouses may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
• Strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Attribute subset selection
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Data Cube Aggregation
• The lowest level of a data cube (base cuboid)
• Multiple levels of aggregation in data cubes
  - Further reduce the size of data to deal with
• The highest level of a data cube (apex cuboid)
• Reference appropriate levels
  - Use the smallest representation which is enough to solve the task

Attribute Subset Selection
• All attributes may not be relevant to the mining task
• Reduced attributes should result in
  - Less data so faster learning
  - Higher accuracy
  - Simple results
• If behaviour of data is not known, manual selection of useful attributes may be a time consuming task
• Careful selection is required
  - Keep relevant attributes
  - Leave out irrelevant attributes

Attribute Subset Selection (contd…)
• Attribute subset selection is a search problem
• Heuristic methods (due to exponential # of choices), typically greedy:
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction

Dimensionality Reduction
• Data encoding or transformations are applied so as to obtain a reduced or ‘compressed’ representation of original data.
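
As a rough illustration of a "compressed" representation, the sketch below contrasts a lossless encoding, where the original data can be rebuilt exactly, with a lossy one, where only an approximation survives. It is an assumption-laden example rather than anything taken from the slides: Python's standard `zlib` module (DEFLATE, which combines LZ77 with Huffman coding) stands in for a lossless encoder, and the helper names `compress_lossless`, `decompress` and `quantize` are hypothetical.

```python
import zlib


def compress_lossless(text: str) -> bytes:
    """Encode text with DEFLATE; the original can be rebuilt exactly."""
    return zlib.compress(text.encode("utf-8"))


def decompress(blob: bytes) -> str:
    """Invert compress_lossless()."""
    return zlib.decompress(blob).decode("utf-8")


def quantize(values, step):
    """Lossy reduction (illustrative): snap each value to the nearest
    multiple of `step`, so only an approximation of the data is kept."""
    return [round(v / step) * step for v in values]


if __name__ == "__main__":
    # Hypothetical, highly repetitive attribute values.
    text = "red,green,red,red,blue," * 200
    blob = compress_lossless(text)
    print(len(text.encode("utf-8")), "bytes before,", len(blob), "bytes after")
    print("exact round trip:", decompress(blob) == text)
    print("approximated values:", quantize([1.04, 2.46], 0.5))
```

The lossless path trades nothing for its smaller size, while quantization discards detail that cannot be recovered; which is acceptable depends on the mining task.
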
[Figure: Original Data and Compressed Data, labelled "lossless"; Original Data Approximated]

Dimensionality Reduction…
• Data encoding techniques:
  - Huffman algorithm
  - Wavelets
  - Principal components analysis

Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
• Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling
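
The difference between parametric and non-parametric numerosity reduction can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the method from the slides: a least-squares straight line stands in for "some model", an equal-width histogram and a simple random sample without replacement represent the non-parametric families, and the function names `fit_line`, `equal_width_histogram` and `simple_random_sample` are hypothetical.

```python
import random
from collections import Counter


def fit_line(xs, ys):
    """Parametric: keep only (slope, intercept) of a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def equal_width_histogram(values, n_buckets):
    """Non-parametric: keep one (low, high, count) triple per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # guard against all-equal values
    # The maximum value is clamped into the last bucket.
    counts = Counter(min(int((v - lo) / width), n_buckets - 1) for v in values)
    return [(lo + i * width, lo + (i + 1) * width, counts.get(i, 0))
            for i in range(n_buckets)]


def simple_random_sample(rows, k, seed=0):
    """Non-parametric: simple random sample of k rows without replacement."""
    return random.Random(seed).sample(rows, k)


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    print("stored parameters:", fit_line(xs, ys))
    print("histogram:", equal_width_histogram(list(range(1, 11)), 3))
    print("sample:", simple_random_sample(list(range(100)), 5))
```

In the parametric case two numbers replace the whole data set, which only works if the assumed model fits; the histogram and the sample make no such assumption but retain more values.
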
