Data Preprocessing
Compiled By:
Umair Yaqub
Govt. Murray College Sialkot
Data Reduction
 Databases and data warehouses may store terabytes of data, so complex data analysis or mining may take a very long time to run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Strategies
 Data cube aggregation
 Dimensionality reduction
 Attribute subset selection
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation
 The lowest level of a data cube (base cuboid) holds the least-aggregated data
 Multiple levels of aggregation in data cubes
 Each higher level further reduces the size of the data to deal with
 The highest level of a data cube (apex cuboid) holds a single overall aggregate
 Reference appropriate levels
 Use the smallest representation that is enough to solve the task
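The roll-up from base cuboid to apex cuboid can be sketched as follows. This is only an illustration: the sales rows, the column layout (year, quarter, branch, sales) and the helper name roll_up are assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows: (year, quarter, branch, sales).
base_cuboid = [
    (2023, "Q1", "A", 100), (2023, "Q2", "A", 120),
    (2023, "Q1", "B", 80),  (2023, "Q2", "B", 90),
    (2024, "Q1", "A", 110), (2024, "Q1", "B", 95),
]

def roll_up(rows, keep):
    """Sum the sales measure over every dimension not listed in `keep`.

    `keep` holds the column indexes to retain (0=year, 1=quarter, 2=branch).
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)

by_year_branch = roll_up(base_cuboid, keep=(0, 2))  # quarter aggregated away
by_year = roll_up(base_cuboid, keep=(0,))           # branch aggregated away too
apex = roll_up(base_cuboid, keep=())                # apex cuboid: one grand total

print(len(base_cuboid), len(by_year_branch), len(by_year), len(apex))
print(by_year)
print(apex)
```

If the mining task only needs annual sales, by_year is the smallest representation that is enough, and the six base rows never have to be scanned again.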
Attribute Subset Selection
 Not all attributes may be relevant to the mining task
 A reduced set of attributes should result in
 Less data, so faster learning
 Higher accuracy
 Simpler results
 If the behaviour of the data is not known, manual selection of useful attributes may be a time-consuming task
 Careful selection is required
 Keep relevant attributes
 Leave out irrelevant attributes
Attribute Subset Selection (contd…)
 Attribute subset selection is a search problem
 Heuristic methods (needed because the number of possible subsets is exponential), typically greedy:
 Step-wise forward selection
 Step-wise backward elimination
 Combining forward selection and backward elimination
 Decision-tree induction
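A minimal sketch of greedy step-wise forward selection is given below. The scoring function is an assumption: in practice it would be something like cross-validated accuracy or a statistical significance test, while here toy_score and the RELEVANCE values are made up purely so that the example runs on its own.

```python
from typing import Callable, FrozenSet, List, Sequence

def forward_selection(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedy step-wise forward selection.

    Start from the empty set and repeatedly add the single attribute that
    raises `score` the most; stop when no attribute improves the score by
    more than `min_gain`.
    """
    selected: List[str] = []
    remaining = list(attributes)
    best = score(frozenset())
    while remaining:
        top_attr = max(remaining, key=lambda a: score(frozenset(selected + [a])))
        top_score = score(frozenset(selected + [top_attr]))
        if top_score - best <= min_gain:
            break  # no remaining attribute helps, so stop searching
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Toy, made-up relevance values; a real score would be computed from the data.
RELEVANCE = {"age": 0.5, "income": 0.3, "customer_id": 0.0, "zip_code": 0.01}
COST_PER_ATTRIBUTE = 0.05  # small penalty so useless attributes are rejected

def toy_score(subset: FrozenSet[str]) -> float:
    return sum(RELEVANCE[a] for a in subset) - COST_PER_ATTRIBUTE * len(subset)

chosen = forward_selection(list(RELEVANCE), toy_score)
print(chosen)
```

Each pass evaluates at most one candidate per remaining attribute, so the search costs on the order of n squared score evaluations for n attributes instead of the 2 to the power n needed to try every subset. Backward elimination follows the same pattern but starts from the full set and removes the least useful attribute at each step.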
Dimensionality Reduction
 Data encoding or transformations are applied to obtain a reduced or ‘compressed’ representation of the original data.
Dimensionality Reduction (contd…)
 Data encoding techniques:
 Huffman algorithm
 Principal components analysis
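As an illustration of principal components analysis, the sketch below reduces two-dimensional points to one dimension using only the standard library. It relies on the closed-form principal-axis angle of a 2x2 covariance matrix, so it is limited to the two-attribute case; the function names and the sample points are assumptions made for the example.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component.

    Returns (mean, axis, scores): the data mean, the unit vector of the
    first principal component, and one coordinate per point along it.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the eigenvector with the largest eigenvalue (2x2 case only).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return (mx, my), axis, scores

def reconstruct(mean, axis, scores):
    """Map 1-D scores back to approximate 2-D points (lossy in general)."""
    return [(mean[0] + s * axis[0], mean[1] + s * axis[1]) for s in scores]

# Made-up points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
mean, axis, scores = pca_2d_to_1d(points)
approx = reconstruct(mean, axis, scores)
print(axis)
print(scores)
print(approx)
```

Only the mean, the axis and one number per point are stored instead of two numbers per point. The reconstruction is exact when the points lie on a straight line and approximate otherwise, which is why PCA is usually treated as a lossy reduction.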
Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data
 Parametric methods
 Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
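The two approaches can be contrasted in a short sketch. It assumes a straight-line model for the parametric case and shows an equal-width histogram and a simple random sample without replacement for the non-parametric case; the function names, the synthetic data, the bin count and the sample size are all illustrative choices.

```python
import random

def fit_line(xs, ys):
    """Parametric: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def equal_width_histogram(values, bins):
    """Non-parametric: keep only bucket counts, not the values themselves."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against all-equal values
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[i] += 1
    return lo, width, counts

def simple_random_sample(values, k, seed=0):
    """Non-parametric: simple random sample without replacement."""
    return random.Random(seed).sample(values, k)

# Synthetic data following y = 3x + 5 exactly (made up for the example).
xs = list(range(100))
ys = [3 * x + 5 for x in xs]

slope, intercept = fit_line(xs, ys)               # 2 numbers replace 100 values
lo, width, counts = equal_width_histogram(ys, 4)  # 4 counts replace 100 values
sample = simple_random_sample(ys, 10)             # 10 values replace 100 values

print(slope, intercept)
print(counts)
print(sample)
```

The fitted line is only a faithful summary because the data really is linear here; the histogram and the sample make no such assumption, at the price of storing more than two numbers.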