EXPERIMENT NO: 1
AIM: Two tutorials:
a) Solving exercises in Data Exploration
b) Solving exercises in Data Preprocessing
THEORY:
DATA EXPLORATION:
Data exploration is a common process in data warehouses, which are characterized by large bulks of
data coming from disparate systems. Since these systems hold data in different formats and come
from different sources, when the data converge in a central data warehouse it can be very difficult
to extract the relevant data needed for statistical reporting as well as for trend and pattern spotting.
Data exploration helps a data consumer focus an information search on the pertinent aspects of the
relevant data before true analysis can be achieved. In large data sets, data is not gathered or
controlled in a focused manner. Even in smaller data sets, data gathered without a rigid and
specific technique can end up disorganized and scattered across a myriad of subsets.
In most cases, without a set of techniques, narrowing an information search may cause several
problems, because one may lose important perspectives on the relevant data among the myriad
sets of unrelated data.
There are generally two methodologies for getting relevant data from huge data sources or sets:
manual and automatic techniques. They are more commonly known as data mining for the
automatic approach and data exploration for the manual one. Although they are categorized as
such, these terms are not rigorously defined in the real IT sense.
Data mining, along with its near relative, data prospecting, has a wide variety of usages and is
considered by many a much-abused term in everyday usage. Some people consider it synonymous
with data analysis, although many believe the two are technically different.
Data mining is a methodology commonly used on very large datasets; in fact, it is often applied to
the entire databases behind a data warehouse. A common definition of data mining is "the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data" or "the
science of extracting useful information from large data sets or databases". Although data mining is
guided by a human being who specifies some parameters, the search itself is carried out by an
automated algorithm.
On the other hand, data exploration is a methodology that uses manual techniques to help a data
user find a way through large bulks of data and bring the important, relevant data into focus for
analysis. The methodology can apply to data of any type or size, but because of its manual nature,
many opt to use data exploration only for smaller data sets.
Data preprocessing
Why preprocessing?
1. Real world data are generally
o Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
o Noisy: containing errors or outliers
o Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
o Data cleaning: fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
o Data integration: combining data from multiple databases, data cubes, or files.
o Data transformation: normalization and aggregation.
o Data reduction: reducing the volume but producing the same or similar
analytical results.
o Data discretization: part of data reduction, replacing numerical attributes
with nominal ones.
Data cleaning
1. Fill in missing values (attribute or class value); a sketch in Python follows this list:
o Ignore the tuple: usually done when the class label is missing.
o Use the attribute mean (or the majority nominal value) to fill in the missing
value.
o Use the attribute mean (or the majority nominal value) of all samples
belonging to the same class.
o Predict the missing value by using a learning algorithm: treat the
attribute with the missing value as a dependent (class) variable and run a
learning algorithm (usually Bayes or decision tree) to predict the missing
value.
2. Identify outliers and smooth out noisy data (also covered in the sketch after this list):
o Binning:
- Sort the attribute values and partition them into bins (see
"Unsupervised discretization" below);
- Then smooth by bin means, bin medians, or bin boundaries.
o Clustering: group the values into clusters, then detect and remove outliers
(automatically or manually).
o Regression: smooth by fitting the data to regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
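The cleaning steps above can be sketched in a few lines of Python. The following is a minimal
sketch rather than a definitive recipe: it assumes pandas and NumPy are installed, and the small
table with its columns "age", "income", and "class" is purely hypothetical. It fills missing values
with the attribute mean (globally and per class) and smooths a noisy attribute by bin means.

import numpy as np
import pandas as pd

# Hypothetical data: "age" has missing values, "income" is noisy.
df = pd.DataFrame({
    "age":    [23, np.nan, 35, 29, np.nan, 41],
    "income": [4, 8, 15, 21, 21, 24],
    "class":  ["+", "+", "-", "+", "-", "-"],
})

# Fill missing values with the attribute mean over all samples.
df["age_global"] = df["age"].fillna(df["age"].mean())

# Fill missing values with the attribute mean over samples of the same class.
df["age_by_class"] = df.groupby("class")["age"].transform(
    lambda s: s.fillna(s.mean()))

# Smooth noisy data by bin means: partition the values into equal-frequency
# bins, then replace each value by the mean of its bin.
bins = pd.qcut(df["income"], q=3)
df["income_smooth"] = df.groupby(bins)["income"].transform("mean")

print(df)

Per-class imputation usually preserves more structure than one global mean, at the cost of
requiring the class label to be present.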
Data transformation
1. Normalization (see the sketch after this list):
o Scaling attribute values to fall within a specified range.
- Example: to transform V in [Min, Max] to V' in [0, 1], apply
V' = (V - Min)/(Max - Min)
o Scaling by using the mean and standard deviation (useful when Min and Max
are unknown or when there are outliers): V' = (V - Mean)/StDev
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by existing
attributes.
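Both normalizations are one-liners in, for example, Python. A minimal sketch, assuming pandas is
installed; the six attribute values are hypothetical:

import pandas as pd

v = pd.Series([4, 8, 15, 21, 21, 24])   # hypothetical attribute values

# Min-max normalization: V' = (V - Min)/(Max - Min) maps V into [0, 1].
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: V' = (V - Mean)/StDev; useful when Min and Max
# are unknown or when outliers distort the range.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax.round(2).tolist())
print(v_zscore.round(2).tolist())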
Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space (see Lecture 5: Attribute-oriented
analysis).
o Principal component analysis (numeric attributes only): searching for a
lower-dimensional space that can best represent the data (see the sketch
after this list).
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping
them into intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
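Both directions of reduction can be illustrated in a short sketch. This assumes NumPy and
scikit-learn are installed; the 100 x 4 random data set is hypothetical:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 tuples, 4 numeric attributes
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)  # attribute 4 nearly duplicates attribute 1

# Reducing the number of attributes: project onto the 2 principal
# components that best represent the data.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)    # (100, 2)

# Reducing the number of tuples: simple random sampling without replacement.
sample = X[rng.choice(len(X), size=10, replace=False)]
print(sample.shape)       # (10, 4)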
Discretization and generating concept hierarchies
1. Unsupervised discretization - the class variable is not used (see the first
sketch after this list).
o Equal-interval (equiwidth) binning: split the whole range of numbers into
intervals of equal size.
o Equal-frequency (equidepth) binning: use intervals containing an equal
number of values.
2. Supervised discretization - uses the values of the class variable.
o Using class boundaries. Three steps:
- Sort the values.
- Place breakpoints between values belonging to different classes.
- If there are too many intervals, merge intervals with equal or similar
class distributions.
o Entropy (information)-based discretization (see the second sketch after
this list). Example:
- Information in a class distribution:
  - Denote a set of five values occurring in tuples belonging to
    two classes (+ and -) as [+,+,+,-,-]
  - That is, the first 3 values belong to "+" tuples and the last 2
    to "-" tuples
  - Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5)
    (logs are base 2)
  - 3/5 and 2/5 are the relative frequencies (probabilities)
  - Ignoring the order of the values, we can use the notation [3,2],
    meaning 3 values from one class and 2 from the other
  - Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5)
- Information in a split (2/5 and 3/5 are weight coefficients):
  - Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
  - Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
o Method:
- Sort the values;
- Calculate the information in all possible splits;
- Choose the split that minimizes the information;
- Do not include breakpoints between values belonging to the
same class (this would increase the information);
- Apply the same procedure to the resulting intervals until some
stopping criterion is satisfied.
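First sketch: the two unsupervised binning schemes, assuming pandas is installed; the six values
are hypothetical.

import pandas as pd

v = pd.Series([4, 8, 15, 21, 21, 24])   # hypothetical attribute values

# Equal-interval (equiwidth) binning: 3 intervals of equal width.
print(pd.cut(v, bins=3))

# Equal-frequency (equidepth) binning: 3 intervals holding (roughly)
# equal numbers of values.
print(pd.qcut(v, q=3))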
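Second sketch: entropy-based discretization using the Info() notation from the example above, in
plain Python; the toy labels are hypothetical.

import math

def info(labels):
    # Information in a class distribution,
    # e.g. info(["+", "+", "+", "-", "-"]) = 0.971.
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def split_info(left, right):
    # Weighted information in a two-way split.
    n = len(left) + len(right)
    return len(left) / n * info(left) + len(right) / n * info(right)

labels = ["+", "+", "+", "-", "-"]   # class labels in sorted attribute order

print(round(info(labels), 3))                        # Info([3,2]) = 0.971
print(round(split_info(labels[:2], labels[2:]), 3))  # Info([2,0],[1,2]) = 0.551

# Choose the breakpoint that minimizes the information in the split.
best = min(range(1, len(labels)),
           key=lambda i: split_info(labels[:i], labels[i:]))
print(best)   # 3 - the best breakpoint falls exactly on the class boundary

Note that the minimizing breakpoint lands on the class boundary, which is consistent with the rule
above that breakpoints between values of the same class only increase the information.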
CONCLUSION: Thus we have learnt about data exploration and data preprocessing
techniques.