Data Preprocessing
This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali, and Ms. Rajnipriya Dhawan and licensed
under a Creative Commons Attribution 4.0 International License.



Suppose you are a manager at AllElectronics and have been
charged with analyzing the company's data with respect to the
sales at a branch. You carefully inspect the company's database
and data warehouse, identifying dimensions to be included, such
as item, price, and units sold.
You notice that several of the attributes for various tuples have
no recorded value. For your analysis, you would like to include
this information.
In other words, the data you wish to analyze by data mining
techniques are incomplete, noisy, and inconsistent.


Data cleaning routines work to "clean" the data by filling in
missing values and smoothing noisy data. If users believe the
data are dirty, they are unlikely to trust the results. Furthermore,
dirty data can cause confusion for the mining procedure,
resulting in unreliable output.
Although most mining routines have some procedures for
dealing with incomplete data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data
to the function being modeled. Therefore, a useful step is to
run your data through some data cleaning routines.


Some attributes representing a given concept may have
different names in different databases, causing inconsistencies
and redundancies.
For example:
The attribute for customer identification may be referred to as
customer id in one data store and cust id in another.
The same first name could be registered as "Bill" in one
database, "William" in another, and "B" in a third.
Having a large amount of redundant data may slow down or
confuse the knowledge discovery process.

Getting back to your data, you have decided that you would
like to use a distance-based mining algorithm for your analysis,
such as neural networks, nearest-neighbor classifiers, or
clustering. Such methods provide better results if the data to be
analyzed have been normalized, that is, scaled to a common
range such as [0.0, 1.0].

Example: Suppose the data contain the attributes age and
annual salary. The annual salary attribute usually takes much
larger values than age. If the attributes are left un-normalized,
the distance measurements taken on annual salary will
generally outweigh the distance measurements taken on age.
There are a number of strategies for data reduction. These
include data aggregation (e.g., building a data cube), attribute
subset selection (removing irrelevant attributes through
correlation analysis), dimensionality reduction (using
encoding schemes such as minimum length encoding), and
numerosity reduction ("replacing" the data by alternative,
smaller representations). Data can also be "reduced" by
generalization with the use of concept hierarchies, where
low-level concepts, such as city for customer location, are
replaced with higher-level concepts, such as region or state.
Data Cleaning
Data Integration
Data Transformation
(e.g., normalization: -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48)
Data Reduction
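The transformation example above can be reproduced by dividing each value by the attribute's maximum absolute value (here 100). This is one plausible reading of the slide's example; the slide does not name the exact method used:

```python
# Sketch: reproducing the slide's transformation of
# -2, 32, 100, 59, 48 into -0.02, 0.32, 1.00, 0.59, 0.48
# by dividing each value by the maximum absolute value (here 100).

def max_abs_scale(values):
    """Scale values into [-1, 1] by the largest absolute value."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]

data = [-2, 32, 100, 59, 48]
print(max_abs_scale(data))  # [-0.02, 0.32, 1.0, 0.59, 0.48]
```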
Real-world data tend to be incomplete, noisy, and
inconsistent. Data cleaning routines attempt to fill in
missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.

Missing Values:
Ignore the tuple: This is usually done when the class label is
missing. This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies
considerably.


Fill in the missing value manually: In general, this
approach is time-consuming and may not be feasible
given a large data set with many missing values.
Use a global constant to fill in the missing value:
Replace all missing attribute values by the same
constant, such as a label like "Unknown". If
missing values are replaced by, say, "Unknown," then
the mining program may mistakenly think that they
form an interesting concept, since they all have a
value in common, that of "Unknown." Hence,
although this method is simple, it is not foolproof.



Use the attribute mean to fill in the missing value: For
example, suppose that the average income of AllElectronics
customers is $56,000. Use this value to replace the missing
value for income.
Use the attribute mean for all samples belonging to the
same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value
with the average income value for customers in the same
credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based
tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes
in your data set, you may construct a decision tree to predict
the missing values for income.
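The fill-in strategies above can be sketched on a toy data set. The records and the credit-risk classes below are hypothetical, chosen only to show the mechanics:

```python
# Sketch: three of the fill-in strategies above, on a toy data set.
# The records and the credit_risk classes are hypothetical.
from statistics import mean

rows = [
    {"income": 45_000, "credit_risk": "low"},
    {"income": None,   "credit_risk": "low"},
    {"income": 67_000, "credit_risk": "high"},
    {"income": 61_000, "credit_risk": "low"},
    {"income": None,   "credit_risk": "high"},
]

# 1. Global constant: replace every missing income with a sentinel label.
constant_fill = [r["income"] if r["income"] is not None else "Unknown"
                 for r in rows]

# 2. Attribute mean over all tuples that have a value.
overall = mean(r["income"] for r in rows if r["income"] is not None)
mean_fill = [r["income"] if r["income"] is not None else overall
             for r in rows]

# 3. Attribute mean within the tuple's own class.
def class_mean(cls):
    return mean(r["income"] for r in rows
                if r["credit_risk"] == cls and r["income"] is not None)

class_fill = [r["income"] if r["income"] is not None
              else class_mean(r["credit_risk"])
              for r in rows]

print(mean_fill[1])   # overall mean, about 57666.67
print(class_fill[1])  # 53000, the mean of the "low" class only
```

Note how the class-conditional mean (53,000 for the "low" class) differs from the overall mean, which is pulled up by the "high" class incomes.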
Noise is a random error or variance in a measured variable.
Given a numerical attribute such as, say, price, how can we
"smooth" out the data to remove the noise?
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15   Bin 2: 21, 21, 24   Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15   Bin 2: 21, 21, 24   Bin 3: 25, 25, 34
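The worked example above can be reproduced with a short script; this is a sketch of equal-frequency binning, not a library routine:

```python
# Sketch: the equal-frequency binning example from the slide.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def partition(values, n_bins):
    """Split sorted values into equal-frequency bins."""
    size = len(values) // n_bins
    return [values[i:i + size] for i in range(0, len(values), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of the bin's min or max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = partition(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```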
Binning: Binning methods smooth a sorted data value by
consulting its "neighborhood," that is, the values around it.
The sorted values are distributed into a number of "buckets,"
or bins.
Clustering: Similar values are organized into groups, or
"clusters." Values that fall outside of the set of clusters may
be considered outliers.
Regression: Data can be smoothed by fitting the data to a
regression function.
Linear regression involves finding the "best" line to fit
two attributes, so that one attribute can be used to predict the
other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are involved and
the data are fit to a multidimensional surface.
Combined computer and human inspection: Outliers may be
identified through a combination of computer and human
inspection.
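As a sketch of smoothing by linear regression, the least-squares line for a single predictor can be computed directly. The (x, y) data points below are made up for illustration:

```python
# Sketch: smoothing one attribute with a least-squares line fit on
# another. The (x, y) data points below are made up for illustration.

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # noisy measurements, roughly y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least squares for a single predictor.
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Smoothed values: each y is replaced by its point on the fitted line.
smoothed = [intercept + slope * x for x in xs]
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```

Replacing each observed y with its fitted value removes the random scatter while preserving the underlying trend.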
Data mining often requires data integration, the merging of
data from multiple data stores.
The data may also need to be transformed into forms
appropriate for mining.
Data Integration: It is likely that your data analysis task will
involve data integration, which combines data from
multiple sources into a coherent data store, as in data
warehousing.
Thank you