Data Cleaning: The information possessed by many companies contains numerous anomalies and missing values. If a company wants to analyze its data using data mining, data cleaning should be performed first; otherwise the patterns generated will be incorrect and of no use, becoming an additional burden on the company's management.
Data cleaning is an approach to cleaning the data by correcting errors, filling in missing values, and removing noise. It is one of the most important pre-processing tasks to be performed on data before applying data mining techniques. There are many methods for cleaning the data.

Replacing missing values: There are many situations where certain columns in the dataset have no value; records with such entries are termed missing-value tuples. To make correct predictions, the missing values need to be handled and replaced with an approximate value. This can be done by:
a) Replacing by the average: the missing value in a numerical column is replaced by the average of all the values already present in that column.
b) Replacing by a constant: all missing values across the dataset are replaced by a constant value, for example a label such as "Unknown" or the value 0.
c) Replacing manually: the missing values are filled in by hand, which is not the best option, especially for huge datasets.
d) Removing the record: if many of a record's attributes have missing values, removing the whole record can be a good heuristic.
e) Replacing by an approximate value: the missing values are filled with the most probable value of that attribute.
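A minimal pandas sketch of strategies (a), (b), (d), and (e); the column names and values here are illustrative, not taken from the IMDB table below:

    import numpy as np
    import pandas as pd

    # Toy table with missing entries (NaN / None).
    df = pd.DataFrame({
        "rating": [4.8, np.nan, 5.5, 4.4],
        "genre": ["Drama", None, "Comedy", "Drama"],
    })

    # (a) Replace missing numeric values by the column average.
    df["rating"] = df["rating"].fillna(df["rating"].mean())

    # (b) Replace missing values by a constant such as "Unknown".
    df["genre"] = df["genre"].fillna("Unknown")

    # (e) Or replace by the most probable value (here: the column mode).
    # df["genre"] = df["genre"].fillna(df["genre"].mode()[0])

    # (d) Or drop records that still have missing attributes.
    df = df.dropna()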
As an example, consider the table below, containing 10 records from the IMDB dataset ("?" marks a missing value):

Title | Year | ReleaseDate | No. of ratings | IMDB Rating | Genre | Director | Cinematographer | Actors
18 Wheels of Justice | 2000 | 12 Jan 2000 | 14 | 4.8 | Action, Crime, Drama, Thriller | Moranville, John B. | Jack Garrett | Thornhill, Lisa, Hosea, Bobby
24: Conspiracy | 2005 | ? | 8 | 4.4 | Action, Drama, Shortfilm, Thriller | Gatrick, Maro | ? | Bryant, Beverly, Rider, Amy
2gether: The Series | 2000 | 15 Aug 2000 | 17 | 5.5 | Comedy, Family | Gunn, Mark | Winter, Glen | James, Brenda, Smith, Lauren
The 70's House | 2005 | 5 Jul 2005 | 12 | 5.5 | ? | Sizemore, Robert(I) | Frenzel, Guido | Lee (X), Leggero, Natasha, Stallings
8th & Ocean | 2006 | 7 Mar 2006 | 89 | 4.9 | Documentary | Akhtar, Kabir | Capodice, Nick | Aldridge, Kelly, Aldridge, Sabrina
Amarte Así | 2005 | 6 Apr 2005 | 6 | 8.2 | Drama | de Anda, Heriberto | Suárez, Juan Carlos | Almus, Irene, Ballesteros, Marita
American Embassy, The | 2002 | 11 Mar 2002 | 14 | 9.1 | Drama, Comedy | Coles, John | ? | Aylesworth, Reiko, Bareikis, Arija
American High | 2000 | 2 Aug 2000 | 11 | 5.3 | ? | Cutler, R.J. | Churchill, Joan | Bodle, Kaytee, Komessar, Allie
American Juniors | 2003 | 3 Jun 2003 | 18 | 1.9 | ? | Gowers, Bruce | ? | Dubela, Julie, Gibson, Deborah
As If | 2002 | ? | 19 | 7.9 | Comedy | Grant, Brian | Welland, James | Corrie, Emily, Thoms, Tracie
There are missing values for three different attributes: ReleaseDate, Genre, and Cinematographer.
1. Consider the first of these three attributes. ReleaseDate is closely related to the Year attribute in the table, so if we remove the ReleaseDate attribute there is not much loss of important information. In this way, redundant attributes can be removed from the table.
2. Consider the Genre attribute, which gives information about the type of the movie. If we remove this attribute there will certainly be a huge loss of information, so instead of removing the attribute, the missing values should be filled in. As mentioned earlier, we can fill in the missing entries with the most probable value. A naive way to determine the most probable value is to take the most frequent values in the column, which here are 'Drama' and 'Comedy'. One can also use methods such as regression or Bayesian posterior probability estimates to determine the most probable value.
3. For the Cinematographer attribute we basically have two choices. One is to fill in its missing values manually; this may be a laborious process, so it is left to the user's judgment. The other is simply to delete the attribute, since in the current table it is of lesser relevance than the Director attribute.
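Assuming the ten records above are loaded as a pandas DataFrame (the file name imdb_sample.csv is hypothetical), the three steps might look like this sketch:

    import pandas as pd

    # "?" entries are read in as missing values (NaN).
    df = pd.read_csv("imdb_sample.csv", na_values="?")

    # 1. Drop ReleaseDate: the Year attribute already carries most of it.
    df = df.drop(columns=["ReleaseDate"])

    # 2. Fill missing Genre values with the most frequent value (a naive
    #    mode; the real column holds comma-separated genre lists, so a
    #    production version would split those lists first).
    df["Genre"] = df["Genre"].fillna(df["Genre"].mode()[0])

    # 3. Drop Cinematographer rather than fill it in manually.
    df = df.drop(columns=["Cinematographer"])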
After applying the changes listed above, the cleaned table looks like:

Title | Year | No. of ratings | Rating | Genre | Director | Actors
18 Wheels of Justice | 2000 | 14 | 4.8 | Action, Crime, Drama, Thriller | Moranville, John B. | Thornhill, Lisa, Hosea, Bobby
24: Conspiracy | 2005 | 8 | 4.4 | Action, Drama, Shortfilm, Thriller | Gatrick, Maro | Bryant, Beverly, Rider, Amy
2gether: The Series | 2000 | 17 | 5.5 | Comedy, Family | Gunn, Mark | James, Brenda, Smith, Lauren
The 70's House | 2005 | 12 | 5.5 | Comedy | Sizemore, Robert(I) | Lee (X), Leggero, Natasha, Stallings
8th & Ocean | 2006 | 89 | 4.9 | Documentary | Akhtar, Kabir | Aldridge, Kelly, Aldridge, Sabrina
Amarte Así | 2005 | 6 | 8.2 | Drama | de Anda, Heriberto | Almus, Irene, Ballesteros, Marita
American Embassy, The | 2002 | 14 | 9.1 | Drama, Comedy | Coles, John | Aylesworth, Reiko, Bareikis, Arija
American High | 2000 | 11 | 5.3 | Comedy | Cutler, R.J. | Bodle, Kaytee, Komessar, Allie
American Juniors | 2003 | 18 | 1.9 | Drama | Gowers, Bruce | Dubela, Julie, Gibson, Deborah
As If | 2002 | 19 | 7.9 | Comedy | Grant, Brian | Corrie, Emily, Thoms, Tracie
Handling Noisy Data: Noise is commonly present in raw data, and eliminating it is very important; without this, the knowledge mined does not make much sense. Hence, the data should be free of noise. There are different approaches for eliminating noise from the data, such as binning and clustering.
1. Binning is a technique which smooths the data. The different methods of binning, and how they can be used, are explained in the data discretization section below.
Data Transformation: As explained earlier in the introduction, data needs to be standardized for different applications so that data mining algorithms can be applied to it. Hence, data is normally transformed into the necessary formats. Some of the different methods of transformation are:

Data Normalization: In real-world numeric datasets it is often observed that the ranges of values of some columns are much wider than those of the rest. If a mining algorithm is run directly over such a dataset, it is quite possible that the algorithm will be biased. Normalization provides methods to scale the entire numeric data to an interval such as [0, 1].
There are different ways of normalizing:
1. Min-Max Normalization: Suppose min_A and max_A are the minimum and maximum values of the attribute A we want to normalize. To make the values fall into the range [new_min_A, new_max_A], each value v of the attribute is mapped to a new value v_new:

v_new = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

For the range [0, 1], new_max_A = 1 and new_min_A = 0.
2. Z-score Normalization: Also called zero-mean normalization, this normalizes the values of an attribute X using the mean and standard deviation of X. A new value v_new is obtained as

v_new = (v - µ) / σ

where µ is the mean and σ is the standard deviation of attribute X.
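Both normalizations can be expressed directly in code; a minimal NumPy sketch, with function and parameter names chosen here for illustration:

    import numpy as np

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Linearly rescale so min(values) -> new_min and max(values) -> new_max.
        v = np.asarray(values, dtype=float)
        return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    def z_score_normalize(values):
        # Shift to zero mean, scale by the standard deviation.
        v = np.asarray(values, dtype=float)
        return (v - v.mean()) / v.std()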
As an example, consider the following sample of ten records from the Boston Housing dataset:

CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 18.00 | 2.310 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 4.98 | 24.00
0.08829 | 12.50 | 7.870 | 0.5240 | 6.0120 | 66.60 | 5.5605 | 5 | 311.0 | 15.20 | 395.60 | 12.43 | 22.90
0.14455 | 12.50 | 7.870 | 0.5240 | 6.1720 | 96.10 | 5.9505 | 5 | 311.0 | 15.20 | 396.90 | 19.15 | 27.10
0.02763 | 75.00 | 2.950 | 0.4280 | 6.5950 | 21.80 | 5.4011 | 3 | 252.0 | 18.30 | 395.62 | 1.98 | 34.90
0.8873 | 21.00 | 5.640 | 0.4390 | 5.9630 | 45.70 | 6.8147 | 4 | 243.0 | 16.80 | 395.56 | 13.45 | 19.70
0.15445 | 25.00 | 5.130 | 0.4530 | 6.1450 | 29.20 | 7.8148 | 8 | 284.0 | 19.70 | 390.68 | 6.86 | 23.30
0.01951 | 17.50 | 1.380 | 0.4161 | 7.1040 | 59.50 | 9.2229 | 3 | 216.0 | 18.60 | 393.24 | 8.05 | 33.00
0.04203 | 28.00 | 15.040 | 0.4640 | 6.4420 | 53.60 | 3.6659 | 4 | 270.0 | 18.20 | 395.01 | 8.16 | 22.90
0.08244 | 30.00 | 4.930 | 0.4280 | 6.4810 | 18.50 | 6.1899 | 6 | 300.0 | 16.60 | 379.41 | 6.36 | 23.70
0.21409 | 22.00 | 5.860 | 0.4310 | 6.4380 | 8.90 | 7.3967 | 7 | 330.0 | 19.10 | 377.07 | 3.59 | 24.80
The two attributes we are going to normalize are ZN and LSTAT.
a) The values of the ZN attribute are {12.50, 12.50, 17.50, 18.00, 21.00, 22.00, 25.00, 28.00, 30.00, 75.00}. We want to scale these values to the interval [0, 1] using min-max normalization.
b) Here min_A = 12.50, max_A = 75.00, new_min_A = 0, new_max_A = 1.
c) Consider the value 12.50. The value replacing it is v_new = ((12.50 - 12.50) / (75.00 - 12.50)) * (1 - 0) + 0 = 0.
d) Proceeding in the same way, the new values for the ZN column are: 12.50 → 0; 17.50 → 0.08; 18.00 → 0.088; 21.00 → 0.136; 22.00 → 0.152; 25.00 → 0.2; 28.00 → 0.248; 30.00 → 0.28; 75.00 → 1.
The values of the LSTAT attribute are {1.98, 3.59, 4.98, 6.36, 6.86, 8.05, 8.16, 12.43, 13.45, 19.15}. Applying the same min-max procedure, the values map to their corresponding new values:
1.98 → 0; 3.59 → 0.094; 4.98 → 0.1747; 6.36 → 0.255; 6.86 → 0.2842; 8.05 → 0.3535; 8.16 → 0.3599; 12.43 → 0.6086; 13.45 → 0.668; 19.15 → 1.
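These mappings can be checked mechanically; a one-off sketch for the ZN column (the same pattern applies to LSTAT):

    import numpy as np

    zn = np.array([12.50, 12.50, 17.50, 18.00, 21.00, 22.00,
                   25.00, 28.00, 30.00, 75.00])
    print(np.round((zn - zn.min()) / (zn.max() - zn.min()), 3))
    # -> 0, 0, 0.08, 0.088, 0.136, 0.152, 0.2, 0.248, 0.28, 1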
The normalized table (ZN and LSTAT replaced by their normalized values) looks like:

CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 0.088 | 2.310 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 0.1747 | 24.00
0.08829 | 0 | 7.870 | 0.5240 | 6.0120 | 66.60 | 5.5605 | 5 | 311.0 | 15.20 | 395.60 | 0.6086 | 22.90
0.14455 | 0 | 7.870 | 0.5240 | 6.1720 | 96.10 | 5.9505 | 5 | 311.0 | 15.20 | 396.90 | 1 | 27.10
0.02763 | 1 | 2.950 | 0.4280 | 6.5950 | 21.80 | 5.4011 | 3 | 252.0 | 18.30 | 395.62 | 0 | 34.90
0.8873 | 0.136 | 5.640 | 0.4390 | 5.9630 | 45.70 | 6.8147 | 4 | 243.0 | 16.80 | 395.56 | 0.668 | 19.70
0.15445 | 0.2 | 5.130 | 0.4530 | 6.1450 | 29.20 | 7.8148 | 8 | 284.0 | 19.70 | 390.68 | 0.2842 | 23.30
0.01951 | 0.08 | 1.380 | 0.4161 | 7.1040 | 59.50 | 9.2229 | 3 | 216.0 | 18.60 | 393.24 | 0.3535 | 33.00
0.04203 | 0.248 | 15.040 | 0.4640 | 6.4420 | 53.60 | 3.6659 | 4 | 270.0 | 18.20 | 395.01 | 0.3599 | 22.90
0.08244 | 0.28 | 4.930 | 0.4280 | 6.4810 | 18.50 | 6.1899 | 6 | 300.0 | 16.60 | 379.41 | 0.255 | 23.70
0.21409 | 0.152 | 5.860 | 0.4310 | 6.4380 | 8.90 | 7.3967 | 7 | 330.0 | 19.10 | 377.07 | 0.094 | 24.80
Data Discretization: Data discretization is the process of converting some columns of a dataset from numerical to categorical values. This is done to support algorithms that work only on fully categorical data. As an example, ten rows from the movies table of the IMDB dataset are considered below. We describe the process in two stages: first we show the 10 samples from the movies table, with the attribute to be discretized (the IMDB rating) listed first; then we show the table after binning.
Example:

IMDB rating | Release date | Title | Year | Genre | Actors
4.8 | 12 Jan 2000 | ? | 2000 | Action, crime, drama | Moranville, John B., Radler, Robert, Satlof
4.4 | ? | Conspiracy | 2005 | Action, drama, thriller | Ostrick, Marc, Young, Eric (VII), Young, Eric Neal
5.5 | 5 July 2005 | 70's House | 2005 | ? | Sizemore, Robert, Frenzel, Guido, Taylor, Mike L
5.5 | 15 August 2000 | 2gether: The Series | 2000 | Comedy, family | Gunn, Mark, Lazarus, Paul, Pozer
4.9 | 7 March 2006 | 8th & Ocean | 2006 | documentary | Akhtar, Kabir, Barrett, Kasey, Johnson, Jaymee
8.2 | 6 April 2005 | Amarte Así | 2005 | drama | Moser, Alejandro Hugo, Juan Carlos Almus
9.1 | 11 March 2002 | American Embassy, The | 2002 | Drama, comedy | Coles, John David, Cragg, Stephen, Surjik, Stephen
5.4 | 2 August 2000 | American High | 2000 | ? | Ellwood, Alison, Partland, Dan Chinn
1.9 | 3 June 2003 | American Juniors | 2003 | ? | Gowers, Bruce
7.9 | ? | As if | 2002 | comedy | Grant, Brian, Meyers, Simon Stok, Witold Corrie
Equi-frequency binning (three bins with roughly equal counts):
Bin1 => 'A' = (1.9, 4.4, 4.8)
Bin2 => 'B' = (4.9, 5.4, 5.5, 5.5)
Bin3 => 'C' = (7.9, 8.2, 9.1)

Equi-width binning (four bins of width 2.5 over [0, 10]):
Bin1 => 'a' = (1.9), range [0, 2.5]
Bin2 => 'b' = (4.4, 4.8, 4.9), range (2.5, 5]
Bin3 => 'c' = (5.4, 5.5, 5.5), range (5, 7.5]
Bin4 => 'd' = (7.9, 8.2, 9.1), range (7.5, 10]

The discretized table below uses the equi-frequency labels A, B, and C; a pandas sketch of both binning schemes follows it.
IMDB rating | Release date | Movie name | Year of release | Genre | Actors
A | 12 Jan 2000 | 18 Wheels of Justice | 2000 | Action, crime, drama | Moranville, John B., Radler, Robert, Satlof
A | ? | Conspiracy | 2005 | Action, drama, thriller | Ostrick, Marc, Young, Eric (VII), Young, Eric Neal
B | 5 July 2005 | 70's House | 2005 | ? | Sizemore, Robert, Frenzel, Guido, Taylor, Mike L
B | 15 August 2000 | 2gether: The Series | 2000 | Comedy, family | Gunn, Mark, Lazarus, Paul, Pozer
B | 7 March 2006 | 8th & Ocean | 2006 | documentary | Akhtar, Kabir, Barrett, Kasey, Johnson, Jaymee
C | 6 April 2005 | Amarte Así | 2005 | drama | Moser, Alejandro Hugo, Juan Carlos Almus
C | 11 March 2002 | American Embassy, The | 2002 | Drama, comedy | Coles, John David, Cragg, Stephen, Surjik, Stephen
B | 2 August 2000 | American High | 2000 | ? | Ellwood, Alison, Partland, Dan Chinn
A | 3 June 2003 | American Juniors | 2003 | ? | Gowers, Bruce
C | ? | As if | 2002 | comedy | Grant, Brian, Meyers, Simon Stok, Witold Corrie
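Both binning schemes have direct pandas equivalents; a sketch using the ten ratings above:

    import pandas as pd

    ratings = pd.Series([4.8, 4.4, 5.5, 5.5, 4.9, 8.2, 9.1, 5.4, 1.9, 7.9])

    # Equi-frequency (equal-depth) binning: roughly equal counts per bin.
    # Note that pandas places boundary values by quantile, so the exact
    # membership of an edge value such as 4.9 may differ slightly from
    # the hand-made grouping above.
    equi_freq = pd.qcut(ratings, q=3, labels=["A", "B", "C"])

    # Equi-width binning: fixed edges of width 2.5 over [0, 10].
    equi_width = pd.cut(ratings, bins=[0, 2.5, 5, 7.5, 10],
                        labels=["a", "b", "c", "d"])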
Data Generalization: Data generalization is the process of transforming data in a database from a low conceptual level to a higher conceptual level. Large data sets can thereby be summarized concisely according to the requirement, which helps in providing an overall picture of the data. For example, consider the Census dataset, a sub-table of which looks like:

Age | Workclass | Education | MaritalStatus | Race | Sex
39 | State-gov | Bachelors | Never-Married | White | Male
52 | Self-emp-not-inc | HS-grad | Married-Civspouse | White | Male
32 | Private | Assoc-acdm | Never-Married | Black | Male
49 | Private | HS-grad | Separated | White | Female
57 | Federal-gov | Bachelors | Married-Civspouse | Black | Male
25 | Private | Some-College | Married-CivSpouse | Other | Female
30 | Private | HS-grad | Married-CivSpouse | Asian-PacIslander | Female
48 | Self-emp-not-Inc | Some-college | Married-CivSpouse | Amer-IndianEskimo | Male
18 | Never-Worked | 10th | Never-Married | White | Male
33 | State-gov | Some-college | Divorced | Black | Female
When data generalization is applied to the Age attribute, Age is generalized as:
Age 0-12: Children; 13-19: Teenage; 20-30: Adolescent-Age; 31-50: Middle-Age; >50: Old-Age.
The transformed table is:

Age | Workclass | Education | MaritalStatus | Race | Sex
Middle-Age | State-gov | Bachelors | Never-Married | White | Male
Old-Age | Self-emp-not-inc | HS-grad | Married-Civspouse | White | Male
Middle-Age | Private | Assoc-acdm | Never-Married | Black | Male
Middle-Age | Private | HS-grad | Separated | White | Female
Old-Age | Federal-gov | Bachelors | Married-Civspouse | Black | Male
Adolescent-Age | Private | Some-College | Married-CivSpouse | Other | Female
Adolescent-Age | Private | HS-grad | Married-CivSpouse | Asian-PacIslander | Female
Middle-Age | Self-emp-not-Inc | Some-college | Married-CivSpouse | Amer-IndianEskimo | Male
Teenage | Never-Worked | 10th | Never-Married | White | Male
Middle-Age | State-gov | Some-college | Divorced | Black | Female
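The age generalization above can be sketched with pandas.cut; the bin edges follow the ranges given earlier, and 120 is just an assumed upper bound for ">50":

    import pandas as pd

    ages = pd.Series([39, 52, 32, 49, 57, 25, 30, 48, 18, 33])
    labels = ["Children", "Teenage", "Adolescent-Age", "Middle-Age", "Old-Age"]
    # Intervals are right-closed: (0,12], (12,19], (19,30], (30,50], (50,120],
    # so age 30 falls in Adolescent-Age, matching the table above.
    age_groups = pd.cut(ages, bins=[0, 12, 19, 30, 50, 120], labels=labels)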
Data Aggregation: Data aggregation is the process of summarizing information, mainly for statistical analysis. It is particularly used in marketing, where a marketer wants to analyze the purchases made by customers over a particular period and, from that analysis, improve selling techniques and products. A classic example is determining which products sell best in a particular time period and then stocking a wider variety of the most-sold items or offering discounts on those kinds of products.
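A typical aggregation of this kind in pandas; the table, column names, and values here are made up for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
        "product": ["A", "B", "A", "B", "A"],
        "amount":  [120, 80, 150, 60, 90],
    })

    # Total amount sold per product per month.
    summary = sales.groupby(["month", "product"])["amount"].sum()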
Data Reduction: Real-world data is highly diverse and therefore needs to be simplified before mining. Data discretization itself is one method of data reduction. In addition, there are several other ways in which data can be reduced:
1) Numerosity reduction: the data are replaced or estimated by alternative, smaller representations such as parametric models, or by nonparametric methods such as clustering, sampling, and histograms.
2) Dimensionality reduction: datasets may contain thousands of features, not all of which are equally important for the task at hand; some attributes may be quite irrelevant. For example, if we construct a dataset for predicting a disease from features such as body_mass_index and temperature, then a feature like telephone_number is likely irrelevant. A straightforward goal of dimensionality reduction is to find the best subset of features with respect to the prediction accuracy of the algorithm: if prediction accuracy is similar to or better than the original accuracy, the removed features are said to be irrelevant.
3) Clustering: this can be used to group objects based on a similarity function, reducing huge datasets to groups of interrelated objects.
4) Sampling: sampling is a typical numerosity reduction technique. There are several ways to construct a sample (see the pandas sketch after this list):
a) Simple random sampling without replacement: performed by randomly choosing n1 data points such that n1 < n, where n is the number of data points in the original dataset D.
b) Simple random sampling with replacement: we again select n1 < n data points, but draw them one at a time (n1 times), so one data point can be drawn multiple times into the same subsample.
c) Cluster sample: the examples in D are first grouped into M disjoint clusters; then a simple random sample of m < M clusters can be drawn.
d) Stratified sample: D is first divided into disjoint parts called strata, and a stratified sample of D is generated by taking a simple random sample within each stratum. This helps in getting a representative sample, especially when the data is skewed (say, many more examples of class 0 than of class 1). Stratified samples can be proportionate or disproportionate.
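A pandas sketch of the four sampling schemes; the DataFrame, the label column, and the cluster assignments are placeholders:

    import pandas as pd

    # Toy dataset: 10 points, a skewed class label, and cluster ids.
    df = pd.DataFrame({
        "x": range(10),
        "label": [0] * 7 + [1] * 3,
        "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 3, 3],
    })
    n1 = 5

    # (a) Simple random sampling without replacement.
    srswor = df.sample(n=n1, replace=False, random_state=0)

    # (b) Simple random sampling with replacement: the same row may
    #     appear more than once in the subsample.
    srswr = df.sample(n=n1, replace=True, random_state=0)

    # (c) Cluster sample: draw m whole clusters at random (m = 2 here).
    chosen = pd.Series(df["cluster"].unique()).sample(n=2, random_state=0)
    clustered = df[df["cluster"].isin(chosen)]

    # (d) Proportionate stratified sample: sample the same fraction
    #     within each class, preserving the class ratio.
    stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)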