Week 2 lecture slides
❙ Topics
❙ Data Preprocessing
❙ Descriptive Data Summarization
❙ Data Preprocessing Tasks
❙ Textbook reference: Chapter 2
Data preprocessing
❙ Preprocessing is required because typically real-world
data sets are incomplete, noisy and inconsistent.
❙ The data may be sourced from different databases that
use different attribute names for essentially the same
thing.
❙ Some attribute values may be missing.
❙ Some have been incorrectly entered, lost or corrupted.
❙ The data set may be far larger than necessary.
❙ It may be necessary to normalize, bin or cull some
attributes or some outlier records.
❙ The goal is to get a data set to which data mining
techniques can be applied efficiently and effectively.
Descriptive data summarization
❙ These techniques provide an overall view or profile of the
data.
❙ They can be used to identify problems, such as the
presence of outliers.
❙ They describe the:
❙ Central tendency of the data,
❙ Dispersion of the data.
Central tendency measures
❙ Let the attribute values be x1, x2, .. , xn
❙ The arithmetic mean is the average:
mean = (x1 + x2 + .. + xn) / n
❙ The median is the middle value of the ordered set.
❙ The mode is the most frequently occurring value. There may
be more than one (a bimodal, or multimodal data set).
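A minimal Python sketch of these measures, using only the standard library (the data series is the one used in the binning example later in these slides):

import statistics

values = [1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10]

mean = sum(values) / len(values)          # arithmetic mean
median = statistics.median(values)        # middle value of the ordered set
modes = statistics.multimode(values)      # most frequent values (may be more than one)
print(mean, median, modes)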
Dispersion measures
❙ The range is max(x1, x2, .. , xn) - min(x1, x2, .. , xn)
❙ Quartiles divide the ordered data into 4 quarters.
Q1 is the value xi such that 25% of x1, x2, .. , xn are less than
or equal to xi (Q3 is defined likewise at 75%).
❙ The interquartile range is Q3 - Q1
❙ The five-number summary of a distribution is
min(x1, x2, .. , xn), Q1, median, Q3, max(x1, x2, .. , xn)
❙ Boxplots are illustrations of the five-number summary:
Min --- Q1 median Q3 --- Max
❙ Variance is ((x1 - mean)^2 + .. + (xn - mean)^2) / n
❙ Standard Deviation is the square root of the variance.
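A small Python sketch of these dispersion measures (population variance, matching the formula above; quartiles via statistics.quantiles; same example data as before):

import statistics

x = [1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10]

rng = max(x) - min(x)                                  # range
q1, q2, q3 = statistics.quantiles(x, n=4)              # quartiles (q2 is the median)
iqr = q3 - q1                                          # interquartile range
five_number = (min(x), q1, q2, q3, max(x))             # five-number summary

mean = sum(x) / len(x)
variance = sum((v - mean) ** 2 for v in x) / len(x)    # variance, as on the slide
std_dev = variance ** 0.5                              # standard deviation
print(rng, iqr, five_number, variance, std_dev)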
Data cleaning: incomplete data
Missing value strategies:
❙ Omit the whole record.
May be losing significant data.
❙ Find the value.
Probably impractical.
❙ Use a global constant to mean “unknown”.
May interfere with knowledge discovery algorithms.
❙ Use the mean of the known values.
Skews the data.
❙ Classify the records with known values and use the
class mean (see the sketch after this list).
Better, but still skews the data.
❙ Construct a predictive model and use the estimate.
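A rough pandas sketch of the global-mean and class-mean strategies (the column names "class" and "income", and the values, are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],   # records with missing values
})

# Global-mean fill: simple, but skews the data towards the overall mean.
df["income_global"] = df["income"].fillna(df["income"].mean())

# Class-mean fill: replace each missing value with the mean of its own class.
df["income_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)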
Data cleaning: noisy data
Noisy data has random errors or variance. A good example
is sensor data that is affected by the background or even
the sensing instrument itself.
Strategies for dealing with noise:
❙ Smooth by binning, that is, change the values to those
representing a partitioning of the ordered data set,
such as the mean, median or bin boundary value (see the
sketch after this list). Rolling up through a concept
hierarchy is an example.
❙ Smoothing by regression. Linear regression finds a line
of best fit for the data. Values are changed to fit the
line.
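A minimal sketch of smoothing by bin means, with equal-frequency bins of size 3 (the values are made up for illustration):

sorted_values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # ordered noisy data
bin_size = 3

smoothed = []
for i in range(0, len(sorted_values), bin_size):
    bin_vals = sorted_values[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))   # each value replaced by its bin mean

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]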
Data cleaning: removing outliers
Outliers are data objects that don’t comply with the general
behaviour of the data set. They will skew the results of
knowledge discovery algorithms (such as classification).
Outliers could be identified by:
❙ Attribute values more than about 1.5 times (Q3 – Q1) outside
the quartiles (see the sketch after this list);
❙ Attribute values with a deviation from the mean of more than
about 2.5 times the standard deviation;
❙ Cluster analysis: data objects that don't conveniently fit
in the clusters encompassing the large majority of the
data.
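A minimal sketch of the quartile-based rule (flag values more than 1.5 times the interquartile range outside Q1 or Q3; the value 42 is an injected outlier):

import statistics

x = [1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10, 42]

q1, _, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in x if v < lower or v > upper]
print(outliers)   # [42]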
Data integration
Data integration is the combining of data from different sources.
For example, two different databases may have attributes
customer_id and cust_id. Do they mean the same thing?
The metadata of a database is the data about the data. It
should describe the meaning of the attribute names.
Attribute values may be of different data types. e.g. 1 or 1.0,
or of different levels in a concept hierarchy. e.g. $3500 or
Expensive.
The goal is to retain as much information as possible in the
combined data set.
Data reduction
Reducing the size of the data set will mean that knowledge
discovery algorithms will run more efficiently. Again, the goal
should be to ensure that no significant information is lost.
Strategies for data reduction:
❙ Aggregation, that is, summarizing the data. Data Cubes.
A lot more about Data Cubes later.
❙ Attribute reduction. Removing irrelevant attributes.
❙ Dimensionality Reduction. Encoding or compression that
transforms the data set into one with fewer attributes.
❙ Numerosity Reduction. Using bin representations or model
parameters instead of the actual values. Using only a
sample of the whole data. Rollup through the concept
hierarchy. Normalizing attribute values.
Data reduction: attribute correlation.
If two attributes in a data set have a strong correlation, then
one could be removed without significant loss of
information.
Let n be the number of records, ai and bi be the numerical values
for attributes A and B in record i, mA and mB be the respective
means, and sA and sB be the respective standard deviations of A
and B.
Then the correlation between the attributes is given by:
CAB = ( SUM_{i=1..n} ai * bi - n * mA * mB ) / ( n * sA * sB )
Note that -1 <= CAB <= +1.
If CAB = 0 then the attributes are uncorrelated (there is no
linear relationship between them).
Chi-square analysis is used to test the correlation of attributes
with categorical values.
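A small sketch implementing the formula above (population means and standard deviations, as in the slide's definition; the input lists are made up for illustration):

def correlation(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations of A and B.
    s_a = (sum((v - mean_a) ** 2 for v in a) / n) ** 0.5
    s_b = (sum((v - mean_b) ** 2 for v in b) / n) ** 0.5
    # CAB = ( SUM ai * bi - n * mA * mB ) / ( n * sA * sB )
    return (sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b) / (n * s_a * s_b)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   #  1.0: perfectly correlated
print(correlation([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0: perfectly anti-correlated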
Data reduction: principal component
analysis.
Assume a data set of m attributes.
This could be viewed as points in an m-dimensional space,
with each attribute represented by an axis.
Principal component analysis attempts to map this space onto
a smaller k-dimensional space, thus reducing the number of
attributes required to represent the data set.
The Karhunen-Loeve algorithm finds components (new
attributes) in order of variance among the data. That is, the
first shows the most variance, the second the next most, etc.
Using only the components with the strongest variance, a good
(and smaller) approximation of the data can be constructed.
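A rough NumPy sketch of the idea: centre the data, take the eigenvectors of the covariance matrix, and keep the k components with the largest variance (the data here is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # 100 records, m = 5 attributes

X_centred = X - X.mean(axis=0)             # centre each attribute on its mean
cov = np.cov(X_centred, rowvar=False)      # m x m covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenpairs, ascending by eigenvalue

order = np.argsort(eigvals)[::-1]          # components in order of decreasing variance
k = 2
components = eigvecs[:, order[:k]]         # keep the k strongest components

X_reduced = X_centred @ components         # records mapped into the k-dimensional space
print(X_reduced.shape)                     # (100, 2)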
Data reduction: wavelet
transformations.
Assume a data set of m = 2^p attributes.
Each record can be considered an m-dimensional vector
(x1, x2, .. , xm)
The Discrete Wavelet Transform consists of a low-frequency
(smoothing) filter and a high-frequency (difference) filter.
Each filter operates on value pairs (xi, xi+1) and produces a
series of length m/2.
The filters are recursively applied to previous outputs.
Selected output values are called wavelet coefficients.
They represent a compressed version of the original data.
Wavelet transformations conserve local information (about the
relationship between neighbouring values).
The wavelet transformation.
X is the original vector.
D is the low-frequency filter output.
C is the high-frequency filter output.
E is the sum of the values (the energies).
E.g. the energies at each level could be characteristic of
classes of data objects, that is, used for classification.
Example of dimensionality reduction
using the Haar wavelet transformation
Assume a record has values for 8 attributes.
It can be represented as a vector in an 8-dimensional space.
e.g.
6, 4, 4, 4, 8, 6, 8, 4
The low pass filter computes the average of each pair.
The high pass filter computes half the difference of each pair.
1st low pass: 5, 4, 7, 6
1st high pass: 1, 0, 1, 2
2nd LP: 4.5, 6.5    2nd HP: .5, .5
3rd LP: 5.5         3rd HP: -1
Note: LP and HP are decimation filters.
They reduce the size of the data at each step.
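A sketch of the recursive averaging and differencing on the example vector (the convention matches the first-level values above: pair averages for LP, half-differences for HP):

def haar_step(values):
    # One level: pairwise averages (low pass) and half-differences (high pass).
    low = [(a + b) / 2 for a, b in zip(values[0::2], values[1::2])]
    high = [(a - b) / 2 for a, b in zip(values[0::2], values[1::2])]
    return low, high

x = [6, 4, 4, 4, 8, 6, 8, 4]
details = []
while len(x) > 1:
    x, high = haar_step(x)
    details.append(high)
    print("LP:", x, "HP:", high)

# Final average plus all detail coefficients; dropping small details
# gives a compressed approximation of the original vector.
print([x[0]] + [d for level in details for d in level])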
Data reduction: discretization.
Discretization represents continuous values by an interval, group
or concept abstraction level.
❙ Binning. Divide the range into equal width or equal frequency
intervals. Replace values by the bin mean or median.
❙ Histogram analysis represents the frequency of values. Buckets
are merged to cover ranges of values. Replace values by their
frequency or frequency range.
❙ Clustering groups similar values.
❙ Entropy-based discretization works on data sets with a class
attribute to form a concept hierarchy.
❙ ChiMerge is a bottom-up method that uses the Chi-square test.
Binning examples
Data series: 1,1,3,4,4,4,5,5,5,6,6,7,8,8,8,9,10
Equal-width bins (5 bins of width 2):
Bin 1: 1,1
Bin 2: 3,4,4,4
Bin 3: 5,5,5,6,6
Bin 4: 7,8,8,8
Bin 5: 9,10
Represent values by:
❙ Bin number
❙ Bin mean
❙ Bin boundary value
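A small sketch reproducing the equal-width bins above (5 bins of width 2 over the range 1 to 10):

data = [1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10]
n_bins, width = 5, 2
lo = min(data)

bins = {i: [] for i in range(1, n_bins + 1)}
for v in data:
    i = min((v - lo) // width + 1, n_bins)   # bin number, 1..n_bins
    bins[i].append(v)

for i, vals in bins.items():
    print(f"Bin {i}: {vals}")   # matches the bins listed above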
Frequency Histograms
Numerosity reduction.
Numerosity reduction reduces the size or even the range of
attribute values.
❙ Logarithms. That is, transform using the log function.
e.g. log(100) = 2, log(1000) = 3 (base-10 logarithms)
Normalization
e.g. Z-score normalization.
Let values of attribute A have mean m
and standard deviation s.
Then a value v is normalized as
u = (v – m) / s
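A minimal sketch of z-score normalization (the attribute values are made up for illustration):

import statistics

values = [30, 40, 50, 60, 70]                 # example attribute values
m = sum(values) / len(values)                 # mean of attribute A
s = statistics.pstdev(values)                 # population standard deviation of A

normalized = [(v - m) / s for v in values]    # u = (v - m) / s
print(normalized)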