Preprocessing data
Komate AMPHAWAN
Why Preprocess the Data?
Low-quality data will lead to low-quality mining results
• How can the data be preprocessed in order to
help improve the quality of the data and,
consequently, of the mining results?
• How can the data be preprocessed so as to
improve the efficiency and ease of the
mining process?
Data preprocessing techniques
• Data cleaning can be applied to remove noise
and correct inconsistencies in the data.
• Data integration merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformations, such as normalization,
may be applied.
• Data reduction can reduce the data size by
aggregating, eliminating redundant features,
or clustering, for instance.
Why Preprocess the Data?
• The data you wish to analyze by data mining
techniques are often incomplete (lacking attribute
values or certain attributes of interest, or
containing only aggregate data), noisy
(containing errors, or outlier values that
deviate from the expected), and inconsistent
(e.g., containing discrepancies in the
department codes used to categorize items).
Descriptive Data Summarization [1]
• For data preprocessing to be successful, it is
essential to have an overall picture of your
data.
• Descriptive data summarization techniques
can be used to identify the typical properties
of your data and highlight which data values
should be treated as noise or outliers.
Descriptive Data Summarization [2]
• We want to learn about data characteristics regarding both
the central tendency and the dispersion (spread) of
the data.
• Measures of central tendency include mean,
median, mode, and midrange.
• Measures of data dispersion include quartiles,
interquartile range (IQR), and variance.
Descriptive Data Summarization [3]
• Descriptive statistics are of great help in
understanding the distribution of the data.
  - We examine how these measures can be computed efficiently in
large databases.
Measuring the Central Tendency [1]
• The most effective numerical measure of the
“center” of a set of data is the (arithmetic)
mean.
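For a set of N values x1, x2, …, xN, the arithmetic mean is their sum divided by N. Written out in LaTeX notation (the symbol \bar{x} for the mean is a conventional choice, not taken from the slide):

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}
```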
Measuring the Central Tendency [2]
• A distributive measure is a function that can
be computed for a given data set by (i)
partitioning the data into smaller subsets, (ii)
computing the measure for each subset, and
then (iii) merging the results in order to
arrive at the measure’s value for the original
(entire) data set.
  - sum(), count(), min(), and max() are distributive
measures.
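The three steps above can be sketched in Python. This is a minimal illustration rather than anything taken from the slides: the sample values, the chunk size, and the helper name `partition` are assumptions chosen for the example.

```python
# Hypothetical sample data; any list of numbers would do.
data = [4, 8, 15, 16, 23, 42, 7, 1]

def partition(values, size):
    """Split values into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

chunks = partition(data, 3)  # (i) partition the data

# (ii) compute each measure per chunk, then (iii) merge the partial results
total = sum(sum(c) for c in chunks)
count = sum(len(c) for c in chunks)
smallest = min(min(c) for c in chunks)
largest = max(max(c) for c in chunks)

# Each merged result equals the measure computed on the whole data set.
assert (total, count, smallest, largest) == (sum(data), len(data), min(data), max(data))
```

Because each chunk is processed independently, the per-chunk step could in principle run on separate machines or database partitions.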
Measuring the Central Tendency [3]
• An algebraic measure is computed by
applying an algebraic function to one or
more distributive measures.
  - The average (or mean()) is an algebraic measure
because it can be computed as sum()/count().
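As a small sketch of this idea, the mean of a partitioned data set can be obtained from per-partition sums and counts alone. The chunk values and the function name `mean_from_partials` are hypothetical.

```python
def mean_from_partials(partials):
    """Combine (sum, count) pairs from each partition into an overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical partitions of one data set.
chunks = [[2, 4, 6], [8, 10], [12]]

# Only the distributive measures sum() and count() are computed per chunk.
partials = [(sum(c), len(c)) for c in chunks]

overall_mean = mean_from_partials(partials)
```

Averaging the per-chunk means directly would generally give a different answer when the chunks differ in size, which is why the sums and counts are merged first.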
Example
• Weighted arithmetic mean (weighted average)
  - Each value xi in a set may be associated with a
weight wi, for i = 1, …, N.
  - The weights reflect the significance, importance, or
occurrence frequency attached to their respective
values.
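In formula form, the weighted mean is the sum of the products wi·xi divided by the sum of the weights. A minimal Python sketch, with made-up values and weights and a hypothetical function name:

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: scores with weights giving occurrence frequency.
scores = [70, 80, 90]
freqs = [1, 3, 1]

result = weighted_mean(scores, freqs)
```

With all weights equal, the weighted mean reduces to the ordinary arithmetic mean.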
Measuring the Central Tendency [4]
• A holistic measure is computed on the entire
data set as a whole. It cannot be computed by
partitioning the given data into subsets and
merging the values obtained for the measure
in each subset.
  - The median is an example of a holistic measure.
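A short Python sketch suggests why the median cannot simply be merged from partition results. The data and the way it is split are hypothetical; `statistics.median` comes from the Python standard library.

```python
from statistics import median

# Hypothetical data split into two partitions.
part_a = [1, 2, 3]
part_b = [4, 5, 100, 200]

# Median of the entire data set.
true_median = median(part_a + part_b)

# Trying to merge the per-partition medians gives a different value.
merged_median = median([median(part_a), median(part_b)])
```

Here the true median is 4, while the median of the two partition medians (2 and 52.5) is 27.25, so the per-partition results do not carry enough information to recover the overall value.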
Measuring the Central Tendency [5]
• The mode is the value that occurs most
frequently in the set.
• It is possible for the greatest frequency to
correspond to several different values, which
results in more than one mode.
• Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and
trimodal.
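One way to find every mode of a small data set is `statistics.multimode` from the Python standard library (Python 3.8 or later). The helper name `describe_modes`, the sample lists, and the fallback label "multimodal" for more than three modes are assumptions made for this sketch.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # all most-frequent values, in first-seen order
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

# Hypothetical data sets.
one = describe_modes([1, 2, 2, 3])
two = describe_modes([1, 1, 2, 2, 3])
```

Note that if every value occurs exactly once, `multimode` returns all of them, so a real analysis would likely treat that case separately.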
Measuring the Dispersion of Data [1]
• The degree to which numerical data tend to
spread is called the dispersion, or variance, of
the data.
• Common measures of data dispersion are:
  - the range
  - the five-number summary (based on quartiles)
  - the interquartile range
  - the standard deviation
Measuring the Dispersion of Data [2]
• Let x1, x2, …, xN be a set of observations for
some attribute.
• The range of the set is the difference
between the largest (max()) and smallest
(min()) values.
Measuring the Dispersion of Data [3]
• Quartiles
  - The first quartile, denoted by Q1, is the 25th
percentile; the third quartile, denoted by Q3, is
the 75th percentile.
• Interquartile range (IQR)
  - The distance between the first and third quartiles: IQR = Q3 − Q1.
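The interquartile range is commonly defined as Q3 − Q1. The sketch below computes the quartiles with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sample values are hypothetical, and the choice of `method="inclusive"` is an assumption: several quartile conventions exist and they can give slightly different values.

```python
from statistics import quantiles

# Hypothetical observations for some numeric attribute.
values = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 asks for the three quartile cut points; "inclusive" treats the
# data as the whole population rather than a sample of it.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range
```

For this data the quartiles are 4, 7, and 10, giving an IQR of 6.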
Measuring the Dispersion of Data [4]
• The five-number summary of a distribution
consists of the median, the quartiles Q1 and
Q3, and the smallest and largest individual
observations, written in the order
{Minimum, Q1, Median, Q3, Maximum}
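A minimal Python sketch of the five-number summary follows. The function name `five_number_summary` and the sample data are hypothetical, and the quartiles use the "inclusive" convention of `statistics.quantiles`, which is only one of several in use.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Hypothetical observations.
summary = five_number_summary([10, 20, 30, 40, 50])

# The range follows from the same summary: largest minus smallest value.
value_range = summary[4] - summary[0]
```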
Variance and Standard Deviation
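• The variance of N observations, x1, x2, …, xN, is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2,$$
where \bar{x} is the mean of the observations.
• The standard deviation, σ, of the observations
is the square root of the variance, σ².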
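The sketch below computes both quantities in Python. It assumes the population form of the variance (dividing by N), which matches the σ notation used here; the sample variance would divide by N − 1 instead. The data values are hypothetical.

```python
from math import sqrt
from statistics import pvariance

# Hypothetical observations.
x = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(x)
mean = sum(x) / n

# Population variance: average squared deviation from the mean.
variance = sum((xi - mean) ** 2 for xi in x) / n

# Standard deviation: square root of the variance.
std_dev = sqrt(variance)

# Cross-check against the standard library's population variance.
assert pvariance(x) == variance
```

For this data the mean is 5, the variance is 4, and the standard deviation is 2.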
Q&A