DATA PREPROCESSING
Tzompanaki Katerina
Background: Data storage formats
•  Data can be stored in a DBMS
•  ODBC, JDBC protocols
•  Data in a flat file
•  Fixed-width format (each column has a specific number of characters, padded with special characters if needed)
•  Delimited format: tab, comma ",", other
•  Attention: escape or quote field delimiters that appear inside strings
•  Verify the number of attributes before and after conversion
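As an illustration, both flat-file formats can be loaded with pandas; the file names, column widths and column names below are hypothetical:

```python
# A minimal sketch of reading a fixed-width and a delimited flat file with pandas
# (file names, widths and column names are illustrative).
import pandas as pd

# Fixed-width format: each column occupies a fixed number of characters
fixed = pd.read_fwf("patients_fixed.txt", widths=[10, 3, 12],
                    names=["name", "age", "disease"])

# Delimited format: quoting ensures commas inside strings are not treated as delimiters
delim = pd.read_csv("patients.csv", sep=",", quotechar='"')

# Verify the number of attributes before and after conversion
print(fixed.shape[1], delim.shape[1])
```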
Background: Data and attributes
•  Some frequently encountered terminology:
•  Data objects are also called data points, samples, examples, vectors, instances, or data tuples. They are entities in a given context in a given dataset, e.g., patients, products, etc.
•  Attributes are also called features, variables, or dimensions.
•  An attribute vector is the set of attributes used to describe a given data object. E.g., the attribute vector <Name, Disease, Prescription> describes patient data objects.
•  Observed values for attributes are called observations. E.g., cancer, high blood pressure, and flu may be the observations for the disease attribute in a given dataset.
Background: Attribute types
•  Nominal (or categorical) attributes refer to names of things, or categories that normally have no order. E.g., marital status (single, married, divorced), color (blue, green, etc.) or userID (323, 235, etc.).
•  A binary attribute is a nominal attribute with two possible values, 0 or 1, indicating absence or presence. E.g., for a patient we could have the following binary attributes: smoker (yes, no), sex (male, female), test (positive, negative).
•  An ordinal attribute is an attribute whose values have an ordering or ranking. E.g., grades (A>B>C), sizes (large>medium>small).
Qualitative attributes: they describe a feature of an object without giving an actual size or quantity.
Background: Attribute types
•  Numeric attributes are used to describe measurable quantities and are represented using numbers (integers or reals). They provide a ranking and allow for mathematical operations. E.g., temperature (20°C − 15°C), age (44 years old is twice as old as 22 years old), etc.
Quantitative attributes: they describe measurable quantities.
•  Another categorisation:
•  Discrete attributes have a finite or countably infinite set of values, which may or may not be represented as integers. E.g., hair color, smoker, size, age, etc.
•  Continuous attributes are attributes that are not discrete, and are thus represented as floating-point numbers. E.g., length, income, price, etc.
Background: Basic Statistical Descriptions of Data
Measures of central tendency
•  The mean of an attribute x over a multi-set of N observations is the arithmetic average of the values:

   $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \dots + x_N}{N}$

•  The median is the middle value in an ordered set of values. If the number of values is even, the median is not unique (it is commonly taken as the mean of the two middle values). The median better represents skewed (non-symmetric) data and is less sensitive to outliers.
•  The mode is the most frequent value. If several values share the highest frequency, we talk about multimodal datasets. The mode can also be used for nominal attributes.
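As a quick illustration of these measures (with made-up values), pandas can compute them directly:

```python
# Central tendency measures on a small, made-up set of observations
import pandas as pd

x = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print(x.mean())    # 20.333... (arithmetic average)
print(x.median())  # 22.5, the mean of the two middle values 21 and 24
print(x.mode())    # 21, the most frequent value (several values if multimodal)
```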
Background: Basic Statistical Descriptions of Data
Measures of data dispersion
•  The range of a numeric attribute is the difference between the maximum and the minimum observation (max() − min()).
•  The quantiles separate an ordered numerical set into equal-sized subsets (each containing the same fraction of the data). The kth q-quantile for a given data distribution is the value v such that at most k/q of the data values are less than v and at most (q−k)/q of the data values are more than v, where k is an integer such that 0 < k < q. The 100-quantiles are called percentiles.
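A small sketch of range and quantiles on the same made-up values, using numpy:

```python
# Range and quantiles of a small, made-up numeric attribute
import numpy as np

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print(x.max() - x.min())                  # range: 30
print(np.quantile(x, [0.25, 0.5, 0.75]))  # the 4-quantiles (quartiles)
print(np.percentile(x, 90))               # the 90th percentile (100-quantiles)
```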
Background: Basic Statistical Descriptions of Data
Measures of data dispersion
•  The variance (σ²) and standard deviation (σ) indicate how spread out the distribution of an attribute x is. A low standard deviation means that the observations tend to be very close to the mean, while a high standard deviation indicates that the observations are spread out over a large range of values.

   $\sigma^2(x) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$

•  The covariance cov(x,y) of two attributes shows how correlated the attributes are. A positive covariance cov(x,y) > 0 shows that y increases as x increases, while a negative one cov(x,y) < 0 indicates that y decreases as x increases.

   $cov(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$

•  Finally, we define the covariance matrix for x, y (it can be extended to cover all data variables):

   $\begin{pmatrix} \sigma^2(x) & cov(x,y) \\ cov(y,x) & \sigma^2(y) \end{pmatrix}$

   Note that cov(x,y) = cov(y,x), so the covariance matrix is square and symmetric.
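A minimal numpy sketch of these dispersion measures (illustrative values; ddof=1 gives the 1/(N−1) formulas above):

```python
# Variance, standard deviation and the covariance matrix of two attributes
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print(np.var(x, ddof=1))   # sample variance of x
print(np.std(x, ddof=1))   # sample standard deviation of x
print(np.cov(x, y))        # 2x2 matrix [[var(x), cov(x,y)], [cov(y,x), var(y)]]
```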
Background: Displaying Data
•  Histograms are used to summarize the distribution of observations. Each bar represents the frequency of the observations that fall in a bucket. For ordered numeric values, we split the range into equally sized buckets. The range of a bucket is called its width.
•  Scatter plots are used to observe correlations between pairs of numeric attributes.
(Figure: scatter plots showing positive (left) and negative (right) correlation.)
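A minimal matplotlib sketch of the two displays, on synthetic data generated for illustration:

```python
# Histogram of one attribute and scatter plot of two positively correlated attributes
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)   # y rises with x -> positive correlation

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
left.hist(x, bins=20)       # bar heights = frequency per equally sized bucket
right.scatter(x, y, s=10)   # point cloud reveals the correlation
plt.show()
```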
Why Preprocessing?
•  Data in the real world is dirty
•  incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
•  noisy: containing errors or outliers
•  inconsistent: containing discrepancies in codes or names
•  No quality data, no quality mining results!
•  Quality decisions must be based on quality data
•  Data warehouse needs consistent integration of quality data
Data understanding: Relevance
•  What data are available for the task?
•  Are these data relevant?
•  Are additional relevant data available?
•  How much historical data are available?
•  Who is the data expert?
Data understanding: Quantity
•  Number of instances (records, objects)
•  Rule of thumb: 5,000 or more desired
•  if less, results are less reliable; use special methods (like boosting, not covered in this course)
•  Number of attributes
•  Rule of thumb: for each attribute, 10 or more instances
•  If there are more attributes, use feature reduction and selection
•  Number of targets
•  Rule of thumb: >100 instances for each class
•  If very unbalanced, use stratified sampling
Forms of data preprocessing
(Figure giving an overview of the forms of data preprocessing.)
Major Tasks in Data Preprocessing
•  Data cleaning
•  Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
•  Data integration
•  Integration of multiple databases, data cubes, or files
•  Data transformation
•  Normalization and aggregation
•  Data reduction
•  Obtain a reduced representation of the data that is much smaller in volume, yet produces the same or similar analytical results
•  Data discretization
•  Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning
•  Reformat data
•  Fill in missing values
•  Handle dates
•  Convert data
•  Identify outliers and smooth out noisy data
•  Correct inconsistent data
Reformatting Data
Convert data to a standard format
•  Missing values
•  Unified date format
•  Binning of numeric data
•  Fix errors and outliers
•  Convert nominal fields whose values have an order to numeric
•  Why? To be able to use ">" and "<" comparisons on these fields
Missing Data
•  Data is not always available
•  E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
•  Missing data may be due to
•  equipment malfunction
•  inconsistent with other recorded data and thus deleted
•  data not entered due to misunderstanding
•  certain data may not be considered important at the time of entry
•  history or changes of the data were not registered
•  Missing data may need to be inferred
Handling Missing Data
•  Ignore the tuple: usually done in classification tasks when the tuple's class label (target value) is missing. (Ineffective)
•  Fill in the missing value manually. (Inefficient and tedious)
•  Use a global constant to fill in the missing value. (Not foolproof)
•  Use a measure of central tendency: fill in the missing value with the attribute mean/median, or with the attribute mean of all samples belonging to the same class. (Smarter)
•  Use the most probable value to fill in the missing value: in a supervised manner, find the most probable value using inference-based mechanisms such as a Bayesian formula or a decision tree. (Best choice)
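A minimal pandas sketch of these strategies, on a small illustrative table with a missing income attribute:

```python
# Four ways of handling missing values in an "income" column (illustrative data)
import pandas as pd

df = pd.DataFrame({
    "income": [3000, None, 4500, None, 5200],
    "class":  ["A", "A", "B", "B", "B"],
})

df_drop  = df.dropna(subset=["income"])                    # ignore the tuple
df_const = df.fillna({"income": 0})                        # global constant
df_mean  = df.fillna({"income": df["income"].mean()})      # attribute mean
df_class = df.copy()                                       # attribute mean per class
df_class["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
```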
Unified Date Format
•  We want to transform all dates to the same format internally
•  Some systems accept dates in many formats
•  e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc
•  dates are transformed internally to a standard value
•  Frequently, just the year (YYYY) is sufficient
•  For more details, we may need the month, the day, the hour, etc
•  Representing date as YYYYMM or YYYYMMDD can be OK, but has
problems
•  What are the problems with YYYYMMDD dates?
•  YYYYMMDD does not preserve intervals:
•  20040201 − 20040131 ≠ 20040131 − 20040130 (the differences are 70 and 1, although each pair is one day apart)
Unified Date Format Options
•  To preserve intervals, we can use
•  Unix system date: Number of seconds since Jan 1, 1970
•  Number of days since Jan 1, 1960 (SAS)
•  Problem:
•  values are non-obvious
•  don’t help intuition and knowledge discovery
•  harder to verify, easier to make an error
KSP Date Format
KSP_Date = YYYY + (day_of_year − 0.5) / (365 + 1_if_leap_year), where day_of_year counts from 1 on Jan 1
•  Preserves intervals between days
•  The year is obvious
•  Sep 24, 2003 is day 267 of the year, so: 2003 + (267 − 0.5)/365 = 2003.7301 (rounded to 4 digits)
•  Can be extended to include time
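A minimal sketch of the KSP conversion in Python, using the standard library to get the day of the year (the function name is mine):

```python
# KSP date: year + (day_of_year - 0.5) / number_of_days_in_that_year
from datetime import date

def ksp_date(d: date) -> float:
    day_of_year = d.timetuple().tm_yday                       # Sep 24, 2003 -> 267
    leap = d.year % 4 == 0 and (d.year % 100 != 0 or d.year % 400 == 0)
    return round(d.year + (day_of_year - 0.5) / (366 if leap else 365), 4)

print(ksp_date(date(2003, 9, 24)))   # 2003.7301
```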
Conversion: Nominal to Numeric
•  Some methods can deal with nominal values internally.
•  Other methods (regression, nearest neighbor, neural
networks) require only numeric inputs.
•  To use nominal fields in such methods we need to convert
them to a numeric value.
•  Different strategies for binary, ordered, multi-valued
nominal fields.
Conversion: Binary to Numeric
•  Binary fields
•  E.g. Gender=M, F
•  Convert to Field_0_1 with 0, 1 values
•  e.g. Gender = M → Gender_0_1 = 0
        Gender = F → Gender_0_1 = 1
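A one-line pandas version of this conversion (column values are illustrative):

```python
# Map a binary nominal field to 0/1
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})
df["Gender_0_1"] = df["Gender"].map({"M": 0, "F": 1})
```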
Conversion: Ordered to Numeric
•  Ordered attributes (e.g. Grade) can be converted to
numbers preserving natural order, e.g.
•  A → 4.0
•  A- → 3.7
•  B+ → 3.3
•  B → 3.0
Why is it important to preserve natural order?
•  To allow meaningful comparisons, e.g. Grade > 3.5
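A minimal pandas sketch of the ordered conversion, after which such comparisons become possible:

```python
# Map ordered grades to numbers that preserve the natural order
import pandas as pd

grades = pd.Series(["A", "B+", "A-", "B"])
grade_points = grades.map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})
print(grade_points > 3.5)   # meaningful ">" comparison, e.g. Grade > 3.5
```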
Conversion: Nominal, Few Values
•  Multi-valued, unordered attributes with small (rule of
thumb < 20) no. of values
•  e.g. Color=Red, Orange, Yellow, …, Violet
•  for each value v create a binary “flag” variable C_v, which is 1 if
Color=v, 0 otherwise
•  Also called one-hot-encoding or dummy variable method.
   ID   color            ID   C_red   C_orange   C_yellow
   100  red              100  1       0          0
   101  yellow           101  0       0          1
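The same one-hot encoding can be obtained with pandas (note that get_dummies only creates flags for the values actually observed in the data, so there is no C_orange column here):

```python
# One-hot encode the color attribute of the toy table above
import pandas as pd

df = pd.DataFrame({"ID": [100, 101], "color": ["red", "yellow"]})
one_hot = pd.get_dummies(df, columns=["color"], prefix="C")
print(one_hot)   # columns: ID, C_red, C_yellow
```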
Conversion: Nominal, Many Values
•  Examples:
•  US State Code (50 values)
•  Profession Code (7,000 values, but only a few are frequent)
•  How to deal with such fields?
•  Ignore ID-like fields whose values are unique for each
record.
•  For other fields, group values “naturally”:
•  e.g. 50 US States à 3 or 5 regions
•  Profession à select most frequent ones, group the rest
•  Create binary flag-fields (one-hot-encoding) for selected
values.
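A minimal sketch of the grouping idea with pandas; the profession values and the number of frequent values to keep are illustrative:

```python
# Keep the most frequent values, group the rest into "Other", then one-hot encode
import pandas as pd

df = pd.DataFrame({"profession": ["nurse", "teacher", "nurse", "astronaut", "teacher", "nurse"]})

top = df["profession"].value_counts().nlargest(2).index        # most frequent values
grouped = df["profession"].where(df["profession"].isin(top), "Other")
one_hot = pd.get_dummies(grouped, prefix="prof")               # prof_nurse, prof_teacher, prof_Other
```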
Noisy Data
•  Noise: random error or variance in a measured variable
•  Incorrect attribute values may be due to
•  faulty data collection instruments
•  data entry problems
•  data transmission problems
•  technology limitation
•  inconsistency in naming convention
•  Other data problems which require data cleaning
•  duplicate records
•  incomplete data
•  inconsistent data
How to Handle Noisy Data?
•  Binning method
•  first sort data and partition into bins
•  then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
•  Clustering
•  detect and remove outliers
•  Combined computer and human inspection
•  detect suspicious values and check by human
•  Regression
•  smooth by fitting the data into (linear) regression functions
Simple Discretization Methods: Binning
•  Equal-width (distance) partitioning:
•  It divides the range into N intervals of equal size
•  if A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B−A)/N
•  The most straightforward approach
•  Equal-depth (frequency) partitioning:
•  It divides the range into N intervals, each containing approximately the same number of samples
Binning Methods for Data Smoothing
•  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
•  Partition into (equi-depth) bins:
•  Bin 1: 4, 8, 9, 15
•  Bin 2: 21, 21, 24, 25
•  Bin 3: 26, 28, 29, 34
•  Smoothing by bin means:
•  Bin 1: 9, 9, 9, 9
•  Bin 2: 23, 23, 23, 23
•  Bin 3: 29, 29, 29, 29
•  Smoothing by bin boundaries:
•  Bin 1: 4, 4, 4, 15
•  Bin 2: 21, 21, 25, 25
•  Bin 3: 26, 26, 26, 34
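The same example can be reproduced with a short numpy sketch (assuming the prices are already sorted):

```python
# Equal-depth binning and smoothing of the sorted price values above
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.split(prices, 3)   # three equi-depth bins of four values each

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
means = np.concatenate([np.full(len(b), int(round(b.mean()))) for b in bins])
print(means)    # [ 9  9  9  9 23 23 23 23 29 29 29 29]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
bounds = np.concatenate(
    [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
)
print(bounds)   # [ 4  4  4 15 21 21 25 25 26 26 26 34]
```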
Data Integration
•  Data integration
•  combines data from multiple sources into a coherent store
•  Schema integration
•  integrate metadata from different sources
•  Entity identification problem
•  identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
•  Detecting and resolving data value conflicts
•  for the same real world entity, attribute values from different
sources are different
•  possible reasons: different representations, different scales, e.g.,
meter vs. foot
Handling Redundant Data in Data Integration
•  Redundant data occur often when integrating multiple
databases
•  The same attribute may have different names in different
databases
•  One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
•  Redundant data can often be detected by correlation analysis
•  Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
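A minimal sketch of correlation analysis with pandas, on an illustrative table where one attribute is derived from another:

```python
# Highly correlated columns are candidates for redundancy
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue":  [120, 240, 360, 480],   # derived: 12 * monthly_revenue
    "num_employees":   [3, 1, 4, 2],
})
print(df.corr())   # monthly_revenue and annual_revenue correlate perfectly (1.0)
```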
Data Reduction
•  Dimensionality reduction: reduce the number of
considered attributes
•  Principal component analysis
•  Wavelet transformation
•  Numerosity reduction: reduce the volume of data to
smaller but representative data representations
•  Sampling: pick some of the data
•  Clustering: create clusters of similar items, use clusters instead of
members.
•  Histograms: binning method
•  Data compression: compress data in lossless (if original
data can be reconstructed) or lossy (otherwise) manner
Dimensionality Reduction
•  Purpose
•  Avoid curse of dimensionality
•  Reduce amount of time and memory required by data mining
algorithms
•  Allow data to be more easily visualized
•  May help to eliminate irrelevant features or reduce noise
•  Feature selection
•  Select the most important features
•  Feature extraction
•  Find representative combinations of features to use instead.
Dimensionality reduction: Feature selection
•  Feature selection
•  Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution. There are 2^n possible subsets!
•  Expert knowledge can be utilized to keep the most important features.
•  Automatic feature selection
•  Model-based selection
•  The most important features are selected using a supervised ML algorithm (e.g., a decision tree).
•  Iterative selection
•  Iteratively, the least important features are discarded (backward elimination) or the most important ones are added (forward selection) until the desired number is reached.
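A minimal scikit-learn sketch of both automatic strategies, on a built-in dataset used purely for illustration:

```python
# Model-based selection and iterative (backward) selection with scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel

X, y = load_breast_cancer(return_X_y=True)   # 569 instances, 30 features

# Model-based: keep the features a supervised model deems most important
model_based = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_model = model_based.fit_transform(X, y)

# Iterative (backward elimination): drop the least important feature until 10 remain
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

print(X.shape, X_model.shape, X_rfe.shape)
```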
Dimensionality reduction: Feature extraction
Principal Component Analysis (PCA)
•  Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that can best be used to represent the data
•  The original data set is reduced to a new one consisting of N data vectors on c principal components (reduced dimensions)
•  Each data vector is a linear combination of the c principal component vectors
•  Works for numeric data only
•  We will see PCA in detail when we study unsupervised learning methods.
Principal Component Analysis (PCA)
(Figure*: the perpendicular (orthogonal) arrows show the principal components of the data; the blue arrow is the first principal component, the pink one the second.)
*http://austingwalters.com/pca-principal-component-analysis/
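A minimal PCA sketch with scikit-learn, reducing synthetic k = 5 dimensional data to c = 2 principal components:

```python
# PCA: project N data vectors onto c orthogonal principal components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 vectors in k = 5 dimensions

pca = PCA(n_components=2)                # keep c = 2 components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```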
Numerosity Reduction: Sampling
•  Simple random sample without replacement (SRSWOR)
of size s: randomly pick s samples, all with equal
probability
•  Simple random sample with replacement (SRSWR) of
size s: the same item can be picked more than once
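Both variants map directly onto pandas sampling (illustrative data):

```python
# Simple random sampling without (SRSWOR) and with (SRSWR) replacement
import pandas as pd

df = pd.DataFrame({"price": [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]})

srswor = df.sample(n=5, replace=False, random_state=0)  # each row picked at most once
srswr  = df.sample(n=5, replace=True,  random_state=0)  # the same row may recur
```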
Numerosity Reduction: Sampling
•  Cluster sample: when the data are clustered, pick s of the clusters at random. E.g., data retrieved in memory pages
•  Stratified sample: create strata (levels) in the data to represent the different categories. Then, pick a number of samples from each stratum accordingly. In this way, all strata are guaranteed to be represented in the sample.
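A minimal stratified-sampling sketch with pandas, drawing the same fraction from each class so every stratum appears in the sample (column names are illustrative):

```python
# Stratified sampling: sample per stratum to preserve the class proportions
import pandas as pd

df = pd.DataFrame({
    "age":   [23, 45, 36, 52, 29, 61, 41, 33],
    "class": ["A", "A", "A", "A", "A", "B", "B", "B"],
})

stratified = df.groupby("class", group_keys=False).sample(frac=0.5, random_state=0)
print(stratified["class"].value_counts())   # both strata are represented
```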
Numerosity Reduction: Clustering
•  Create partitions of data objects (clusters), so that objects
within a cluster are “similar” to one another and
“dissimilar” to objects in other clusters. Then use clusters
instead of elements in the clusters.
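As a sketch, k-means centroids can serve as the cluster representatives (synthetic data; the number of clusters is arbitrary here):

```python
# Numerosity reduction by clustering: keep the centroids instead of all points
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
representatives = kmeans.cluster_centers_   # 10 points stand in for 1000
print(representatives.shape)                # (10, 2)
```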
Numerosity Reduction: Histograms
Use histogram representations instead of full data. As we
saw before, histograms (binning method) partition the data
distribution of an attribute A into disjoint buckets that are
•  Equal-width: In an equal-width histogram, the width of each bucket
range is uniform.
•  Equal-frequency (or equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains roughly the same
number of contiguous data samples).
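A short numpy sketch contrasting the two bucket types on the price values used earlier:

```python
# Equal-width vs. equal-frequency bucket edges
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, width_edges = np.histogram(prices, bins=3)     # 3 buckets of identical width
freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])     # edges at quantiles -> ~equal counts

print(width_edges)   # [ 4. 14. 24. 34.]
print(freq_edges)    # approximately [ 4. 19. 25.33 34.]
```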
Data Transformation
•  Smoothing: remove noise from data
•  Discretization: binning, histograms, clusters
(Smoothing and discretization are tasks in common with data cleaning.)
•  Normalization: scale to fall within a small, specified range
•  min-max normalization
•  z-score normalization
•  normalization by decimal scaling
•  Concept hierarchy generalization: replace a value with a
higher class
•  Aggregation: summarization, data cube construction
Normalization
•  min-max normalization

   $v' = \frac{v - min_A}{max_A - min_A}$

•  z-score normalization (standardization)

   $v' = \frac{v - \bar{v}_A}{\sigma_A}$

   v' has zero mean and unit variance (especially useful when the attribute is roughly Gaussian)

•  normalization by decimal scaling

   $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
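A minimal numpy sketch of the three normalizations on a made-up attribute:

```python
# min-max, z-score and decimal-scaling normalization of one attribute
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

minmax = (v - v.min()) / (v.max() - v.min())            # values fall in [0, 1]
zscore = (v - v.mean()) / v.std(ddof=1)                 # zero mean, unit variance
j = int(np.floor(np.log10(np.abs(v).max()))) + 1        # smallest j with max(|v/10^j|) < 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```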
Concept Hierarchies
•  For numerical data, a concept hierarchy can be regarded as a discretization method. E.g., salaries fall into different ranges.
•  For nominal data, hierarchies can be implicitly or explicitly defined in schemas or by the data
•  Specification of a partial ordering of attributes explicitly at the schema level by users or experts. E.g., street < city < province or state < country
•  Specification of a set of attributes for the hierarchy, but not of their partial ordering. To infer the ordering, use the number of distinct values per attribute: attributes with more distinct values are placed lower in the hierarchy.
   country:  15 distinct values
   province: 365 distinct values
   city:     3,567 distinct values
   street:   674,339 distinct values
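A one-line pandas check implements this heuristic (illustrative toy data):

```python
# Infer the hierarchy ordering from the number of distinct values per attribute
import pandas as pd

df = pd.DataFrame({
    "country":  ["FR", "FR", "FR", "DE"],
    "province": ["Ile-de-France", "Ile-de-France", "Val-d'Oise", "Bayern"],
    "city":     ["Paris", "Pontoise", "Cergy", "Munich"],
})
print(df.nunique().sort_values())   # country < province < city (fewest to most distinct values)
```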
Sources
•  Han and Kamber: Data Mining, Concepts and Techniques
•  Nguyen Hung Son: Data cleaning and data preprocessing
•  Prof. Pier Luca Lanzi: Data Exploration and Preparation
•  Muller and Guido: Introduction to Machine Learning with Python