Week#03 - Data Preprocessing (Microsoft PowerPoint)



Data Preprocessing
Opas Wongtaweesap (Aj’OaT)
Intelligent Information Systems Development and Research Laboratory Centre
URL: http://www.facebook.com/OaTCoM, http://twitter.com/OaTCoM
E-mail: [email protected], [email protected]

Agenda
• Why preprocess the data?
• Data Cleaning
• Data Integration
• Data Transformation
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data
Data Mining & WEKA
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility
• Broad categories:
– intrinsic, contextual, representational, and accessibility

Data Preprocessing Tasks
• Data Cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation
– Normalization and aggregation
• Data Reduction
– Obtains a reduced representation in volume that produces the same or similar analytical results
• Data Discretization
– Part of data reduction, with particular importance for numerical data
Forms of data preprocessing
Data Cleaning
Data Cleaning Tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data

Missing Data
• Data is not always available
– e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistency with other recorded data (and hence deletion)
– data not entered due to misunderstanding
– certain data not considered important at the time of entry
– history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
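As a rough illustration of the mean and class-conditional-mean strategies above, here is a minimal Python sketch; the record layout and income/class values are invented for the example, not taken from the slides:

```python
# Each record: (income or None, class_label) -- hypothetical data.
records = [
    (30000, "low"), (None, "low"), (50000, "low"),
    (80000, "high"), (None, "high"), (120000, "high"),
]

def overall_mean(rows):
    # Mean of the attribute over all rows where it is present.
    vals = [v for v, _ in rows if v is not None]
    return sum(vals) / len(vals)

def class_means(rows):
    # Per-class mean of the attribute, ignoring missing values.
    sums, counts = {}, {}
    for v, c in rows:
        if v is not None:
            sums[c] = sums.get(c, 0) + v
            counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# Strategy: fill with the overall attribute mean.
mean = overall_mean(records)
filled_global = [(v if v is not None else mean, c) for v, c in records]

# Smarter strategy: fill with the mean of samples in the same class.
by_class = class_means(records)
filled_class = [(v if v is not None else by_class[c], c) for v, c in records]
```

The class-conditional fill gives the "low" tuple a much smaller imputed income than the global mean would, which is why the slide calls it the smarter variant.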
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema Integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., Sex = Gender, StdCode = StudentCode

Data Integration (Con’t)
• Detecting and resolving data value conflicts
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table

Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
– for the same real-world entity, attribute values from different sources may differ
– possible reasons: different representations, different scales, e.g., Coke = Coka Cola, Pepsi
• Redundant data may be detected by correlational analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: new attributes constructed from the given ones

Normalization
• min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
– divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical attributes
– Reduce data size by discretization
– Prepare for further analysis

Discretization Example I
Discretization Example II
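The three normalization formulas above can be sketched directly in Python; the parameter names follow the slide's symbols (min_A, max_A, mean_A, stand_dev_A), and the functions themselves are an illustration rather than any particular library's API:

```python
def min_max(v, min_a, max_a, new_min, new_max):
    # v' = (v - min_A)/(max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stand_dev_a):
    # v' = (v - mean_A) / stand_dev_A
    return (v - mean_a) / stand_dev_a

def decimal_scaling(values):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance, min-max rescaling an income of 73600 with min_A = 12000, max_A = 98000 onto [0, 1] gives about 0.716, and decimal scaling of values ranging over [-986, 917] divides by 10^3.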
Discretization and Concept Hierarchy
• Discretization
– reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
– reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior), e.g., Age < 15 is young.

Concept Hierarchies
• A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
• Example (location):
– country: 15 distinct values
– province_or_state: 65 distinct values
– city: 3,567 distinct values
– street: 674,339 distinct values
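A minimal sketch of replacing numeric age values by concept-hierarchy labels, as described above. Only the rule "Age < 15 is young" comes from the slides; the other cut-off is an assumption made for the example:

```python
def age_label(age):
    # Replace a numeric age with a higher-level concept.
    # "Age < 15 is young" is from the slide; the cut-off of 60 is assumed.
    if age < 15:
        return "young"
    elif age < 60:
        return "middle-aged"
    else:
        return "senior"

ages = [7, 14, 35, 61]
labels = [age_label(a) for a in ages]
```

Replacing the raw ages with these interval labels is exactly the data-reduction effect the slide attributes to concept hierarchies: many distinct numeric values collapse to a few categories.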
Data Mining: A KDD Process

Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data Cleaning and Data Integration
– Data Reduction and Feature Selection
– Discretization
• A lot of methods have been developed, but this is still an active area of research