Download Data Preprocessing

Document related concepts

Clusterpoint wikipedia , lookup

Big data wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Data center wikipedia , lookup

Data model wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Data analysis wikipedia , lookup

Relational model wikipedia , lookup

Forecasting wikipedia , lookup

3D optical data storage wikipedia , lookup

Information privacy law wikipedia , lookup

Business intelligence wikipedia , lookup

Data vault modeling wikipedia , lookup

Database model wikipedia , lookup

Transcript
Data Preprocessing

Relational Databases - Normalization


Denormalization
Data Preprocessing




Missing Data
Missing values and the 3VL approach
Problems with 3VL approach
Special Values
Remember:
Relational Databases



Model entities and relationships
Entities are the things in the real world
Information about employees and the
department they work for


Employee and department are entities
Relationships are the links between these
entities

Employee works for a department
Relation or Tables

Relation: a table of data
• table = relation,(set theory, based on predicate logic)
emplyeeID
name
job
departmentID
7513 Nora
Programmer
128
9842 Ben
DBA
6651 Alex
Programmer
128
9006 Claudia
SystemAdministrator
128
42
Columns and Rows

Each column or attribute describes some
piece of data that each record in the table
has

Each row in a table represents a record

Rows, records or tupels
Keys

A superkey is a column (or a set of columns) that can
be used to identify a row in as table


There are different possible keys




a key is a minimal superkey
candidate keys
We chose form the candidate keys the primary key
Primary key is used to identify a single row (record)
Foreign keys represents links between tables
Keys
primary key
 foreign key

emplyeeID
name
job
departmentID
7513 Nora
Programmer
128
9842 Ben
DBA
6651 Alex
Programmer
128
9006 Claudia
SystemAdministrator
128
42
Functional Dependencies

If there is a functional dependency between columns A
and B in a given table which may be written
AB


Then the value of column A determines the value of
column B
employeeID functionally determines the name

Schema

Database schema
Structure or design of the database
 Database without any data in it

employee(employeeID,name,job,departmentID)
Design

Minimize redundancy

Redundancy: data is repeated in different
rows
employee(employeeID,name,job,departmentID,departmentName)
emplyeeID
name
job
departmentID
7513
Nora
Programmer
9842
Ben
DBA
6651
Alex
Programmer
128
Research and Development
9006
Claudia
System-Administrator
128
Research and Development
128
42
departamentName
Research and Development
Finance
Reduce redudancy
employee(employeeID,name,job,departmentID,departmentName)
employee(employeeID,name,job,departmentID)
employee(departmentID,name)
Insert Anomalies
Insert data into flawed table
 Data does not match what is already in
the table
 It is not obvious which of the rows in the
database is correct

Deletion Anomalies

Delete data from a flawed schema

When we delete all the employees of
Department 128, we no longer have any
record, that the Department 128 exists
Update Anomalies

Change data in a flawed schema

We do not change the data for every row
correctly
Null Values

Avoid schema designs that have large
numbers of empty attributes
Normalization
Remove design flaws from a database
 Normal forms, which are a set of rules
describing what we should and should not
do in our table structures
 Breaking tables into smaller tables that
form a better design

Normal Forms
1 Forma Normal
2 Forma Normal
3 Forma Normal
Forma Normal Boyce Codd
4 Forma Normal
5 Forma Normal
First Normal Form (1NF)
Each attribute or column value must be
atomic
 Each attribute must contain a single value

emplyeeID
name
job
department
ID
skills
7513 Nora
Programmer
128 C, Perl, Java
9842 Ben
DBA
6651 Alex
Programmer
128 VB, Java
9006 Claudia
System-Administrator
128 NT, Linux
42 DB2
1NF
emplyeeID
name
job
departmentID
skills
7513 Nora
Programmer
128 C
7513 Nora
Programmer
128 Perl
7513 Nora
Programmer
128 java
9842 Ben
DBA
6651 Alex
Programmer
128 VB
6651 Alex
Programmer
128 java
9006 Claudia
System-Administrator
128 NT
9006 Claudia
System-Administrator
128 Linux
42 DB2
Second Normal Form (2NF)

All attributes that are no part of the
primary key are fully dependent on the
primary key


Each non key attribute must be functionally
dependent on the key
Is already in 1NF
2NF ?
emplyeeID
name
job
departmentID
skills
7513 Nora
Programmer
128 C
7513 Nora
Programmer
128 Perl
7513 Nora
Programmer
128 java
9842 Ben
DBA
6651 Alex
Programmer
128 VB
6651 Alex
Programmer
128 java
9006 Claudia
System-Administrator
128 NT
9006 Claudia
System-Administrator
128 Linux
employee(employeeID,name,job,departmentID,skill)
42 DB2
Functional dependencies


employeeID,skill
employeeID


name, job, deparmentID
name, job, deparmentID
Partially functionally dependent on the
primary key
 Not fully functionally dependent on the
primary key

2NF

Decompose the table into tables which all
the non-key attributes are fully functionally
dependent on the key

Breaking the table into two tables


employee(employeeID,name,job,departmentID)
employeeSkills(employeeID,skill)
Third Normal Form (3NF)
Remove all transitive dependencies
 Be in 2NF

employeID
name
job
departmentID department
Name
7513 Nora
Programmer
9842 Ben
DBA
6651 Ajay
Programmer
42 Finance
9006 Candy SYS

128 Research
128 Research
128 Research
employee(employeeID,name,job,departmentID,departmentName)


employeeID
departmentID


name,job,departmentID,departmentName
departmentName
Transitive dependency

employeeID
employeeID


departmentName
deparmtentID

departmentID 
departmentName

3NF
Remove transitive dependency
 Decompose into multiple tables



emploee(employeeID,name,jop,departmentID)
deparment(deparmentID,deparmtentName)
3NF

The left side of the functional dependency is a
superkey (that is, a key that is not necessarily
minimal)

Boyce-Codd Normal Form
or

The right side of the functional dependency is a
part of any key of the table
BCNF

All attributes must be functionally determined by
a superkey




Full normalization means lots of logically
seperate relations
Lots of logically separate relations means a lot
of physically separate files
Lots of physically separate files means a lot of
I/O
Difficulties in finding dimensions for dimensional
schema, star schema

(dimension tables, fact table)
What is Denormalization?

Normalizing a relational variable R means
replacing R by a set of projections R1,R2,..,Rn
such that R is equal to the join R1,R2,..,Rn


Denormalizing the relational variables means
replacing them by their join R


Reduce redundancy, each projections R1,R2,..,Rn is
at the highest possible value of normalization
Increase redundancy, by ensuring that R is a lower
level of normalization than R1,R2,..,Rn
Problems

Once we start to denormalize, it is not clear when to
stop?
Dimensional Schema

Array cells often empty




The more dimensions, there more empty cells
Empty cell  Missing information
How to treat not present information ?
How does the system support
•
•
•
•

Information is unknown
Has been not captured
Not applicable
....
Solution?
Why Data Preprocessing?

Data in the real world is dirty

incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”

noisy: containing errors or outliers
• e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?

Incomplete data may come from




Noisy data (incorrect values) may come from




Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from



“Not applicable” data value when collected
Different considerations between the time when the data was
collected and when it is analyzed.
Human/hardware/software problems
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing
Important?

No quality data, no quality mining results!

Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.


Data warehouse needs consistent integration of quality
data
Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse
Multi-Dimensional Measure of
Data Quality

A well-accepted multidimensional view:








Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing

Data cleaning


Data integration


Normalization and aggregation
Data reduction


Integration of multiple databases, data cubes, or files
Data transformation


Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization

Part of data reduction but with particular importance, especially
for numerical data
Forms of Data Preprocessing
Data Cleaning

Importance



“Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration
Missing Data

Data is not always available



E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the time of
entry

not register history or changes of the data
Missing data may need to be inferred
Missing Values
The approach of the problem of missing
values adopted in SQL is based on nulls
and three-valued logic (3VL)
 null corresponds to UNK for unknown
 3VL a mistake?

Boolean Operators
AND t
u
f
OR
t
u
f
NOT
t
t
u
f
t
t
t
t
t
f
u
u
u
f
u
t
u
u
u
u
f
f
f
f
f
t
u
f
f
t

In scalar comparison in which either of the
compared is UNK evaluates the unknown truth
value
MAYBE

Another important Boolean operator is
MAYBE
MAYBE
t
f
u
t
f
f
Example

Consider the query “Get employees who
may be- but are not definitely known to
be- programmers born before January 18,
1971, with salary less then €40.000
EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND
DOB < DATE (‘1971-1-18’) AND
SALLARY < 40000 )

Without maybe we assume the existence of another operator called
IS_UKN which takes a single scalar operand and returns true if
operand evaluates UNK otherwise false
EMP WHERE ( JOB = ‘PROGRAMMER’
OR IS_UKN (JOB) )
AND ( DOB < DATE (‘1971-1-18’)
OR IS_UKN (DOB) )
AND ( SALLARY < 40000
OR IS_UKN (SALLARY) )
AND NOT ( JOB = ‘PROGRAMMER’ AND
DOB < DATE (‘1971-1-18’) AND
SALLARY < 40000 )
Numeric expression

WEIGHT * 454
If WEIGHT is UKN, then the result is also
UKN
 Any numeric expression is considered to
evaluate UNK if any operands of that
expression is itself UNK


Anomalies
WEIGHT-WEIGHT=UNK
 WEIGHT/0=UNK

(0)
(“zero divide”)
UNK is not u (unk)



UNK (the value-unknown null)
u (unk) (unknown truth value)
...are not the same thing


u is a value, UNK not a value at all!
Suppose X is BOOLEAN



Has tree values: t (true),f (false), u ukn
X is ukn, X is known to be unk
X is UKN, X is not known!
Some 3VL Consequences

The comparison x=x does not give true


In 3VL x is not equal to itself it is happens to
be UNK
The Boolean expression p OR NOT(p)
does not give necessarily true

unk OR NOT (unk) = unk
Example

Get all suppliers in Porto and take the union
with get all suppliers not in Porto


We need to add maybe in Porto



We do not get all suppliers!
In 2 VL p OR NOT(p) corresponds to
p OR NOT(p) OR MAYBE(p) in 3VL
While two cases my exhaust full range of
possibilities in the real world, the database does
not contain the real world - instead it contains
only knowledge about real world
Some 3VL Consequences
The expression r JOIN r does not
necessarily give r
 A=B and B=C together does not imply
A=C
 ....
 Many equivalences that are valid in 2VL
break down in 3VL
 We will get wrong answers

Special Values
Drop the idea of null and UNK,unk 3VL
 Use special values instead to represent
missing information


Special values are used in the real world

In the real world we might use the special
value „?“ to denote hours worked by a certain
employee if actual value is unknown
Special Values

General Idea:

Use an appropriate special value, distinct from
all regular values of the attribute in question,
when no regular value can be used
The special value must be of the applicable
attribute is not just integers, but integers
integers plus whatever the special value is
Approach is not very elegant, but without 3VL
problems, because it is in 2VL


How to Handle Missing Data?

Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification—not effective when the percentage of
missing values per attribute varies considerably.

Fill in the missing value manually: tedious + infeasible?

Fill in it automatically with

a global constant : e.g., “unknown”, a new class?!

the attribute mean

the attribute mean for all samples belonging to the same class: smarter

the most probable value: inference-based such as Bayesian formula or
decision tree

Relational Databases - Normalization


Denormalization
Data Preprocessing




Missing Data
Missing values and the 3VL approach
Problems with 3VL approach
Special Values
Next..

Data Preprocessing
Visual inspection
 Noise Reduction
 Data Reduction
 Data Discretization
 Data Integration
