Download Data Preprocessing

Data Preprocessing  Relational Databases - Normalization   Denormalization Data Preprocessing     Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values Remember: Relational Databases    Model entities and relationships Entities are the things in the real world Information about employees and the department they work for   Employee and department are entities Relationships are the links between these entities  Employee works for a department Relation or Tables  Relation: a table of data • table = relation,(set theory, based on predicate logic) emplyeeID name job departmentID 7513 Nora Programmer 128 9842 Ben DBA 6651 Alex Programmer 128 9006 Claudia SystemAdministrator 128 42 Columns and Rows  Each column or attribute describes some piece of data that each record in the table has  Each row in a table represents a record  Rows, records or tupels Keys  A superkey is a column (or a set of columns) that can be used to identify a row in as table   There are different possible keys     a key is a minimal superkey candidate keys We chose form the candidate keys the primary key Primary key is used to identify a single row (record) Foreign keys represents links between tables Keys primary key  foreign key  emplyeeID name job departmentID 7513 Nora Programmer 128 9842 Ben DBA 6651 Alex Programmer 128 9006 Claudia SystemAdministrator 128 42 Functional Dependencies  If there is a functional dependency between columns A and B in a given table which may be written AB   Then the value of column A determines the value of column B employeeID functionally determines the name  Schema  Database schema Structure or design of the database  Database without any data in it  employee(employeeID,name,job,departmentID) Design  Minimize redundancy  Redundancy: data is repeated in different rows employee(employeeID,name,job,departmentID,departmentName) emplyeeID name job departmentID 7513 Nora Programmer 9842 Ben DBA 6651 Alex Programmer 128 Research and Development 9006 Claudia System-Administrator 128 Research and Development 128 42 departamentName Research and Development Finance Reduce redudancy employee(employeeID,name,job,departmentID,departmentName) employee(employeeID,name,job,departmentID) employee(departmentID,name) Insert Anomalies Insert data into flawed table  Data does not match what is already in the table  It is not obvious which of the rows in the database is correct  Deletion Anomalies  Delete data from a flawed schema  When we delete all the employees of Department 128, we no longer have any record, that the Department 128 exists Update Anomalies  Change data in a flawed schema  We do not change the data for every row correctly Null Values  Avoid schema designs that have large numbers of empty attributes Normalization Remove design flaws from a database  Normal forms, which are a set of rules describing what we should and should not do in our table structures  Breaking tables into smaller tables that form a better design  Normal Forms 1 Forma Normal 2 Forma Normal 3 Forma Normal Forma Normal Boyce Codd 4 Forma Normal 5 Forma Normal First Normal Form (1NF) Each attribute or column value must be atomic  Each attribute must contain a single value  emplyeeID name job department ID skills 7513 Nora Programmer 128 C, Perl, Java 9842 Ben DBA 6651 Alex Programmer 128 VB, Java 9006 Claudia System-Administrator 128 NT, Linux 42 DB2 1NF emplyeeID name job departmentID skills 7513 Nora Programmer 128 C 7513 Nora Programmer 128 Perl 7513 Nora Programmer 128 java 9842 Ben DBA 6651 Alex Programmer 128 VB 6651 Alex Programmer 128 java 9006 Claudia System-Administrator 128 NT 9006 Claudia System-Administrator 128 Linux 42 DB2 Second Normal Form (2NF)  All attributes that are no part of the primary key are fully dependent on the primary key   Each non key attribute must be functionally dependent on the key Is already in 1NF 2NF ? emplyeeID name job departmentID skills 7513 Nora Programmer 128 C 7513 Nora Programmer 128 Perl 7513 Nora Programmer 128 java 9842 Ben DBA 6651 Alex Programmer 128 VB 6651 Alex Programmer 128 java 9006 Claudia System-Administrator 128 NT 9006 Claudia System-Administrator 128 Linux employee(employeeID,name,job,departmentID,skill) 42 DB2 Functional dependencies   employeeID,skill employeeID   name, job, deparmentID name, job, deparmentID Partially functionally dependent on the primary key  Not fully functionally dependent on the primary key  2NF  Decompose the table into tables which all the non-key attributes are fully functionally dependent on the key  Breaking the table into two tables   employee(employeeID,name,job,departmentID) employeeSkills(employeeID,skill) Third Normal Form (3NF) Remove all transitive dependencies  Be in 2NF  employeID name job departmentID department Name 7513 Nora Programmer 9842 Ben DBA 6651 Ajay Programmer 42 Finance 9006 Candy SYS  128 Research 128 Research 128 Research employee(employeeID,name,job,departmentID,departmentName)   employeeID departmentID   name,job,departmentID,departmentName departmentName Transitive dependency  employeeID employeeID   departmentName deparmtentID  departmentID  departmentName  3NF Remove transitive dependency  Decompose into multiple tables    emploee(employeeID,name,jop,departmentID) deparment(deparmentID,deparmtentName) 3NF  The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)  Boyce-Codd Normal Form or  The right side of the functional dependency is a part of any key of the table BCNF  All attributes must be functionally determined by a superkey     Full normalization means lots of logically seperate relations Lots of logically separate relations means a lot of physically separate files Lots of physically separate files means a lot of I/O Difficulties in finding dimensions for dimensional schema, star schema  (dimension tables, fact table) What is Denormalization?  Normalizing a relational variable R means replacing R by a set of projections R1,R2,..,Rn such that R is equal to the join R1,R2,..,Rn   Denormalizing the relational variables means replacing them by their join R   Reduce redundancy, each projections R1,R2,..,Rn is at the highest possible value of normalization Increase redundancy, by ensuring that R is a lower level of normalization than R1,R2,..,Rn Problems  Once we start to denormalize, it is not clear when to stop? Dimensional Schema  Array cells often empty     The more dimensions, there more empty cells Empty cell  Missing information How to treat not present information ? How does the system support • • • •  Information is unknown Has been not captured Not applicable .... Solution? Why Data Preprocessing?  Data in the real world is dirty  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., occupation=“ ”  noisy: containing errors or outliers • e.g., Salary=“-10”  inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records Why Is Data Dirty?  Incomplete data may come from     Noisy data (incorrect values) may come from     Faulty data collection instruments Human or computer error at data entry Errors in data transmission Inconsistent data may come from    “Not applicable” data value when collected Different considerations between the time when the data was collected and when it is analyzed. Human/hardware/software problems Different data sources Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning Why Is Data Preprocessing Important?  No quality data, no quality mining results!  Quality decisions must be based on quality data • e.g., duplicate or missing data may cause incorrect or even misleading statistics.   Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse Multi-Dimensional Measure of Data Quality  A well-accepted multidimensional view:         Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Major Tasks in Data Preprocessing  Data cleaning   Data integration   Normalization and aggregation Data reduction   Integration of multiple databases, data cubes, or files Data transformation   Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Obtains reduced representation in volume but produces the same or similar analytical results Data discretization  Part of data reduction but with particular importance, especially for numerical data Forms of Data Preprocessing Data Cleaning  Importance    “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey Data cleaning tasks  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Resolve redundancy caused by data integration Missing Data  Data is not always available    E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data Missing data may need to be inferred Missing Values The approach of the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)  null corresponds to UNK for unknown  3VL a mistake?  Boolean Operators AND t u f OR t u f NOT t t u f t t t t t f u u u f u t u u u u f f f f f t u f f t  In scalar comparison in which either of the compared is UNK evaluates the unknown truth value MAYBE  Another important Boolean operator is MAYBE MAYBE t f u t f f Example  Consider the query “Get employees who may be- but are not definitely known to be- programmers born before January 18, 1971, with salary less then €40.000 EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND DOB < DATE (‘1971-1-18’) AND SALLARY < 40000 )  Without maybe we assume the existence of another operator called IS_UKN which takes a single scalar operand and returns true if operand evaluates UNK otherwise false EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UKN (JOB) ) AND ( DOB < DATE (‘1971-1-18’) OR IS_UKN (DOB) ) AND ( SALLARY < 40000 OR IS_UKN (SALLARY) ) AND NOT ( JOB = ‘PROGRAMMER’ AND DOB < DATE (‘1971-1-18’) AND SALLARY < 40000 ) Numeric expression  WEIGHT * 454 If WEIGHT is UKN, then the result is also UKN  Any numeric expression is considered to evaluate UNK if any operands of that expression is itself UNK   Anomalies WEIGHT-WEIGHT=UNK  WEIGHT/0=UNK  (0) (“zero divide”) UNK is not u (unk)    UNK (the value-unknown null) u (unk) (unknown truth value) ...are not the same thing   u is a value, UNK not a value at all! Suppose X is BOOLEAN    Has tree values: t (true),f (false), u ukn X is ukn, X is known to be unk X is UKN, X is not known! Some 3VL Consequences  The comparison x=x does not give true   In 3VL x is not equal to itself it is happens to be UNK The Boolean expression p OR NOT(p) does not give necessarily true  unk OR NOT (unk) = unk Example  Get all suppliers in Porto and take the union with get all suppliers not in Porto   We need to add maybe in Porto    We do not get all suppliers! In 2 VL p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL While two cases my exhaust full range of possibilities in the real world, the database does not contain the real world - instead it contains only knowledge about real world Some 3VL Consequences The expression r JOIN r does not necessarily give r  A=B and B=C together does not imply A=C  ....  Many equivalences that are valid in 2VL break down in 3VL  We will get wrong answers  Special Values Drop the idea of null and UNK,unk 3VL  Use special values instead to represent missing information   Special values are used in the real world  In the real world we might use the special value „?“ to denote hours worked by a certain employee if actual value is unknown Special Values  General Idea:  Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used The special value must be of the applicable attribute is not just integers, but integers integers plus whatever the special value is Approach is not very elegant, but without 3VL problems, because it is in 2VL   How to Handle Missing Data?  Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably.  Fill in the missing value manually: tedious + infeasible?  Fill in it automatically with  a global constant : e.g., “unknown”, a new class?!  the attribute mean  the attribute mean for all samples belonging to the same class: smarter  the most probable value: inference-based such as Bayesian formula or decision tree  Relational Databases - Normalization   Denormalization Data Preprocessing     Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values Next..  Data Preprocessing Visual inspection  Noise Reduction  Data Reduction  Data Discretization  Data Integration 

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Preprocessing