CPIT 440
Data Mining and Warehouse
Lab4
www.company.com
Lab4: Outline
• Data Mining Process
• Data Gathering and Preparation (Preprocessing)
– Data Preprocessing Techniques
• Data Integration Techniques
• Data Cleaning Techniques
• Data Transformation Techniques
• Data Discretization Techniques
– Definition and Exercises
Data Mining Process
Data Gathering and Preparation
• The data understanding phase involves data
collection and exploration.
• As you take a closer look at the data, you can
determine how well it addresses the business
problem.
• You might decide to remove some of the data or
add additional data.
• Data preparation can significantly improve the
information that can be discovered through data
mining.
Data Gathering and Preparation
• The data preparation phase covers all the tasks
involved in creating the case table you will use to
build the model.
• Tasks include data cleansing, binning and
transformation.
• For example,
– you might transform a DATE_OF_BIRTH column to
AGE;
– you might insert the average income in cases where
the INCOME column is null.
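The two example transformations above can be sketched outside ODM in a few lines of Python. This is only an illustration under stated assumptions: the rows, the reference date, and the helper names age_on and fill_with_mean are hypothetical and are not part of the lab files.

```python
from datetime import date
from statistics import mean


def age_on(date_of_birth: date, today: date) -> int:
    """Return the number of completed years between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


# Hypothetical case table rows (made-up values, for illustration only).
cases = [
    {"DATE_OF_BIRTH": date(1990, 5, 17), "INCOME": 52000},
    {"DATE_OF_BIRTH": date(1985, 11, 2), "INCOME": None},
    {"DATE_OF_BIRTH": date(2001, 1, 30), "INCOME": 38000},
]

reference_day = date(2024, 6, 1)  # assumed "today" so the result is repeatable
incomes = fill_with_mean([row["INCOME"] for row in cases])
prepared = [
    {"AGE": age_on(row["DATE_OF_BIRTH"], reference_day), "INCOME": income}
    for row, income in zip(cases, incomes)
]
print(prepared)
```

Fixing the reference date keeps the derived AGE column stable between runs; a real case table would normally use the date on which the data was extracted.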
Data Preprocessing Techniques
• Data Integration Techniques:
– Correlation (Numerical Data) by using Excel
– Correlation (Categorical Data-Chi-Square Test) by
using Excel
• Data Cleaning Techniques:
– Fill the Missing Values by using ODM
– Outlier Treatment for Reducing Noise by using ODM
• Data Transformation Technique:
– Normalization by using ODM
• Data Discretization Technique:
– Discretization by using ODM
Data Integration Technique
Definition:
• Sometimes too much information can reduce the
effectiveness of data mining.
• Data sets with many attributes may contain
groups of attributes that are:
• Irrelevant attributes, which simply add noise
to the data and affect model accuracy.
– Noise increases the size of the model and the time and
system resources needed for model building and
scoring.
Data Integration Technique
• Or, correlated attributes that may actually be
measuring the same underlying feature.
– Their presence together in the build data can skew the
logic of the algorithm and affect the accuracy of the
model.
• To minimize the effects of noise, a technique
such as correlation analysis is sometimes a desirable
preprocessing step for data mining.
Data Integration Technique
Exercises:
• Correlation (Numerical Data) by using Excel.
• Open Excel file Corr.xlsx
• Correlation results always lie between -1 and 1:
– 1 = perfect positive correlation
– 0 = no linear correlation
– -1 = perfect negative correlation
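To see where the values between -1 and 1 come from, the Pearson correlation coefficient can be computed by hand. The sketch below assumes this is the measure used in the exercise (it is the one returned by Excel's CORREL worksheet function); the function name pearson and the sample columns are hypothetical and do not come from Corr.xlsx.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of the same length with at least 2 rows")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Made-up columns: the second is an exact multiple of the first,
# the third decreases as the first increases.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

Two attributes whose coefficient is close to 1 or -1 are candidates for measuring the same underlying feature, so one of them can often be dropped before building the model.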
Data Cleaning Technique
1. Fill the Missing Values by using ODM:
– When building or applying a model, Oracle Data Mining
automatically replaces missing values of
• numerical attributes with the mean, max/min, a specific
value, or zero;
• categorical attributes with the mode.
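The rule above (mean for numerical attributes, mode for categorical attributes) can be illustrated with a small Python sketch. It is not ODM's implementation; the function name fill_missing and the sample columns are hypothetical, and None stands in for a missing value.

```python
from statistics import mean, mode


def fill_missing(column):
    """Fill None entries: mean for numerical columns, mode for categorical ones."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to infer a replacement from
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in known
    )
    # Numerical attribute -> mean; categorical attribute -> most frequent value.
    replacement = mean(known) if is_numeric else mode(known)
    return [replacement if v is None else v for v in column]


# Made-up columns for illustration only.
print(fill_missing([3, None, 7, 5, None]))
print(fill_missing(["single", None, "married", "married"]))
```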
Exercise
– Open ODM and import the file demo_missing.csv
• Look at the attribute length_of_residence in this file;
some of its values are missing.
– Now we will apply a data cleaning technique to fill
in the missing data.
• From ODM, open Data → Transform → Missing Value
Exercise
– This opens the Missing Value Transformation Wizard.
Exercise
– In the 4th step of the wizard, select the column (attribute)
to which you are going to apply the missing value
technique, and then press the Transform button.
– You will see three options; select Replace With – Mean.
– Continue with the Next button until you finish.
Exercise
Use a histogram to see the difference between the data
with missing values and the data after the values are filled in.
[Histograms: With Missing / After solving Missing]
Data Cleaning Technique
2. Outlier Treatment for Reducing Noise by using ODM:
– A value is considered an outlier if it deviates significantly
from most other values in the column.
– The presence of outliers can have a skewing effect on the
data and can result in an inaccurate model.
– Outlier treatment methods such as trimming or clipping can
be implemented to minimize the effect of outliers.
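As a rough illustration of trimming and clipping, the sketch below flags values that lie more than k standard deviations from the mean. It assumes that trimming replaces an outlier with a null (None) and that clipping replaces it with the nearest boundary (edge) value; the function names, the threshold k, and the sample data are hypothetical and are not ODM's implementation.

```python
from statistics import mean, stdev


def outlier_bounds(values, k=3.0):
    """Return (low, high) cut-offs at k standard deviations around the mean."""
    centre = mean(values)
    spread = stdev(values)
    return centre - k * spread, centre + k * spread


def clip(values, k=3.0):
    """Clipping: replace each outlier with the nearest cut-off (edge) value."""
    low, high = outlier_bounds(values, k)
    return [min(max(v, low), high) for v in values]


def trim(values, k=3.0):
    """Trimming: replace each outlier with None (a null value)."""
    low, high = outlier_bounds(values, k)
    return [v if low <= v <= high else None for v in values]


# Made-up column in which 40 lies far from the other values.
years = [2, 3, 3, 4, 4, 5, 5, 6, 6, 40]
print(trim(years, k=2))
print(clip(years, k=2))
```

The example uses k=2 because, in a column this short, a single extreme value also inflates the standard deviation and would never fall outside three standard deviations.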
Exercise
– Import the file demo_outliers.csv
• Look at the attribute years_details_listed in this file;
it contains some outliers (noise), that is, values
that are very far from the others.
Exercise
– Now we will apply a data cleaning technique to
reduce this noise in the data.
– Open Data → Transform → Outlier Treatment
Exercise
– This opens the Outlier Treatment Transformation Wizard.
– In the 4th step of the wizard, select the column (attribute)
to which you are going to apply the outlier treatment
technique.
– Then press the std.deviation button and choose edge or null
values as the replacement.
Exercise
– Continue with the Next button until you finish.
Exercise
Use a histogram to see the difference between the
noisy data and the data after outlier treatment is applied.
Data Transformation Technique:
• Normalization by using ODM:
– Normalization is a technique that transforms
numerical values into a specific range, such as
[–1.0…1.0] or [0.0…1.0]
Exercise
– Import the file demo_original.csv
• Look at the attribute family_income_indicator in this
file; we will apply the normalization technique to it.
Exercise
– Open Data Transform Normalize
– This will open Normalize Transformation Wizard
– In the 3rd Step of wizard Select the Column (attribute)
on which you are going to apply normalize technique
and
– then press Define button then select min-max
transformation algorithm.
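The min-max algorithm selected in the wizard rescales each value linearly so that the column minimum maps to the lower end of the target range and the column maximum maps to the upper end. The sketch below shows the standard formula in Python; the function name min_max_normalize and the sample values are hypothetical, and ODM's own handling of details such as constant columns may differ.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("cannot rescale a constant column")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# Made-up column for illustration only.
incomes = [10, 20, 30]
print(min_max_normalize(incomes))             # target range [0.0, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # target range [-1.0, 1.0]
```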
Exercise
• Continue with the Next button until you finish.
Exercise
Use a histogram to see the difference
before and after normalization.
Data Discretization Technique:
• Discretization by using ODM
– Discretization, also called binning, is a technique for
reducing the cardinality of continuous and discrete data.
– It groups related values together in bins to reduce the
number of distinct values.
– Discretization can improve resource utilization and
model build response time dramatically without
significant loss in model quality.
Exercise
– Import the file demo_original.csv
• Look at the attribute family_income_indicator in this
file; we will apply the discretization technique to it.
– Open Data → Transform → Discretize
– This opens the Discretize Transformation Wizard.
– In the 4th step of the wizard, select the column (attribute)
to which you are going to apply the discretization technique.
– Then press the Equal Width button and enter 10 as the number
of bins.
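Equal-width binning divides the range between the column minimum and maximum into intervals of the same size and replaces each value with the number of the interval it falls into. The sketch below illustrates this with 10 bins, as in the exercise; the function name equal_width_bins, the zero-based bin numbering, and the sample values are assumptions and may differ from the labels ODM produces.

```python
def equal_width_bins(values, n_bins=10):
    """Return a zero-based bin index for each value, using equal-width intervals."""
    low, high = min(values), max(values)
    if low == high:
        return [0] * len(values)  # a constant column fits in a single bin
    width = (high - low) / n_bins
    # The maximum falls exactly on the last edge, so cap the index at n_bins - 1.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Made-up column: the range 0..100 split into 10 bins of width 10.
print(equal_width_bins([0, 5, 10, 55, 99, 100], n_bins=10))
```

After binning, the column has at most 10 distinct values, which is the reduction in cardinality described in the definition.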
Exercise
• Continue with the Next button until you finish.
Exercise
Use a histogram to see the difference before
and after discretization.
[Histograms: Before / After]