Download Printable version - University of Alberta

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Summary of Last Chapter
Principles of Knowledge
Discovery in Databases
• What is a data warehouse and what is it for?
Fall 1999
• What is the multi-dimensional data model?
Chapter 3: Data Preprocessing
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
Dr. Osmar R. Zaïane
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?
University of Alberta
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
1
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
2
Course Content
Chapter 3 Objectives
•
•
•
•
•
•
•
•
•
•
Introduction to Data Mining
Data warehousing and OLAP
Data cleaning
Data mining operations
Data summarization
Association analysis
Classification and prediction
Clustering
Web Mining
Similarity Search
•
Other topics if time permits
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
Realize the importance of data preprocessing
for real world data before data mining or
construction of data warehouses.
Get an overview of some data preprocessing
issues and techniques.
University of Alberta
3
 Dr. Osmar R. Zaïane, 1999
4
In real world applications data can be inconsistent, incomplete
and/or noisy.
• What is the motivation behind data preprocessing?
Errors can happen:
• What is data cleaning and what is it for?
• Faulty data collection instruments
• Data entry problems
• Human misjudgment during data entry
• Data transmission problems
• Technology limitations
• Discrepancy in naming conventions
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
Results:
• What is data discretization?
• Duplicated records
• Incomplete data
• Contradictions in data
• How do we generate concept hierarchies?
Principles of Knowledge Discovery in Databases
University of Alberta
Motivation
Data Preprocessing Outline
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
5
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
6
1
Motivation (Con’t)
Data Preprocessing
Data Warehouse
Data Cleaning
Data Mining
Data Integration
Decision
Data
What happens when the data can not be trusted?
Can the decision be trusted? Decision making is jeopardized.
Data Transformation
Better chance to
discover useful
knowledge when
data is clean.
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
7
Data Reduction
 Dr. Osmar R. Zaïane, 1999
• What is data cleaning and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
•
•
•
• What is data discretization?
• How do we generate concept hierarchies?
9
Solving Missing Data
10
Fill in missing values
Smooth out noisy data
Correct inconsistencies
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise.
This can be done by:
• Ignore the tuple with missing values;
• Fill in the missing values manually;
• Binning
• Clustering
• Regression
• Use a global constant to fill in missing values (NULL, unknown, etc.);
• Use the attribute value mean to filling missing values of that
attribute;
y
Data regression consists of fitting the data to
a function. A linear regression for instance,
finds the line to fit 2 variables so that one
variable can predict the other.
More variables can be involved in a multiple
linear regression.
• Use the attribute mean for all samples belonging to the same
class to fill in the missing values;
• Infer the most probable value to fill in the missing value.
 Dr. Osmar R. Zaïane, 1999
University of Alberta
No recorded values for some attributes
Not considered at time of entry
Random errors
…
Data cleaning attempts to:
• What is data integration and what is it for?
University of Alberta
8
Real-world application data can be incomplete,
noisy, and inconsistent.
• What is the motivation behind data preprocessing?
Principles of Knowledge Discovery in Databases
University of Alberta
Data Cleaning
Data Preprocessing Outline
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
Principles of Knowledge Discovery in Databases
University of Alberta
11
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
Y1
y=x+1
Y1’
X1
University of Alberta
x
12
2
Binning
Clustering
Binning smoothes the data by consulting the value’s neighbourhood.
Data is organized into groups of “similar” values.
Rare values that fall outside these groups are
considered outliers and are discarded.
First, the data is sorted to get the values “in their neighbourhoods”.
Second, the data is distributed in equi-width bins:
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Bins of depth 3:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Third, process local smoothing.
Smoothing by bin median
Smoothing by bin means
Smoothing by bin boundaries
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
13
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
14
Data Integration
Data Preprocessing Outline
Data analysis may require a combination of data
from multiple sources into a coherent data store.
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
There are many challenges:
•Schema integration: CID ≈ C_number ≈ Cust-id ≈ cust#
•Semantic heterogeneity
• Data value conflicts (different representations or scales, etc.)
•Redundant records
•Redundant attributes (redundant if it can be derived from other attributes)
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
• What is data discretization?
•Correlation analysis P(A∧B)/(P(A)P(B))
1: independent, >1 positive correlation, <1: negative correlation.
• How do we generate concept hierarchies?
Metadata is often necessary
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
15
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
16
Data Transformation
Data Preprocessing Outline
Data is sometimes in a form not appropriate for mining.
Either the algorithm at hand can not handle it, the form
of the data is not regular, or the data itself is not specific
enough.
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
• What is data transformation and what is it for?
• Normalization (to compare carrots with carrots)
• Smoothing
• Aggregation (summary operation applied to data)
• Generalization (low level data is replaced with level data – concept hierarchy)
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
17
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
18
3
Normalization
Data Preprocessing Outline
Min-max normalization: linear transformation from v to v’
v’= v-min/(max – min) (newmax – newmin) + new min
Ex: transform $30000 between [10000..45000] into [0..1] 30-10/35(1)+0=0.514
• What is the motivation behind data preprocessing?
Zscore normalization: normalization v into v’ based on attribute value
mean and standard deviation v’=v-Mean/StandardDeviation
• What is data integration and what is it for?
Î
• What is data cleaning and what is it for?
• What is data transformation and what is it for?
Normalization by decimal scaling: moves the decimal point of v by j
positions such that j is the minimum number of positions moved to the
decimal of the absolute maximum value to make is fall in [0..1].
v’=v/10j
Ex: if v ranges between –56 and 9976, j=4 v’ ranges between –0.0056 and 0.9976
Î
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
19
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?
 Dr. Osmar R. Zaïane, 1999
The data is often too large. Reducing the data can improve performance.
Data reduction consists of reducing the representation of the data set
while producing the same (or almost the same) results.
 Dr. Osmar R. Zaïane, 1999
20
Reduce the data to the concept level needed in the analysis.
Time (years)
Time (Quarters per years)
Location
Q1 Q2 Q3 Q4
Maritimes
Drama
Quebec
(province, Ontario
Prairies
Canada)Western
Pr
Comedy
Horror
Edmonton
Montreal
Toronto
Vancouver
Location
(city)
Category
19981999
1997
Drama
Comedy
Horror
•Numerosity reduction
ƒRegression
ƒHistograms
ƒClustering
ƒSampling
University of Alberta
Category
Sci. Fi..
Sci. Fi..
Principles of Knowledge Discovery in Databases
University of Alberta
Data Cube Aggregation
Data Reduction
Data reduction includes:
•Data cube aggregation
•Dimension reduction
•Data compression
•Discretization
Principles of Knowledge Discovery in Databases
Reduce
Queries regarding aggregated information should be answered
using data cube when possible.
21
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
22
Decision-Tree Induction
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Initial attribute set:
{A1, A2, A3, A4, A5}
• Select only the necessary attributes.
• The goal is to find a minimum set of attributes such that the resulting
probability distribution of data classes is as close as possible to the original
distribution obtained using all attributes.
• Exponential number of possibilities.
A1 ?
A5?
A3?
Use heuristics: select local ‘best’ (or most pertinent) attribute.
Information gain, etc.
Class 1
¾step-wise forward selection {}{A1}{A1,A3}{A1,A3,A5}
¾step-wise backward elimination {A1,A2,A3,A4,A5} {A1,A3,A4,A5} {A1,A3,A5}
¾combining forward selection and backward elimination
¾decision-tree induction
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
>
23
 Dr. Osmar R. Zaïane, 1999
Class 2
Class 1
Class 2
Reduced attribute set: {A1, A3, A5}
Principles of Knowledge Discovery in Databases
University of Alberta
24
4
Data Compression
Numerosity Reduction
Data compression reduces the size of data.
saves storage space.
saves communication time.
Data volume can be reduced by choosing alternative forms of data
representation.
Î
Î
Parametric
•Regression (a model or function estimating the distribution instead of
There is lossless compression and lossy compression.
Used for all sorts of data. Some methods are data specific, others
are versatile.
the data.)
Non-parametric
•Histograms
•Clustering
•Sampling
For data mining, data compression is beneficial if data mining
algorithms can manipulate compressed data directly without
uncompressing it.
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
25
Reduction with Histograms
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
26
Reduction with Clustering
Partition data into clusters based on “closeness” in space.
Retain representatives of clusters (centroids) and outliers.
Effectiveness depends upon the distribution of data
Hierarchical clustering is possible (multi-resolution).
A popular data reduction technique:
Divide data into buckets and store representation of buckets
(sum, count, etc.)
•Equi-width (histogram with bars having the same width)
•Equi-depth (histogram with bars having the same height)
•V-Optimal (histogram with least variance ∑(countb*valueb)
•MaxDiff (bucket boundaries defined by user specified threshold)
Related to quantization problem.
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
27
Reduction with Sampling
™Simple random sample without replacement (SRSWOR)
™Simple random sampling with replacement (SRSWR)
™Cluster sample (SRSWOR or SRSWR from clusters)
™Stratified sample (stratum = group based on attribute value)
Random sampling can produce poor results Î active research.
Principles of Knowledge Discovery in Databases
University of Alberta
Principles of Knowledge Discovery in Databases
University of Alberta
28
Data Preprocessing Outline
Allows a large data set to be represented by a much smaller random
sample of the data (sub-set).
•How to select a random sample?
•Will the patterns in the sample represent the patterns in the data?
 Dr. Osmar R. Zaïane, 1999
 Dr. Osmar R. Zaïane, 1999
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?
29
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
30
5
Discretization
Data Preprocessing Outline
Discretization is used to reduce the number of values for a
given continuous attribute, by dividing the range of the
attribute into intervals. Interval labels are then used to
replace actual data values.
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
Some data mining algorithms only accept categorical
attributes and cannot handle a range of continuous
attribute value.
• What is data transformation and what is it for?
• What is data reduction and what is it for?
Discretization can reduce the data set, and can also be used
to generate concept hierarchies automatically.
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
31
• What is data discretization?
• How do we generate concept hierarchies?
 Dr. Osmar R. Zaïane, 1999
Discretization and Concept
Hierarchies
Î
 Dr. Osmar R. Zaïane, 1999
Î
Î
Principles of Knowledge Discovery in Databases
University of Alberta
Î
Î
Î
Î
Î
Binning (bin representative (mean/median) recursive binning C.H.)
Histogram analysis (recursive with min interval size C.H.)
Clustering (recursive clustering C.H.)
Entropy-based (binary partitioning with information gain evaluation C.H.)
3-4-5 data segmentation (uniform intervals with rounded boundaries)
Î
32
•Sort data
get Min and Max
•Determine 5%-95% tile
get Low and High
•Determine most significant digit (msd)
get Low’ and High’
•(High’-Low’)/msd
If 3, 6, 7, or 9
partition in 3 intervals
If 2, 4, or 8
partition in 4 intervals
If 1, 5, or 10
partition in 5 intervals
•Adjust both ends intervals to include 100% data
•Repeat recursively
It is difficult to construct concept hierarchies for
numerical attributes.
Î Automatic concept hierarchy generation based on
data distribution analysis.
Î
University of Alberta
3-4-5 Partitioning
For numerical data There is:
•Wide diversity of possible range values
•Frequent updates
Î
Principles of Knowledge Discovery in Databases
33
 Dr. Osmar R. Zaïane, 1999
Principles of Knowledge Discovery in Databases
University of Alberta
34
3-4-5 Partitioning Example
5%
count
95%
Step 1:
-$351
-$159
Min
Low (i.e, 5%-tile)
Step 2:
msd=1,000
Step 3:
profit
$1,838
$4,700
High(i.e, 95%-0 tile)
Low=-$1,000
Max
High=$2,000
(-$1,000 - $2,000)
(-$1,000 - 0)
(0 -$ 1,000)
($1,000 - $2,000)
(-$4000 -$5,000)
Step 4:
(-$400 - 0)
(0 - $200)
(-$400 - -$300)
(-$300 - -$200)
(-$200 - -$100)
($2,000 - $5, 000)
($1,000 - $2, 000)
(0 - $1,000)
($1,000 - $1,200)
($200 - $400)
($2,000 - $3,000)
($1,200 - $1,400)
($3,000 - $4,000)
($400 - $600)
($1,400 - $1,600)
Step 5:
($600 - $800)
(-$100 - 0)
 Dr. Osmar R. Zaïane, 1999
($800 - $1,000)
($4,000 - $5,000)
($1,600 - $1,800)
($1,800 - $2,000)
Principles of Knowledge Discovery in Databases
University of Alberta
35
6