Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Quality: Why Preprocess the Data?
- Measures for data quality: a multidimensional view
  - Accuracy: correct or wrong, accurate or not
  - Completeness: not recorded, unavailable, ...
  - Consistency: some modified but some not, dangling, ...
  - Timeliness: timely update?
  - Believability: how much do we trust that the data are correct?
  - Interpretability: how easily the data can be understood?
Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Cleaning
- Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., Occupation = "" (missing data)
  - Noisy: containing noise, errors, or outliers
    - e.g., Salary = "−10" (an error)
  - Inconsistent: containing discrepancies in codes or names, e.g.,
    - Age = "42", Birthday = "03/07/2010"
    - Was rating "1, 2, 3", now rating "A, B, C"
    - Discrepancy between duplicate records
  - Intentional (e.g., disguised missing data)
    - Jan. 1 as everyone's birthday?
Incomplete (Missing) Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
  - a global constant, e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree
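As a rough illustration of these filling strategies, here is a minimal pandas sketch; the table, column names, and values are made up for the example:

```python
import pandas as pd

# Hypothetical sales data with missing customer income values
df = pd.DataFrame({"income": [45000, None, 52000, None, 61000],
                   "class":  ["low", "low", "high", "high", "high"]})

dropped     = df.dropna()                                  # ignore the tuple
global_fill = df.fillna({"income": "unknown"})             # global constant (mixes types)
mean_fill   = df.fillna({"income": df["income"].mean()})   # attribute mean
class_mean  = df.assign(income=df.groupby("class")["income"]
                        .transform(lambda s: s.fillna(s.mean())))  # class-wise mean
```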
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data?
- Binning
  - First sort the data and partition it into (equal-frequency) bins
  - Then smooth by bin means, bin medians, bin boundaries, etc.
- Regression
  - Smooth by fitting the data to regression functions
- Clustering
  - Detect and remove outliers
- Combined computer and human inspection
  - Detect suspicious values and have a human check them (e.g., deal with possible outliers)
Data Cleaning as a Process
- Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check the uniqueness rule, consecutive rule, and null rule
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
- Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
- Integration of the two processes
  - Iterative and interactive (e.g., Potter's Wheel)
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Integration
- Data integration:
  - Combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
- Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources may differ
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
- Redundant data occur often when multiple databases are integrated
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis and covariance analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
- χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected

- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality
  - The number of hospitals and the number of car thefts in a city are correlated
  - Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction   50 (210)    1000 (840)       1050
  Sum (col.)                300          1200             1500

- χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

- It shows that like_science_fiction and play_chess are correlated in the group
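A quick way to check these numbers is the scipy sketch below (assuming scipy is available; correction=False disables the Yates continuity correction so the result matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / not like science fiction; columns: play / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)   # [[ 90. 360.] [210. 840.]]
print(chi2)       # ~507.94, matching the slide's 507.93 up to rounding
```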
Correlation Analysis (Numeric Data)
- Correlation coefficient (also called Pearson's product-moment coefficient):

  r(A,B) = Σ (a_i − mean(A))(b_i − mean(B)) / ((n − 1) σ_A σ_B) = (Σ a_i·b_i − n·mean(A)·mean(B)) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the sum of the AB cross-product.
- If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
- r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
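A minimal numpy sketch that evaluates both forms of the formula and compares them with numpy's built-in; the two attribute vectors are made up for illustration:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # hypothetical attribute A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # hypothetical attribute B
n = len(a)

r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_alt = (np.sum(a * b) - n * a.mean() * b.mean()) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(round(r, 3), round(r_alt, 3), round(np.corrcoef(a, b)[0, 1], 3))  # all three agree
```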
Visually Evaluating Correlation
[Figure: scatter plots showing correlations ranging from −1 to 1]
Correlation (Viewed as a Linear Relationship)
- Correlation measures the linear relationship between objects
- To compute correlation, we standardize the data objects A and B, and then take their dot product:

  a'_k = (a_k − mean(A)) / std(A)
  b'_k = (b_k − mean(B)) / std(B)
  correlation(A, B) = A' • B'
Covariance (Numeric Data)
- Covariance is similar to correlation:

  Cov(A, B) = E[(A − mean(A))(B − mean(B))] = Σ (a_i − mean(A))(b_i − mean(B)) / n

  Correlation coefficient: r(A,B) = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective means or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B
- Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
- Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
- Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true:
  - Some pairs of random variables may have a covariance of 0 but not be independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Covariance: An Example
- The covariance can be simplified in computation as Cov(A, B) = E(A·B) − mean(A)·mean(B)
- Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
- Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
  - E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  - E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  - Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
- Thus, A and B rise together, since Cov(A, B) > 0
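A few lines of Python reproduce the simplified computation E(A·B) − E(A)·E(B) on the stock values above:

```python
a = [2, 3, 5, 4, 6]      # stock A, one value per day
b = [5, 8, 10, 11, 14]   # stock B

n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n
cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b  # E(AB) - E(A)E(B)
print(mean_a, mean_b, round(cov, 2))   # 4.0 9.6 4.0
```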
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Reduction Strategies
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
- Why data reduction? A database/data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set
- Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
    - Wavelet transforms
    - Principal Components Analysis (PCA)
    - Feature subset selection, feature creation
  - Numerosity reduction (some simply call it data reduction)
    - Regression and log-linear models
    - Histograms, clustering, sampling
    - Data cube aggregation
  - Data compression
Data Reduction 1: Dimensionality Reduction
- Curse of dimensionality
  - When dimensionality increases, data become increasingly sparse
  - Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  - The number of possible combinations of subspaces grows exponentially
- Dimensionality reduction
  - Avoid the curse of dimensionality
  - Help eliminate irrelevant features and reduce noise
  - Reduce the time and space required in data mining
  - Allow easier visualization
- Dimensionality reduction techniques
  - Wavelet transforms
  - Principal Component Analysis
  - Supervised and nonlinear techniques (e.g., feature selection)
Mapping Data to a New Space
- Fourier transform
- Wavelet transform
[Figure: two sine waves and two sine waves + noise, shown in the time and frequency domains]
What Is Wavelet Transform?
- Decomposes a signal into different frequency subbands
  - Applicable to n-dimensional signals
- Data are transformed to preserve the relative distance between objects at different levels of resolution
- Allows natural clusters to become more distinguishable
- Used for image compression
Wavelet Transformation
- Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method (example wavelet families: Haar-2, Daubechies-4):
  - The length, L, must be an integer power of 2 (pad with 0's when necessary)
  - Each transform has 2 functions: smoothing, difference
  - Applied to pairs of data, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
Wavelet Decomposition
- Wavelets: a mathematical tool for the space-efficient hierarchical decomposition of functions
- S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]
- Compression: many small detail coefficients can be replaced by 0's, and only the significant coefficients are retained
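The tiny sketch below performs this hierarchical Haar-style decomposition (pairwise averages for smoothing, halved differences for detail) and reproduces the coefficients above:

```python
def haar_decompose(s):
    """Repeatedly replace pairs by their average (smoothing) and half-difference
    (detail), keeping the details; assumes len(s) is a power of 2."""
    coeffs = []
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details  = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = details + coeffs       # coarser details go in front
        s = averages
    return s + coeffs                   # [overall average, details...]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```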
Why Wavelet Transform?
- Uses hat-shaped filters
  - Emphasize regions where points cluster
  - Suppress weaker information at their boundaries
- Effective removal of outliers
  - Insensitive to noise and to input order
- Multi-resolution
  - Detects arbitrarily shaped clusters at different scales
- Efficient
  - Complexity O(N)
- Only applicable to low-dimensional data
Principal Component Analysis (PCA)
- Find a projection that captures the largest amount of variation in the data
- The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the first principal component direction e]
Principal Component Analysis (Steps)
- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  - Normalize the input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing "significance" or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
- Works for numeric data only
Attribute Subset Selection
- Another way to reduce the dimensionality of data
- Redundant attributes
  - Duplicate much or all of the information contained in one or more other attributes
  - E.g., the purchase price of a product and the amount of sales tax paid
- Irrelevant attributes
  - Contain no information that is useful for the data mining task at hand
  - E.g., students' ID is often irrelevant to the task of predicting students' GPA
Heuristic Search in Attribute Selection
- There are 2^d possible attribute combinations of d attributes
- Typical heuristic attribute selection methods:
  - Best single attribute under the attribute independence assumption: choose by significance tests
  - Best step-wise feature selection (see the sketch below):
    - The best single attribute is picked first
    - Then the next best attribute conditioned on the first, ...
  - Step-wise attribute elimination:
    - Repeatedly eliminate the worst attribute
  - Best combined attribute selection and elimination
  - Optimal branch and bound:
    - Use attribute elimination and backtracking
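A minimal sketch of greedy best step-wise (forward) selection; the `score` function stands for whatever evaluation the user supplies (e.g., a significance test or cross-validated accuracy) and the attribute names and merits are hypothetical:

```python
def forward_selection(attributes, score, k):
    """Greedily add the attribute that most improves score(subset) until k are chosen."""
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Example: pick 2 attributes, scoring a subset by (hypothetical) precomputed merits
merits = {"income": 0.9, "age": 0.6, "zip": 0.1}
print(forward_selection(list(merits), lambda subset: sum(merits[a] for a in subset), 2))
# ['income', 'age']
```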
Attribute Creation (Feature Generation)
- Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
- Three general methodologies
  - Attribute extraction
    - Domain-specific
  - Mapping data to a new space (see data reduction)
    - E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
  - Attribute construction
    - Combining features (see discriminative frequent patterns in the chapter on "Advanced Classification")
    - Data discretization
Data Reduction 2: Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods (e.g., regression)
  - Assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Ex.: log-linear models obtain the value at a point in m-D space as a product over the appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling, ...
Parametric Data Reduction: Regression and Log-Linear Models
- Linear regression
  - Data modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression
  - Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model
  - Approximates discrete multidimensional probability distributions
Regression Analysis
- Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors)
- The parameters are estimated so as to give a "best fit" of the data
- Most commonly the best fit is evaluated by using the least-squares method, but other criteria have also been used
- Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: a data point (X1, Y1) and its fitted value Y1' on the regression line y = x + 1]
Regression Analysis and Log-Linear Models
- Linear regression: Y = wX + b
  - Two regression coefficients, w and b, specify the line and are estimated from the data at hand
  - Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - Approximate discrete multidimensional probability distributions
  - Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
  - Useful for dimensionality reduction and data smoothing
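For the simple linear case Y = wX + b, a least-squares fit is essentially one line of numpy; the data points below are made up for the example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor X
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # hypothetical response Y

w, b = np.polyfit(x, y, deg=1)            # least-squares estimates of slope and intercept
print(round(w, 3), round(b, 3))           # fitted line y = w*x + b
```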
Histogram Analysis
- Divide data into buckets and store the average (sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): equal number of values per bucket
[Figure: equal-width histogram of a price-like attribute over the range 10,000 to 100,000]
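In numpy, the two partitioning rules correspond to fixed-width bin edges versus quantile-based edges; the skewed price data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.exponential(scale=20000, size=1000)      # hypothetical skewed prices

# Equal-width buckets: uniform bucket range
width_counts, width_edges = np.histogram(prices, bins=10)

# Equal-frequency (equal-depth) buckets: roughly the same count in each bucket
depth_edges = np.quantile(prices, np.linspace(0, 1, 11))
depth_counts, _ = np.histogram(prices, bins=depth_edges)
```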
Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
- Can be very effective if the data are clustered, but not if the data are "smeared"
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 10
Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Key principle: choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods, e.g., stratified sampling
- Note: sampling may not reduce database I/Os (a page at a time)
Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - Once an object is selected, it is removed from the population
- Sampling with replacement
  - A selected object is not removed from the population
- Stratified sampling:
  - Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  - Used in conjunction with skewed data
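The pandas sketch below illustrates the three variants on a hypothetical data set with a made-up `stratum` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(1000),
                   "stratum": np.repeat(["A", "B", "C", "D"], 250)})

srswor = df.sample(n=100, replace=False)   # simple random sampling without replacement
srswr  = df.sample(n=100, replace=True)    # simple random sampling with replacement

# Stratified sampling: the same percentage drawn from each partition
stratified = df.groupby("stratum", group_keys=False).apply(lambda g: g.sample(frac=0.1))
```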
Sampling: With or without Replacement
[Figure: raw data sampled by SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement)]
Sampling: Cluster or Stratified Sampling
[Figure: raw data and the corresponding cluster/stratified sample]
Data Cube Aggregation
- The lowest level of a data cube (base cuboid)
  - The aggregated data for an individual entity of interest
  - E.g., a customer in a phone calling data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
- Reference appropriate levels
  - Use the smallest representation that is sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible
Data Reduction 3: Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio
  - Typically short and varying slowly with time
- Dimensionality and numerosity reduction may also be considered as forms of data compression
Data Compression
[Figure: original data is transformed into compressed data; lossless compression recovers the original data exactly, while lossy compression recovers only an approximation]
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
- Methods
  - Smoothing: remove noise from the data
  - Attribute/feature construction
    - New attributes constructed from the given ones
  - Aggregation: summarization, data cube construction
  - Normalization: scaled to fall within a smaller, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling
  - Discretization: concept hierarchy climbing
Normalization
- Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

  - Ex.: let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716
- Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μ_A) / σ_A

  - Ex.: let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
- Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
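The three schemes take only a few lines of numpy; the income values are chosen to match the example above:

```python
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

minmax = (income - income.min()) / (income.max() - income.min())  # min-max to [0.0, 1.0]
zscore = (income - income.mean()) / income.std()                  # z-score normalization

j = int(np.floor(np.log10(np.abs(income).max()))) + 1             # smallest j with max|v'| < 1
decimal = income / 10 ** j                                         # decimal scaling

print(round(minmax[2], 3))   # 73,600 maps to 0.716
```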
Discretization
- Three types of attributes
  - Nominal: values from an unordered set, e.g., color, profession
  - Ordinal: values from an ordered set, e.g., military or academic rank
  - Numeric: real numbers, e.g., integer or real values
- Discretization: divide the range of a continuous attribute into intervals
  - Interval labels can then be used to replace actual data values
  - Reduce data size by discretization
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
  - Discretization can be performed recursively on an attribute
  - Prepare for further analysis, e.g., classification
Data Discretization Methods
- Typical methods (all can be applied recursively):
  - Binning
    - Top-down split, unsupervised
  - Histogram analysis
    - Top-down split, unsupervised
  - Clustering analysis (unsupervised, top-down split or bottom-up merge)
  - Decision-tree analysis (supervised, top-down split)
  - Correlation (e.g., χ²) analysis (supervised, bottom-up merge)
Simple Discretization: Binning
- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data are not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
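A small numpy sketch that reproduces these bins and the two smoothing variants:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(prices), 3)   # equal-frequency (equi-depth) bins

# Smoothing by bin means (rounded) and by bin boundaries (nearest of min/max)
means = [np.full(len(b), int(round(b.mean()))) for b in bins]
bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]

print([b.tolist() for b in means])    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print([b.tolist() for b in bounds])   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```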
Labels (Binning vs. Clustering)
[Figure: the same data discretized by equal-interval-width binning, equal-frequency binning, and K-means clustering; K-means clustering leads to better results]
Discretization by Classification & Correlation Analysis
- Classification (e.g., decision tree analysis)
  - Supervised: given class labels, e.g., cancerous vs. benign
  - Uses entropy to determine the split point (discretization point)
  - Top-down, recursive split
  - Details to be covered in the chapter "Classification"
- Correlation analysis (e.g., ChiMerge: χ²-based discretization)
  - Supervised: uses class information
  - Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
  - Merging is performed recursively, until a predefined stopping condition is met
Concept Hierarchy Generation
- A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
- Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
- Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
- Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
- Concept hierarchies can be automatically formed for both numeric and nominal data; for numeric data, use the discretization methods shown above
Concept Hierarchy Generation for Nominal Data
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes
  - E.g., only street < city, not others
- Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions, e.g., weekday, month, quarter, year
- Example (lowest level at the bottom):
  - country: 15 distinct values
  - province_or_state: 365 distinct values
  - city: 3,567 distinct values
  - street: 674,339 distinct values
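This heuristic is easy to sketch: sort the attributes by their number of distinct values (counts taken from the example above):

```python
distinct_counts = {"country": 15, "province_or_state": 365,
                   "city": 3567, "street": 674339}

# Fewest distinct values -> top of the hierarchy; most distinct -> lowest level
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```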
Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Summary
- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing/noisy values, outliers
- Data integration from multiple sources:
  - Entity identification problem; remove redundancies; detect inconsistencies
- Data reduction
  - Dimensionality reduction; numerosity reduction; data compression
- Data transformation and data discretization
  - Normalization; concept hierarchy generation
References
- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. SIGMOD'02
- H. V. Jagadish et al. Special issue on data reduction techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997
- D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
- E. Rahm and H. H. Do. Data cleaning: problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4)
- V. Raman and J. Hellerstein. Potter's Wheel: an interactive framework for data cleaning and transformation. VLDB'01
- T. Redman. Data Quality: Management and Technology. Bantam Books, 1992
- R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995