002 Descriptive Data Analysis

Transcript
25.02.16

Data Mining MTAT.03.183
Descriptive analysis, preprocessing, visualisation...
Jaak Vilo
2016 Spring
Slides based on: Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques

Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: the KDD pipeline - Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
Why Data Preprocessing?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=" "
• noisy: containing errors or outliers
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names
  – e.g., Age="42", Birthday="03/07/1997"
  – e.g., was rating "1, 2, 3", now rating "A, B, C"
  – e.g., discrepancy between duplicate records

Why Is Data Dirty?
• Incomplete data may come from
  – "Not applicable" data value when collected
  – Different considerations between the time when the data was collected and when it is analyzed
  – Human/hardware/software problems
• Noisy data (incorrect values) may come from
  – Faulty data collection instruments
  – Human or computer error at data entry
  – Errors in data transmission
• Inconsistent data may come from
  – Different data sources
  – Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
    • e.g., duplicate or missing data may cause incorrect or even misleading statistics
  – A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• Garbage in, garbage out

High quality data requirements
High-quality data needs to pass a set of quality criteria. Those include:
• Accuracy: an aggregated value over the criteria of integrity, consistency, and density
• Integrity: an aggregated value over the criteria of completeness and validity
• Completeness: achieved by correcting data containing anomalies
• Validity: approximated by the amount of data satisfying integrity constraints
• Consistency: concerns contradictions and syntactical anomalies
• Uniformity: directly related to irregularities and in compliance with the set "unit of measure"
• Density: the quotient of missing values in the data and the number of total values that ought to be known
http://en.wikipedia.org/wiki/Data_cleansing
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
  – Accuracy
  – Completeness
  – Consistency
  – Timeliness
  – Believability
  – Value added
  – Interpretability
  – Accessibility
• Broad categories:
  – Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: the preprocessing forms illustrated]
Chapter 2: Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
[Slide: a 10x10 table of random values between 0 and 1]

Jaak Vilo and other authors
UT: Data Mining 2009
[Slides: the same random data plotted as scatter/line charts (Series1; x: 0-120, y: 0.00-1.20)]
Characterise data

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Mining Data Descriptive Characteristics
• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
  – Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions
  – Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
  – \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}
  – Weighted arithmetic mean: \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
  – Trimmed mean: chopping extreme values
• Median: a holistic measure
  – Middle value if odd number of values, or average of the middle two values otherwise
  – Estimated by interpolation (for grouped data): \text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c
• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula: \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})
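The central-tendency measures above can be sketched with Python's standard library; the trimmed mean and weighted mean are written out by hand, and the sample values reuse the price list from the binning example later in these slides:

```python
from statistics import mean, median, mode

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(mean(data))    # arithmetic mean, (1/n) * sum(x_i)
print(median(data))  # even n: average of the two middle values
print(mode(data))    # most frequent value

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
w = [1] * len(data)  # uniform weights reduce to the plain mean
wmean = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)

def trimmed_mean(xs, k):
    """Chop the k smallest and k largest values, then average the rest."""
    xs = sorted(xs)[k:len(xs) - k]
    return mean(xs)
```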
Histograms and Probability Density Functions
• Probability Density Functions
  – Total area under the curve integrates to 1
• Frequency Histograms
  – Y-axis is counts
  – Simple interpretation
  – Can't be directly related to probabilities or density functions
• Relative Frequency Histograms
  – Divide counts by the total number of observations
  – Y-axis is relative frequency
  – Can be interpreted as probabilities for each range
  – Can't be directly related to the density function
    • Bar heights sum to 1 but won't integrate to 1 unless bar width = 1
• Density Histograms
  – Divide counts by (total number of observations x bar width)
  – Y-axis is density values
  – Bar height x bar width gives the probability for each range
  – Can be directly related to the density function
    • Bar areas sum to 1
http://www.geog.ucsb.edu/~joe/lg210_w07/lecture_notes/lect04/oh07_04_1.html
Histograms
• equal sub-intervals, known as 'bins'
• breakpoints
• bin width
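The three histogram normalisations can be compared directly; a minimal stdlib-only Python sketch (the data values and bin width are made up for illustration):

```python
# Frequency, relative-frequency, and density histograms built by hand.
from collections import Counter

data = [0.1, 0.15, 0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.9]  # made-up sample
bin_width = 0.25  # breakpoints at 0, 0.25, 0.5, 0.75, 1.0
counts = Counter(int(x / bin_width) for x in data)  # bin index -> count

n = len(data)
rel = {b: c / n for b, c in counts.items()}                 # heights sum to 1
dens = {b: c / (n * bin_width) for b, c in counts.items()}  # areas sum to 1

assert abs(sum(rel.values()) - 1) < 1e-9
assert abs(sum(d * bin_width for d in dens.values()) - 1) < 1e-9
```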
The data are (the log of) wing spans of aircraft built from 1956 to 1984.
2-dimensional data: dot-plot
Histogram vs kernel density
• Properties of histograms, with these two examples:
  – they are not smooth
  – depend on endpoints of bins
  – depend on width of bins
• We can alleviate the first two problems by using kernel density estimators.
• To remove the dependence on the endpoints of the bins, we centre each of the blocks at each data point rather than fixing the endpoints of the blocks.
• Each data point contributes a block of width 1 and height 1/12 (the dotted boxes), as there are 12 data points; the blocks are then added up.
• Blocks: the estimate is still discontinuous, as we have used a discontinuous kernel as our building block.
• If we use a smooth kernel for our building block, then we will have a smooth density estimate.

• It is important to choose the most appropriate bandwidth, as a value that is too small or too large is not useful.
• If we use a normal (Gaussian) kernel with bandwidth (standard deviation) of 0.1 (which has area 1/12 under each curve), then the kernel density estimate is said to be undersmoothed, as the bandwidth is too small in the figure below.
• It appears that there are 4 modes in this density; some of these are surely artifices of the data.
• The optimal value of the bandwidth for our data set is about 0.25.
• From the optimally smoothed kernel density estimate, there are two modes. As these are the log of aircraft wing span, it means that there was a group of smaller, lighter planes built, and these are clustered around 2.5 (which is about 12 m).
• The larger planes, maybe using jet engines as these were used on a commercial scale from about the 1960s, are grouped around 3.5 (about 33 m).

• Choose the optimal bandwidth
  – Methods to estimate it:
    • AMISE = Asymptotic Mean Integrated Squared Error
    • optimal bandwidth = arg min AMISE
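A minimal Gaussian KDE in plain Python, mirroring the sum-of-kernels idea above (the 12 "log wing span" values are hypothetical; the slide's real data set is not reproduced here):

```python
import math

def kde(data, bandwidth, x):
    """Gaussian KDE: average of a normal kernel centred at each data point."""
    s = sum(math.exp(-((x - d) ** 2) / (2 * bandwidth ** 2)) for d in data)
    return s / (len(data) * bandwidth * math.sqrt(2 * math.pi))

data = [2.3, 2.4, 2.5, 2.5, 2.6, 2.7, 3.3, 3.4, 3.5, 3.5, 3.6, 3.7]  # 12 points
undersmoothed = [kde(data, 0.10, x / 10) for x in range(15, 45)]  # h too small
smoothed = [kde(data, 0.25, x / 10) for x in range(15, 45)]       # h near optimal
```

By construction the estimate integrates to 1 regardless of the bandwidth; only its smoothness changes.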
• The properties of kernel density estimators, as compared to histograms, are:
  – smooth
  – no endpoints
  – depend on bandwidth
Kernel Density estimation - links
• Wikipedia: http://en.wikipedia.org/wiki/Kernel_density_estimation
• Ricardo Gutierrez-Osuna: http://research.cs.tamu.edu/prism/lectures/pr/pr_l7.pdf
• Tutorial and Java applet for testing:
  – http://parallel.vub.ac.be/research/causalModels/tutorial/kde.html
• Simple interactive web demo at U. Tartu:
  – https://courses.cs.ut.ee/demos/kernel-density-estimation/
  – http://kde.tume-maailm.pri.ee/
R - example (due to K. Tretjakov)

d = c(1,2,2,2,2,1,2,2,2,3,2,3,4,5,4,3,2,3,4,4,5,6,7)
kernelsmooth <- function(data, sigma, x) {
  # sum a Gaussian kernel centred at each data point
  result = 0
  for (d in data) {
    result = result + exp(-(x - d)^2 / 2 / sigma^2)
  }
  result / sqrt(2 * pi) / sigma
}
x = seq(min(d), max(d), by = 0.1)
y = sapply(x, function(x) { kernelsmooth(d, 1, x) })
hist(d)
lines(x, y)
[Screenshots: the interactive KDE demo at http://kde.tume-maailm.pri.ee/]
Aggregation, analysis and visualization of geodata
• http://sightsmap.com
• Several large crowd-sourced datasets:
  – The whole Panoramio photobank used by Google maps
  – The whole Wikipedia, geotags and Wikipedia article logs
  – Foursquare
  – ... more
• Aggregate data, calculate popularities of places, calculate type tags
• Translate and categorize titles and descriptions to get types
• Improve aggregation algorithms by learning
• Visualize heatmaps, type tags, aggregated sources
[Slides: sightsmap screenshots. By: Tanel Tammet]
• http://176.32.89.45/~hideaki/res/kernel.html
• http://jmlr.csail.mit.edu/proceedings/papers/v2/kontkanen07a/kontkanen07a.pdf
More links on R and kernel density
• R tutorial
  – http://cran.r-project.org/doc/manuals/R-intro.html
  – http://www.google.com/search?q=R+tutorial
• http://en.wikipedia.org/wiki/Kernel_density_estimation
• http://sekhon.berkeley.edu/stats/html/density.html
• http://stat.ethz.ch/R-manual/R-patched/library/stats/html/density.html
• http://www.google.com/search?q=kernel+density+estimation+R
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
Author: Gea Pajula
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
  – Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  – Inter-quartile range: IQR = Q3 - Q1
  – Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
  – Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  – Outlier: usually, a value higher/lower than 1.5 x IQR beyond the quartiles
• Variance and standard deviation (sample: s, population: σ)
  – Variance (algebraic, scalable computation):
    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]
    \sigma^2 = \frac{1}{N} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{n} x_i^2 - \mu^2
  – Standard deviation s (or σ) is the square root of the variance s^2 (or σ^2)

Boxplot Analysis
• Five-number summary: min, Q1, Median, Q3, max
• Boxplot
  – Data is represented with a box
  – The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  – The median is marked by a line within the box
  – Whiskers: two lines outside the box extend to Minimum and Maximum
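A stdlib sketch of the five-number summary and the 1.5 x IQR outlier rule; `statistics.quantiles` with `method='inclusive'` interpolates on the sorted data, and the sample values (with a planted outlier) are illustrative:

```python
from statistics import quantiles

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]  # 120: planted outlier
q1, med, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1
summary = (min(data), q1, med, q3, max(data))  # five-number summary

# Flag values more than 1.5 x IQR beyond the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(summary)   # (4, 15.0, 24.0, 28.0, 120)
print(outliers)  # [120]
```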
Box Plots
• Tukey 77: John W. Tukey, "Exploratory Data Analysis". Addison-Wesley, Reading, MA, 1977.
• http://informationandvisualization.de/blog/box-plot

[Figures: boxplot examples; 1.5 x IQR - Inter-Quartile Range. Credit: Gea Pajula]
Visualization of Data Dispersion: Boxplot Analysis

Violin plot

Violin plot - R
• Combining plots: a vertical kernel density alongside the boxplot
• http://gallery.r-enthusiasts.com/graph/Violin_plot,43
Properties of the Normal Distribution Curve
• The normal (distribution) curve
  – From μ-σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  – From μ-2σ to μ+2σ: contains about 95% of it
  – From μ-3σ to μ+3σ: contains about 99.7% of it
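The 68-95-99.7 rule can be checked empirically; a small simulation sketch with a fixed seed so the fractions are reproducible:

```python
import random
from statistics import mean, pstdev

random.seed(42)
xs = [random.gauss(0, 1) for _ in range(100_000)]
mu, sigma = mean(xs), pstdev(xs)

def within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return sum(mu - k * sigma <= x <= mu + k * sigma for x in xs) / len(xs)

print(within(1), within(2), within(3))  # close to 0.68, 0.95, 0.997
```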
Quantile-Quantile (q-q) Plots
• http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html
• Example of an article containing such plots:
  – http://www.kgs.ku.edu/Magellan/WaterLevels/CD/Reports/OFR04_57/rep00.htm

Cumulative Distribution Function (CDF)

A Q-Q plot comparing the distributions of standardized daily maximum temperatures at 25 stations in the US state of Ohio in March and in July. The curved pattern suggests that the central quantiles are more closely spaced in July than in March, and that the March distribution is skewed to the right compared to the July distribution. The data cover the period 1893-2001.

Parametric modeling usually involves making assumptions about the shape of data, or the shape of residuals from a regression fit. Verifying such assumptions can take many forms, but an exploration of the shape using histograms and q-q plots is very effective. The q-q plot does not have any design parameters such as the number of bins for a histogram.
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Kemmeren et al. (Mol. Cell, 2002)
• Yeast 2-hybrid studies
• Known (literature) PPI
• Randomized expression data
• MPK1 YLR350w; SNF4 YCL046W; SNF7 YGR122W

Abalone data set - P. Danelson hw.

Bias vs Variance
• Scott Fortmann-Roe: http://scott.fortmann-roe.com/docs/BiasVariance.html

Not Correlated Data
Pearson Correlation
Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

Numerical summary?

Anscombe's quartet
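Pearson's r written out from its definition (covariance over the product of standard deviations) shows the slope-invariance mentioned in the caption; the example vectors are made up:

```python
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5  # undefined (zero division) if a variance is 0

x = [1, 2, 3, 4, 5]
print(pearson(x, [3 * v for v in x]))    # 1.0 - slope does not matter
print(pearson(x, [0.5 * v for v in x]))  # 1.0
print(pearson(x, [-v for v in x]))       # -1.0, direction does
```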
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• The loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
Chapter 2: Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Graphic Displays of Basic Statistical Descriptions
• Histogram: (shown before)
• Boxplot: (covered before)
• Quantile plot: each value xi is paired with fi, indicating that approximately 100*fi% of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
• Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
Data Cleaning
• Importance
  – "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  – "Data cleaning is the number one problem in data warehousing" (DCI survey)
• Data cleaning tasks
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data
  – Resolve redundancy caused by data integration

Missing Data
• Data is not always available
  – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – inconsistency with other recorded data, and thus deleted
  – data not entered due to misunderstanding
  – certain data may not have been considered important at the time of entry
  – not registering history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  – a global constant: e.g., "unknown", a new class?!
  – the attribute mean
  – the attribute mean for all samples belonging to the same class: smarter
  – the most probable value: inference-based, such as a Bayesian formula or a decision tree

K-NN impute
• K nearest neighbours imputation (Hastie, Tibshirani, Troyanskaya, ... Stanford 1999-2001)
  – Find K neighbours on the available data points
  – Estimate the missing value
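A toy illustration of the K-NN idea, not the cited Stanford implementation: distances are computed over the observed features only, and the missing entry is filled with the mean of that feature over the K nearest complete rows (data values and the helper name are hypothetical):

```python
def knn_impute(rows, target, col, k=2):
    """Fill target[col] using the k complete rows nearest on the other features."""
    def dist(a, b):
        # Euclidean distance, skipping the missing column.
        return sum((x - y) ** 2
                   for i, (x, y) in enumerate(zip(a, b)) if i != col) ** 0.5

    neighbours = sorted(rows, key=lambda r: dist(r, target))[:k]
    return sum(r[col] for r in neighbours) / k

rows = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [9.0, 9.0, 9.0]]
print(knn_impute(rows, [1.05, 2.05, None], col=2))  # 3.1, mean of the 2 near rows
```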
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitation
  – inconsistency in naming convention
• Other data problems which require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data
How to Handle Noisy Data?
• Binning
  – first sort data and partition into (equal-frequency) bins
  – then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Regression
  – smooth by fitting the data into regression functions
• Clustering
  – detect and remove outliers
• Combined computer and human inspection
  – detect suspicious values and have them checked by a human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning
  – Divides the range into N intervals of equal size: uniform grid
  – if A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N
  – The most straightforward, but outliers may dominate the presentation
  – Skewed data is not handled well
• Equal-depth (frequency) partitioning
  – Divides the range into N intervals, each containing approximately the same number of samples
  – Good data scaling
  – Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Regression
[Figure: data points fitted with the regression line y = x + 1; the point (X1, Y1) is smoothed to (X1, Y1') on the line]
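The worked example can be reproduced in a few lines of Python (a sketch: the slide rounds bin means to integers, and ties in the boundary rule are sent to the lower boundary here):

```python
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins  # equal-frequency (equi-depth) bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means, rounded to integers as on the slide.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# Smoothing by bin boundaries: each value moves to the nearer boundary.
by_bounds = [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```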
Cluster Analysis
[Figure: clusters G1, G2, G3, with outliers falling outside all clusters]

Data Cleaning as a Process
• Data discrepancy detection
  – Use metadata (e.g., domain, range, dependency, distribution)
  – Check field overloading
  – Check uniqueness rule, consecutive rule and null rule
  – Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    • Data auditing: analyze data to discover rules and relationships and detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  – Data migration tools: allow transformations to be specified
  – ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  – Iterative and interactive (e.g., Potter's Wheels)
Chapter 2: Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Integration
• Data integration:
  – Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  – Integrate metadata from different sources
• Entity identification problem:
  – Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  – For the same real world entity, attribute values from different sources are different
  – Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data occur often when integrating multiple databases
  – Object identification: the same attribute or object may have different names in different databases
  – Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detectable by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Normalisation
• Making data comparable...

Elements of microarray statistics
• Reference and Test channels (R: red, G: green)
• M = log2 R - log2 G = log2(R/G)
• A = 1/2 (log2 R + log2 G)
[Figure: normalisation curves for Series 1-4 (y: -0.6 .. 0.8, x: 0 .. 120)]
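Computing M (the log-ratio) and A (the average log-intensity) for a single two-channel measurement, with hypothetical intensities R and G:

```python
import math

def ma_transform(R, G):
    """M/A values for one two-channel (red/green) measurement."""
    M = math.log2(R / G)                    # equals log2(R) - log2(G)
    A = 0.5 * (math.log2(R) + math.log2(G))
    return M, A

print(ma_transform(4.0, 1.0))  # (2.0, 1.0)
```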
Normalisation can be used to transform data
[Screenshot: Expression Profiler]
Chapter 2: Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Data Reduction Strategies
• Why data reduction?
  – A database/data warehouse may store terabytes of data
  – Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  – Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
• Data reduction strategies
  – Data cube aggregation
  – Dimensionality reduction, e.g., remove unimportant attributes
  – Data compression
  – Numerosity reduction, e.g., fit data into models
  – Discretization and concept hierarchy generation

Discretization
• Three types of attributes:
  – Nominal: values from an unordered set, e.g., color, profession
  – Ordinal: values from an ordered set, e.g., military or academic rank
  – Continuous: real numbers, e.g., integer or real numbers
• Discretization:
  – Divide the range of a continuous attribute into intervals
  – Some classification algorithms only accept categorical attributes
  – Reduce data size by discretization
  – Prepare for further analysis
Discretization and Concept Hierarchy
• Discretization
  – Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  – Interval labels can then be used to replace actual data values
  – Supervised vs. unsupervised
  – Split (top-down) vs. merge (bottom-up)
  – Discretization can be performed recursively on an attribute
• Concept hierarchy formation
  – Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
• E.g., age counted under two alternative interval schemes:
  – 0-20, 21-40, 41-60, 61-105
  vs.
  – 0-18 y, 19-65 y, 66-105 y
  (counts on the slide: 1580, 8394, 2700, 1780, 4200, 4767, 2900)
Segmentation by Natural Partitioning
• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
  – If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  – If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  – If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of 3-4-5 Rule
count vs. profit: Min = -$351,976.00; Low (i.e., 5%-tile) = -$159,876; High (i.e., 95%-tile) = $1,838,761; Max = $4,700,896.50
Step 1: find Min, Max, Low and High
Step 2: msd = $1,000,000; round Low down to -$1,000,000 and High up to $2,000,000; the range covers 3 distinct values at the msd, so partition into 3 equi-width ranges: (-$1,000,000 .. 0], (0 .. $1,000,000], ($1,000,000 .. $2,000,000]
Step 3: adjust with the real Min and Max: (-$400,000 .. 0], (0 .. $1,000,000], ($1,000,000 .. $2,000,000], ($2,000,000 .. $5,000,000]
Step 4: partition each range recursively by the same rule (worked out in full on the next slide)
Recursive ... Example
MIN = -351,976.00
MAX = 4,700,896.50
LOW = 5th percentile: -159,876
HIGH = 95th percentile: 1,838,761
msd = 1,000,000 (most significant digit)
LOW rounded down: -1,000,000; HIGH rounded up: 2,000,000

1. (-1,000,000 .. 0]
2. (0 .. 1,000,000]
3. (1,000,000 .. 2,000,000]
Adjust with real MIN and MAX:
1. (-400,000 .. 0]
2. (0 .. 1,000,000]
3. (1,000,000 .. 2,000,000]
4. (2,000,000 .. 5,000,000]

1.1. (-400,000 .. -300,000]
1.2. (-300,000 .. -200,000]
1.3. (-200,000 .. -100,000]
1.4. (-100,000 .. 0]
2.1. (0 .. 200,000]
2.2. (200,000 .. 400,000]
2.3. (400,000 .. 600,000]
2.4. (600,000 .. 800,000]
2.5. (800,000 .. 1,000,000]
3.1. (1,000,000 .. 1,200,000]
3.2. (1,200,000 .. 1,400,000]
3.3. (1,400,000 .. 1,600,000]
3.4. (1,600,000 .. 1,800,000]
3.5. (1,800,000 .. 2,000,000]
4.1. (2,000,000 .. 3,000,000]
4.2. (3,000,000 .. 4,000,000]
4.3. (4,000,000 .. 5,000,000]
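One partitioning step of the 3-4-5 rule can be sketched as follows (the msd computation and rounding details are assumptions; the recursive adjustment with the real MIN/MAX is omitted):

```python
def three_4_5_step(low, high):
    """One 3-4-5 partitioning step over the integer range [low, high]."""
    width = high - low
    msd = 10 ** (len(str(int(width))) - 1)  # unit of the most significant digit
    distinct = round(width / msd)           # distinct msd values covered
    parts = {3: 3, 6: 3, 7: 3, 9: 3,        # 3, 6, 7, 9 values -> 3 intervals
             2: 4, 4: 4, 8: 4,              # 2, 4, 8 values    -> 4 intervals
             1: 5, 5: 5, 10: 5}[distinct]   # 1, 5, 10 values   -> 5 intervals
    step = width // parts
    return [(low + i * step, low + (i + 1) * step) for i in range(parts)]

print(three_4_5_step(-1_000_000, 2_000_000))
# [(-1000000, 0), (0, 1000000), (1000000, 2000000)]
print(three_4_5_step(0, 1_000_000))  # five sub-ranges of width 200,000
```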
Concept Hierarchy Generation for Categorical Data
• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  – street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
  – {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
  – E.g., for the set of attributes {street, city, state, country}, only street < city is given, not the others
• Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  – The attribute with the most distinct values is placed at the lowest level of the hierarchy
  – Exceptions, e.g., weekday, month, quarter, year
• Example:
  – country: 15 distinct values
  – province_or_state: 365 distinct values
  – city: 3,567 distinct values
  – street: 674,339 distinct values

Chapter 2: Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
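The distinct-value heuristic amounts to sorting attributes by their counts; a minimal sketch using the counts from the slide:

```python
# Fewest distinct values -> top of the hierarchy; most -> bottom.
distinct = {"street": 674_339, "city": 3_567, "province_or_state": 365, "country": 15}
hierarchy = sorted(distinct, key=distinct.get)  # top level first
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```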
Summary
• Data preparation or preprocessing is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
  – Data cleaning and data integration
  – Data reduction and feature selection
  – Discretization
• A lot of methods have been developed, but data preprocessing is still an active area of research
References
• D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
• T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02.
• H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
• E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4, 2000.
• V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001.
• T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.
• Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
• R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.