Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
MTAT.03.183 Data Mining — Jaak Vilo, 2016 Spring (25.02.16). Slides after Jiawei Han, Micheline Kamber, and Jian Pei, "Data Mining: Concepts and Techniques".

Knowledge Discovery (KDD) Process
- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process:
  Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
- Before data mining proper: descriptive analysis, preprocessing, visualisation, ...

Why Is Data Dirty?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation=""
- noisy: containing errors or outliers; e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names; e.g.:
  - Age="42" but Birthday="03/07/1997"
  - was rating "1, 2, 3", now rating "A, B, C"
  - discrepancy between duplicate records

Incomplete data may come from:
- "Not applicable" data value when collected
- different considerations between the time when the data was collected and when it is analyzed
- human/hardware/software problems
Noisy data (incorrect values) may come from:
- faulty data collection instruments
- human or computer error at data entry
- errors in data transmission
Inconsistent data may come from:
- different data sources
- functional dependency violation (e.g., modifying some linked data)
Duplicate records also need data cleaning.

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data: e.g., duplicate or missing data may cause incorrect or even misleading statistics
- A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
- Garbage in, garbage out

High-quality data requirements
High-quality data needs to pass a set of quality criteria. Those include:
- Accuracy: an aggregated value over the criteria of integrity, consistency, and density
- Integrity: an aggregated value over the criteria of completeness and validity
- Completeness: achieved by correcting data containing anomalies
- Validity: approximated by the amount of data satisfying integrity constraints
- Consistency: concerns contradictions and syntactical anomalies
- Uniformity: directly related to irregularities and compliance with the set "unit of measure"
- Density: the quotient of missing values in the data and the number of total values that ought to be known
(http://en.wikipedia.org/wiki/Data_cleansing)

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance for numerical data
[Slide: Forms of Data Preprocessing — diagram of cleaning, integration, transformation, reduction]

Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

[Slide: a 10 x 10 table of random values in [0, 1] — a raw numeric dump is hard to interpret without summarization]
[Slides: the same values plotted against their index (0..120) — plots reveal structure that the raw table hides]
(Jaak Vilo and other authors, UT: Data Mining 2009)

Mining Data Descriptive Characteristics
Characterize data, e.g. with a DMQL-style query:
  use Big_University_DB
  mine characteristics as "Science_Students"
  in relevance to name, gender, major, birth_date, residence, phone#, gpa
  from student
  where status in graduate
Motivation:
- To better understand the data: central tendency, variation and spread
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
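These summary measures can all be computed directly with the Python standard library; a minimal sketch (the values reuse the sorted price data from the binning example later in these slides):

```python
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

mean = statistics.mean(data)                   # central tendency
median = statistics.median(data)               # robust central tendency
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (default 'exclusive' method)
variance = statistics.variance(data)           # sample variance, n-1 denominator
stdev = statistics.stdev(data)

print(min(data), q1, median, q3, max(data))    # five-number summary
```

Note that quartile conventions differ between tools; `statistics.quantiles` interpolates, so its Q1/Q3 need not coincide with hand-computed medians of the data halves.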
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures:
  - folding measures into numerical dimensions
  - boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
- Mean (algebraic measure), sample vs. population:
  x̄ = (1/n) Σ xᵢ (sample);  μ = (1/N) Σ xᵢ (population)
- Weighted arithmetic mean: x̄ = (Σ wᵢxᵢ) / (Σ wᵢ)
- Trimmed mean: chopping extreme values before averaging
- Median: a holistic measure
  - middle value if there is an odd number of values; average of the middle two values otherwise
  - estimated by interpolation for grouped data:
    median = L1 + ((n/2 − (Σ f)_l) / f_median) × c
    (L1: lower boundary of the median interval; (Σ f)_l: sum of frequencies below it; f_median: frequency of the median interval; c: interval width)
- Mode:
  - value that occurs most frequently in the data; unimodal, bimodal, trimodal
  - empirical formula: mean − mode ≈ 3 × (mean − median)

Histograms and Probability Density Functions
- Probability density functions: total area under the curve integrates to 1
- Frequency histograms: y-axis is counts; simple interpretation; can't be directly related to probabilities or density functions
- Relative frequency histograms: divide counts by the total number of observations; y-axis is relative frequency; can be interpreted as probabilities for each range; still can't be directly related to a density function (bar heights sum to 1 but won't integrate to 1 unless bar width = 1)
- Density histograms: divide counts by (total number of observations × bar width); y-axis is density values; bar height × bar width gives the probability for each range; can be directly related to a density function (bar areas sum to 1)
(http://www.geog.ucsb.edu/~joe/lg210_w07/lecture_notes/lect04/oh07_04_1.html)

Histograms
- equal sub-intervals, known as 'bins'
- breakpoints
- bin width
Example data: (the log of) wing spans of aircraft built from 1956 to 1984.

2-dimensional data: dot-plot.

Histogram vs. kernel density
- Properties of histograms, seen in these two examples:
  - they are not smooth
  - they depend on the endpoints of the bins
  - they depend on the width of the bins
- We can alleviate the first two problems by using kernel density estimators
- To remove the dependence on the endpoints of the bins, we centre each of the blocks at each data point rather than fixing the endpoints of the blocks
- E.g., place a block of width 1 and height 1/12 at each of the 12 data points (the dotted boxes), then add them up
- Blocks: the estimate is still discontinuous, as we have used a discontinuous kernel as our building block; if we use a smooth kernel for our building block, we get a smooth density estimate

Bandwidth
- It is important to choose the most appropriate bandwidth, as a value that is too small or too large is not useful
- If we use a normal (Gaussian) kernel with bandwidth (standard deviation) 0.1, which has area 1/12 under each curve, then the kernel density estimate is said to be undersmoothed, as the bandwidth is too small: it appears that there are 4 modes in this density, and some of these are surely artifacts of the data
- The optimal value of the bandwidth for our data set is about 0.25
- From the optimally smoothed kernel density estimate there are two modes. As these are the log of aircraft wing span, it means that there was a group of smaller, lighter planes built, clustered around 2.5 (which is about 12 m)
- Whereas the larger planes, maybe using jet engines as these were used on a commercial scale from about the 1960s, are grouped around 3.5 (about 33 m)
- Choosing the optimal bandwidth — methods exist to estimate it:
  - AMISE = Asymptotic Mean Integrated Squared Error
  - optimal bandwidth = arg min AMISE
- The properties of kernel density estimators, as compared to histograms:
  - smooth
  - no endpoints
  - depend on the bandwidth

Kernel density estimation — links
- Wikipedia: http://en.wikipedia.org/wiki/Kernel_density_estimation
- Ricardo Gutierrez-Osuna: http://research.cs.tamu.edu/prism/lectures/pr/pr_l7.pdf
- Tutorial and Java applet for testing: http://parallel.vub.ac.be/research/causalModels/tutorial/kde.html
- Simple interactive web demos at U. Tartu:
  - https://courses.cs.ut.ee/demos/kernel-density-estimation/
  - http://kde.tume-maailm.pri.ee/

R example (due K. Tretjakov), after http://parallel.vub.ac.be/research/causalModels/tutorial/kde.html — a Gaussian kernel density estimate overlaid on a histogram. (Note: the kernel sum is not divided by the number of points, so the curve is scaled to the count histogram rather than being a normalized density.)

  d <- c(1,2,2,2,2,1,2,2,2,3,2,3,4,5,4,3,2,3,4,4,5,6,7)
  kernelsmooth <- function(data, sigma, x) {
    result <- 0
    for (di in data) {   # loop variable renamed so it does not shadow the vector d
      result <- result + exp(-(x - di)^2 / 2 / sigma^2)
    }
    result / sqrt(2 * pi) / sigma   # sum of Gaussian kernels evaluated at x
  }
  x <- seq(min(d), max(d), by = 0.1)
  y <- sapply(x, function(x) kernelsmooth(d, 1, x))
  hist(d)
  lines(x, y)

Aggregation, analysis and visualization of geodata (by Tanel Tammet)
- http://sightsmap.com
- Several large crowd-sourced datasets:
  - the whole Panoramio photobank used by Google Maps
  - the whole Wikipedia: geotags and Wikipedia article logs
  - Foursquare
  - ... more
- Aggregate data, calculate popularities of places, calculate type tags
- Translate and categorize titles and descriptions to get types
- Improve aggregation algorithms by learning
- Visualize heatmaps, type tags, aggregated sources

More kernel density links
- http://176.32.89.45/~hideaki/res/kernel.html
- http://jmlr.csail.mit.edu/proceedings/papers/v2/kontkanen07a/kontkanen07a.pdf

More links on R and kernel density
- R tutorials: http://cran.r-project.org/doc/manuals/R-intro.html and http://www.google.com/search?q=R+tutorial
- http://en.wikipedia.org/wiki/Kernel_density_estimation
- http://sekhon.berkeley.edu/stats/html/density.html
- http://stat.ethz.ch/R-manual/R-patched/library/stats/html/density.html
- http://www.google.com/search?q=kernel+density+estimation+R
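A common practical stand-in for the arg-min-AMISE bandwidth is Silverman's rule of thumb, which is AMISE-optimal when the underlying density is Gaussian. A Python sketch (the lecture's examples use R, so this is just an illustration of the same idea; this KDE is normalized, unlike the R demo above):

```python
import math

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5) for a Gaussian kernel."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample std dev
    return 1.06 * sigma * n ** (-1 / 5)

def kde(data, h, x):
    """Normalized Gaussian kernel density estimate at point x."""
    n = len(data)
    s = sum(math.exp(-((x - d) ** 2) / (2 * h * h)) for d in data)
    return s / (n * h * math.sqrt(2 * math.pi))
```

R's `density()` uses a closely related default (`bw.nrd0`, with a 0.9 factor and an IQR term for robustness).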
Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
(Author: Gea Pajula)

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots:
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ):
  - sample variance (algebraic, scalable computation):
    s² = (1/(n−1)) Σ(xᵢ − x̄)² = (1/(n−1)) [ Σxᵢ² − (1/n)(Σxᵢ)² ]
  - population variance:
    σ² = (1/N) Σ(xᵢ − μ)² = (1/N) Σxᵢ² − μ²
  - the standard deviation s (or σ) is the square root of the variance s² (or σ²)

Boxplot Analysis
- Five-number summary: min, Q1, M, Q3, max
- Data is represented with a box: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to minimum and maximum (often drawn at 1.5 × IQR, the inter-quartile range)

Box plots — references
- Tukey 77: John W. Tukey, "Exploratory Data Analysis". Addison-Wesley, Reading, MA, 1977
- http://informationandvisualization.de/blog/box-plot

[Slides: boxplot examples; Visualization of Data Dispersion: Boxplot Analysis (Gea Pajula)]

Violin plot
- Combining plots: a boxplot with a vertical kernel density
- Violin plot in R: http://gallery.r-enthusiasts.com/graph/Violin_plot,43

Properties of Normal Distribution Curve
- From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
- From μ−2σ to μ+2σ: contains about 95% of them
- From μ−3σ to μ+3σ: contains about 99.7% of them

Quantile-Quantile (q-q) Plots
- Example of an article containing such plots: http://www.kgs.ku.edu/Magellan/WaterLevels/CD/Reports/OFR04_57/rep00.htm
- http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html

Cumulative Distribution Function (CDF)

A Q–Q plot comparing the distributions of standardized daily maximum temperatures at 25 stations in the US state of Ohio in March and in July. The curved pattern suggests that the central quantiles are more closely spaced in July than in March, and that the March distribution is skewed to the right compared to the July distribution. The data cover the period 1893–2001.
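The five-number summary and the 1.5 × IQR outlier fences described above can be sketched in Python. Quartile conventions vary between tools; this uses the simple median-of-halves rule, one of several common choices:

```python
def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves quartile rule."""
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        mid = len(seq) // 2
        return seq[mid] if len(seq) % 2 else (seq[mid - 1] + seq[mid]) / 2

    lower, upper = xs[:n // 2], xs[(n + 1) // 2:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def iqr_outliers(values):
    """Values beyond the 1.5 * IQR whisker fences, as in a Tukey boxplot."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lo or x > hi]
```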
Parametric modeling usually involves making assumptions about the shape of data, or the shape of residuals from a regression fit. Verifying such assumptions can take many forms, but an exploration of the shape using histograms and q-q plots is very effective. The q-q plot does not have any design parameters such as the number of bins for a histogram.

Scatter plot
- Provides a first look at bivariate data, to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
- Example: Kemmeren et al. (Mol. Cell, 2002) — randomized expression data vs. yeast 2-hybrid studies and known (literature) PPI; gene pairs such as MPK1/YLR350W, SNF4/YCL046W, SNF7/YGR122W
- Example: Abalone data set (P. Danelson hw.)

Bias vs Variance
- Scott Fortmann-Roe: http://scott.fortmann-roe.com/docs/BiasVariance.html

Not Correlated Data

Pearson Correlation
Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

Numerical summary?
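Pearson's r itself is simple to compute; a Python sketch (sample data invented for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # undefined (division by zero) if either variance is zero

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # exactly linear: r ≈ 1
print(pearson_r(xs, [-x for x in xs]))         # perfectly anti-correlated: r ≈ -1
```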
Anscombe's quartet
- Four data sets with (nearly) identical numerical summaries but very different scatter plots

Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Graphic Displays of Basic Statistical Descriptions
- Histogram (shown before)
- Quantile plot: each value xᵢ is paired with fᵢ, indicating that approximately 100 fᵢ % of the data are ≤ xᵢ
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Boxplot (covered before)
- Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Data Cleaning
- Importance:
  - "Data cleaning is one of the three biggest problems in data warehousing" — Ralph Kimball
  - "Data cleaning is the number one problem in data warehousing" — DCI survey
- Data cleaning tasks:
  - fill in missing values
  - identify outliers and smooth out noisy data
  - correct inconsistent data
  - resolve redundancy caused by data integration

Missing Data
- Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with:
  - a global constant, e.g., "unknown" — a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

K-NN impute
- K nearest neighbours imputation:
  - find the K nearest neighbours on the available data points
  - estimate the missing value from them
- (Hastie, Tibshirani, Troyanskaya, … Stanford 1999-2001)

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
- Other data problems which require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
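Smoothing by bin means can be sketched in Python, using the sorted price data from the binning example in these slides (the slides round the bin means to 9, 23, 29):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into n_bins bins of equal size (len divisible by n_bins)."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_bin_means(values, n_bins):
    """Replace each value by the mean of its bin."""
    out = []
    for b in equal_frequency_bins(values, n_bins):
        mean = sum(b) / len(b)
        out.extend([mean] * len(b))
    return out

print(smooth_by_bin_means(prices, 3))
# bins [4,8,9,15], [21,21,24,25], [26,28,29,34] -> means 9.0, 22.75, 29.25
```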
- Clustering: detect and remove outliers
- Regression: smooth by fitting the data into regression functions
- Combined computer and human inspection: detect suspicious values and check by a human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size: a uniform grid
  - if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  - the most straightforward, but outliers may dominate the presentation; skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - divides the range into N intervals, each containing approximately the same number of samples
  - good data scaling
  - managing categorical attributes can be tricky

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

[Slides: Regression — smoothing by fitting a line y = x + 1 through points (X1, Y1); Cluster Analysis — clusters G1, G2, G3 with outliers]

Data Cleaning as a Process
- Data discrepancy detection:
  - use metadata (e.g., domain, range, dependency, distribution)
  - check field overloading
  - check uniqueness rule, consecutive rule and null rule
  - use commercial tools:
    - data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    - data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
- Data migration and integration:
  - data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
- Integration of the two processes:
  - iterative and interactive (e.g., Potter's Wheel)

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different; possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
- Redundant data occur often when integrating multiple databases:
  - object identification: the same attribute or object may have different names in different databases
  - derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Normalisation
- Making data comparable…
- Normalisation can be used to transform data (e.g., in Expression Profiler)

Elements of microarray statistics
- For the two channel intensities (Reference and Test, with values R and G):
  M = log2 R − log2 G = log2(R/G)
  A = (1/2)(log2 R + log2 G)
[Slide: series plot of reference vs. test intensities across samples]
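The M and A values above are just a log-ratio / average-log transform of the two channel intensities; a Python sketch (the intensity values are invented):

```python
import math

def ma_transform(R, G):
    """M (log-ratio) and A (average log-intensity) for two channel intensities R, G."""
    M = math.log2(R / G)                     # = log2(R) - log2(G)
    A = 0.5 * (math.log2(R) + math.log2(G))  # = log2(sqrt(R * G))
    return M, A

M, A = ma_transform(R=800.0, G=200.0)
print(M, A)  # M = 2.0, since R is 4x G; A is the log2 of the geometric mean
```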
Data Reduction Strategies
- Why data reduction?
  - a database/data warehouse may store terabytes of data
  - complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies:
  - data cube aggregation
  - dimensionality reduction — e.g., remove unimportant attributes
  - data compression
  - numerosity reduction — e.g., fit data into models
  - discretization and concept hierarchy generation

Discretization
- Three types of attributes:
  - nominal — values from an unordered set, e.g., color, profession
  - ordinal — values from an ordered set, e.g., military or academic rank
  - continuous — real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduce data size by discretization
  - prepare for further analysis

Discretization and Concept Hierarchy
- Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
  - supervised vs. unsupervised
  - split (top-down) vs. merge (bottom-up)
  - discretization can be performed recursively on an attribute
- Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
- E.g., binning age into fixed-width intervals (0-20, 21-40, 41-60, 61-105) vs. semantically meaningful intervals (0-18 y, 19-65 y, 66-105 y)

Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- if an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of 3-4-5 Rule
- Profit data: MIN = -351,976.00, MAX = 4,700,896.50
- LOW = 5th percentile = -159,876; HIGH = 95th percentile = 1,838,761
- msd (most significant digit) = 1,000,000; round LOW down to -1,000,000, round HIGH up to 2,000,000
- Step 1: (-1,000,000 .. 2,000,000] covers 3 distinct values at the msd, giving 3 value ranges:
  1. (-1,000,000 .. 0]
  2. (0 .. 1,000,000]
  3. (1,000,000 .. 2,000,000]
- Step 2: adjust with the real MIN and MAX:
  (-400,000 .. 0], (0 .. 1,000,000], (1,000,000 .. 2,000,000], (2,000,000 .. 5,000,000]
- Step 3: recursively partition each interval:
  1.1 (-400,000 .. -300,000], 1.2 (-300,000 .. -200,000], 1.3 (-200,000 .. -100,000], 1.4 (-100,000 .. 0]
  2.1 (0 .. 200,000], 2.2 (200,000 .. 400,000], 2.3 (400,000 .. 600,000], 2.4 (600,000 .. 800,000], 2.5 (800,000 .. 1,000,000]
  3.1 (1,000,000 .. 1,200,000], 3.2 (1,200,000 .. 1,400,000], 3.3 (1,400,000 .. 1,600,000], 3.4 (1,600,000 .. 1,800,000], 3.5 (1,800,000 .. 2,000,000]
  4.1 (2,000,000 .. 3,000,000], 4.2 (3,000,000 .. 4,000,000], 4.3 (4,000,000 .. 5,000,000]
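The interval-count choice at the heart of the 3-4-5 rule can be sketched in Python; this simplified version performs only the top-level split of an msd-rounded range (no percentile trimming, boundary adjustment, or recursion):

```python
def partition_345(low, high, msd):
    """Split [low, high], assumed already rounded to the most significant digit msd,
    into 3, 4 or 5 equi-width intervals per the 3-4-5 rule."""
    distinct = round((high - low) / msd)  # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    elif distinct in (1, 5, 10):
        k = 5
    else:
        k = distinct  # fallback: one interval per msd step
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

print(partition_345(-1_000_000, 2_000_000, 1_000_000))
# covers 3 distinct msd values -> 3 equi-width ranges of 1,000,000 each
```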
Concept Hierarchy Generation for Categorical Data
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts: street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping: {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes: e.g., for the attribute set {street, city, state, country}, only street < city is given
- Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values

Automatic Concept Hierarchy Generation
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy:
  country (15 distinct values) < province_or_state (365) < city (3567) < street (674,339)
- Exceptions exist, e.g., weekday, month, quarter, year

Summary
- Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
- Descriptive data summarization is needed for quality data preprocessing
- Data preparation includes:
  - data cleaning and data integration
  - data reduction and feature selection
  - discretization
- A lot of methods have been developed, but data preprocessing is still an active area of research

References
- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of ACM, 42:73-78, 1999
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02
- H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997
- D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
- E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4
- V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001
- T. Redman. Data Quality: Management and Technology. Bantam Books, 1992
- Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of ACM, 39:86-95, 1996
- R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995