Noise & Data Reduction







• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation
• Testing Hypotheses
Remember: Central Limit Theorem
• The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity.
• If the samples are drawn from a population with mean μ and standard deviation σ, then the mean of the sampling distribution is μ and its standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{N}$, which decreases as N increases.
• These statements hold irrespective of the shape of the original distribution.

Z Test

$$ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2} \quad \text{(population standard deviation)} $$
t Test

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{N}}, \qquad s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2} \quad \text{(sample standard deviation)} $$

• Used when the population standard deviation is unknown and the sample is small
• μ: population mean, $\bar{x}$: sample mean
p Values
• The probability of obtaining a particular sample statistic given the null hypothesis is called the p value.
• Commonly we reject H0 when this probability is low, say p < .05; the null hypothesis is then rejected but might still be true.
• We find the probabilities by looking them up in tables, or statistics packages provide them.
• By convention, one usually does not reject the null hypothesis unless p < 0.05 (statistically significant).
Example
• Five cars are parked; the mean price of the cars is 20,270 € and the sample standard deviation is 5,811 €.
• The mean price of cars in town (the population) is 12,000 €.
• H0 hypothesis: the parked cars are as expensive as the cars in town.

$$ t = \frac{20{,}270 - 12{,}000}{5{,}811 / \sqrt{5}} \approx 3.18 $$

• With N - 1 = 4 degrees of freedom, t = 3.18 corresponds to a probability below 0.025, so reject H0.
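A minimal Python sketch of this one-sample t test, computed directly from the summary statistics quoted above:

```python
import math

# One-sample t test from the slide's summary statistics.
x_bar = 20270.0   # sample mean (EUR)
mu0 = 12000.0     # hypothesized population mean (EUR)
s = 5811.0        # sample standard deviation (EUR)
n = 5             # sample size

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t_stat:.2f}, df = {n - 1}")   # t ≈ 3.18, df = 4
```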
Paired Sample t Test
• Given a set of paired observations (from two normal populations):

A    B    δ = A - B
x1   y1   x1 - y1
x2   y2   x2 - y2
x3   y3   x3 - y3
x4   y4   x4 - y4
x5   y5   x5 - y5

• Calculate the mean $\bar{x}_\delta$ and the standard deviation s of the differences δ.
• H0: μ_δ = 0 (no difference), or H0: μ_δ = k (the difference is a constant).

$$ t = \frac{\bar{x}_\delta - \mu_\delta}{\hat{\sigma}}, \qquad \hat{\sigma} = \frac{s}{\sqrt{N}} $$
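A minimal sketch of the paired test in Python; the before/after measurements are assumed, not from the slides:

```python
import math

# Paired-sample t test on assumed paired measurements A and B.
a = [12.0, 15.5, 11.2, 14.8, 13.1]
b = [10.9, 14.2, 11.5, 13.0, 12.4]

d = [x - y for x, y in zip(a, b)]        # differences delta = A - B
n = len(d)
mean_d = sum(d) / n
s = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))

t_stat = (mean_d - 0.0) / (s / math.sqrt(n))   # H0: mean difference is 0
print(f"t = {t_stat:.2f}, df = {n - 1}")
```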
Confidence Intervals (σ known)
• Standard error from the population standard deviation σ:

$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} $$

• The 95 percent confidence interval for a normal distribution is about the mean:

$$ \bar{x} \pm 1.96\,\sigma_{\bar{x}} $$

Confidence Interval (σ unknown)
• Standard error estimated from the sample standard deviation s:

$$ \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{N}} $$

• The 95 percent confidence interval for the t distribution ($t_{0.025}$ from a table) is:

$$ \bar{x} \pm t_{0.025}\,\hat{\sigma}_{\bar{x}} $$
Previous Example:
[Figure omitted]
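As an illustration of the interval above, a minimal Python sketch computing a 95% t-based confidence interval; the sample values and the use of SciPy for the critical value are assumptions, not part of the slides:

```python
import math
from scipy import stats   # assumed dependency for the t critical value

x = [20.1, 18.7, 22.3, 19.5, 21.0]           # assumed sample
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
se = s / math.sqrt(n)                         # estimated standard error

t_crit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95% -> t_{0.025}
print(f"95% CI: [{mean - t_crit * se:.2f}, {mean + t_crit * se:.2f}]")
```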
Overview: Data Transformation
• Reduce noise
• Reduce data
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
• Min-max normalization to [new_minA, new_maxA]:

$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 $$

• Z-score normalization (μ_A: mean, σ_A: standard deviation):

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

  Ex. Let μ = 54,000, σ = 16,000. Then

$$ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 $$

• Normalization by decimal scaling: divide by a power of 10 so that all scaled values fall in (-1, 1).
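A minimal Python sketch of the two normalizations, reproducing the income example above:

```python
# Min-max and z-score normalization applied to the income example.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(z_score(73_600, 54_000, 16_000))             # 1.225
```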
How to Handle Noisy Data?
(How to Reduce Features?)
• Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then one can smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  - smooth by fitting the data to regression functions
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
Data Reduction Strategies
• A data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction: remove unimportant attributes
  - Data compression
  - Numerosity reduction: fit data into models
  - Discretization and concept hierarchy generation
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method, but outliers may dominate the presentation
  - Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (the min and max of each bin are identified; each bin value is replaced by the closest boundary value):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
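A minimal Python sketch of equal-depth binning with smoothing by bin means, using the price data above:

```python
# Equal-depth binning and smoothing by bin means (price data from the slide).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

smoothed = []
for b in bins:
    mean = round(sum(b) / len(b))
    smoothed.extend([mean] * len(b))

print(smoothed)   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```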
Cluster Analysis
Regression
[Figure: data points fitted by the regression line y = x + 1]
Heuristic Feature Selection Methods
• There are 2^d - 1 possible feature subsets of d features
• Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests
  - Best step-wise feature selection (see the sketch after this list):
    • The best single feature is picked first
    • Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    • Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    • Use feature elimination and backtracking
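A minimal sketch of best step-wise (greedy forward) feature selection; the `score` function (e.g., cross-validated accuracy) and the feature names are assumptions for illustration:

```python
# Greedy forward selection: repeatedly add the feature that most improves the score.
def forward_selection(all_features, score, k):
    selected = []
    while len(selected) < k:
        best_f, best_s = None, float("-inf")
        for f in all_features:
            if f in selected:
                continue
            s = score(selected + [f])       # evaluate the candidate subset
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
    return selected
```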
Sampling: with or without Replacement
[Figure: drawing a sample from the raw data]
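A minimal Python sketch contrasting sampling without and with replacement (the data are assumed):

```python
import random

# Simple random sampling from assumed raw data.
data = list(range(100))

without_repl = random.sample(data, k=10)    # without replacement: no item appears twice
with_repl = random.choices(data, k=10)      # with replacement: items may repeat
print(without_repl, with_repl)
```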
From Covariance Matrix to PCA (Principal Component Analysis) and Dimension Reduction
Feature space
[Figure: original feature axes X1, X2 and rotated axes Y1, Y2]
• A sample of n patterns $x^{(1)}, x^{(2)}, \dots, x^{(k)}, \dots, x^{(n)}$, each a d-dimensional feature vector

$$ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} $$

• Distance between two feature vectors x and y:

$$ \|x - y\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} $$
Scaling
• A well-known scaling method consists of subtracting the mean and dividing by the standard deviation:

$$ y_i = \frac{x_i - m_i}{s_i} $$

  where $m_i$ is the sample mean and $s_i$ the sample standard deviation of feature i.
• According to the scaled metric, the squared norm of the scaled feature vector is

$$ \|y\|_s^2 = \sum_{i=1}^{n} \frac{(x_i - m_i)^2}{s_i^2} $$

  - shrinking large-variance features ($s_i > 1$)
  - stretching low-variance features ($s_i < 1$)
• Fails to preserve distances when a general linear transformation is applied!

Covariance
• Measures the tendency of two features $x_i$ and $x_j$ to vary in the same direction
• The covariance between features $x_i$ and $x_j$, estimated over n patterns:

$$ c_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{n - 1} $$

$$ C = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1d} \\ c_{21} & c_{22} & \dots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \dots & c_{dd} \end{pmatrix} $$
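A minimal numpy sketch for estimating the covariance matrix C from n patterns (rows) with d features (columns); the data are randomly generated for illustration:

```python
import numpy as np

# Estimate the d x d covariance matrix from an n x d data matrix.
X = np.random.default_rng(0).normal(size=(100, 3))   # assumed data: n = 100, d = 3
C = np.cov(X, rowvar=False)                          # uses the (n - 1) denominator, as above
print(C.shape)                                       # (3, 3)
```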
Correlation
• Covariances are symmetric: $c_{ij} = c_{ji}$
• Covariance is related to correlation:

$$ r_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{(n-1)\, s_i s_j} = \frac{c_{ij}}{s_i s_j} \in [-1, 1] $$
Karhunen-Loève Transformation
• The covariance matrix C (a d × d matrix) is symmetric and positive definite

$$ U^{T} C U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d), \qquad |\lambda I - C| = 0 $$

• There are d eigenvalues and eigenvectors:

$$ C u_i = \lambda_i u_i $$

  where $\lambda_i$ is the i-th eigenvalue of C and $u_i$, the i-th column of U, is the i-th eigenvector.
• Eigenvectors are always orthogonal; U is an orthonormal matrix: $U U^T = U^T U = I$
• U defines the K-L transformation; the transformed features are given by the linear transformation

$$ y = U^{T} x $$

• The K-L transformation rotates the feature space into alignment with uncorrelated features
Example

$$ C = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad |\lambda I - C| = 0 \;\Rightarrow\; \lambda^2 - 3\lambda + 1 = 0 $$

$$ \lambda_1 = 2.618, \qquad \lambda_2 = 0.382 $$

For $\lambda_1$:

$$ \begin{pmatrix} 0.618 & -1 \\ -1 & 1.618 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0 $$

$$ u^{(1)} = [1 \;\; 0.618], \qquad u^{(2)} = [-1 \;\; 1.618] $$
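A quick numpy check of this example (a sketch; `numpy.linalg.eigh` normalizes the eigenvectors to unit length, so they are scaled versions of u(1) and u(2) above):

```python
import numpy as np

# Eigen-decomposition of the example covariance matrix.
C = np.array([[2.0, 1.0],
              [1.0, 1.0]])
eigvals, U = np.linalg.eigh(C)   # eigenvalues in ascending order, eigenvectors as columns
print(eigvals)                   # ≈ [0.382, 2.618]
print(U.T @ C @ U)               # ≈ diag(0.382, 2.618)
```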
PCA (Principal Components Analysis)
• The new features y are uncorrelated; their covariance matrix is the diagonal matrix Λ
• Each eigenvector $u_i$ is associated with the variance given by $\lambda_i$
• Uncorrelated features with higher variance (represented by $\lambda_i$) contain more information
• Idea: retain only the significant eigenvectors $u_i$
Dimension Reduction
• How many eigenvalues (and corresponding eigenvectors) to retain?
• Kaiser criterion
  - Discard eigenvectors whose eigenvalues are below 1
• Problems
  - Principal components are linear transformations of the original features
  - It is difficult to attach any semantic meaning to principal components
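A minimal numpy sketch of PCA-based dimension reduction that keeps the top-k principal components (random data used for illustration):

```python
import numpy as np

# Project the data onto the k eigenvectors with the largest eigenvalues.
def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # d x d covariance matrix
    eigvals, U = np.linalg.eigh(C)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    return Xc @ U[:, top]                      # transformed, reduced features

X = np.random.default_rng(1).normal(size=(200, 5))
print(pca_reduce(X, 2).shape)                  # (200, 2)
```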
Fourier Analysis
• It is always possible to analyze "complex" periodic waveforms into a set of sinusoidal waveforms
• Any periodic waveform can be approximated by adding together a number of sinusoidal waveforms
• Fourier analysis tells us what particular set of sinusoids go together to make up a particular complex waveform
Spectrum
• In the Fourier analysis of a complex waveform, the amplitude of each sinusoidal component depends on the shape of the particular complex wave
• Amplitude of a wave: the maximum or minimum deviation from the zero line
• T: duration of a period
• f = 1/T
Noise Reduction or Dimension Reduction
• It is difficult to identify the frequency components by looking at the original signal, so convert it to the frequency domain
• For dimension reduction: store only a fraction of the frequencies (those with high amplitude)
• For noise reduction:
  - remove high frequencies (fast changes): smoothing
  - remove low frequencies (slow changes): removes global trends
• Apply the inverse discrete Fourier transform to go back to the original domain
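A minimal numpy sketch of frequency-domain smoothing: transform, zero out the high-frequency coefficients, and apply the inverse DFT (the signal and the cut-off index are assumptions):

```python
import numpy as np

# Noise reduction: keep only low-frequency components of a noisy sine wave.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.default_rng(2).normal(size=t.size)

spectrum = np.fft.rfft(signal)                 # discrete Fourier transform (real input)
spectrum[20:] = 0                              # drop the high-frequency (fast-change) part
smoothed = np.fft.irfft(spectrum, n=t.size)    # inverse DFT back to the time domain
```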
Dimensionality Reduction: Wavelet Transformation
(Wavelet families: Haar-2, Daubechies-4)
• Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
• Method:
  - The length L must be an integer power of 2 (pad with 0's when necessary)
  - Each transform has 2 functions: smoothing and difference
  - The functions are applied to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
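A minimal sketch of one level of the Haar wavelet transform (the smoothing and difference functions); applying it recursively to the smoothed half gives the multi-resolution analysis:

```python
import math

# One level of the Haar DWT: pairwise smoothing (averages) and difference (details).
def haar_step(data):
    # the length must be even; pad with zeros beforehand if necessary
    pairs = list(zip(data[0::2], data[1::2]))
    smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return smooth, detail

x = [4, 8, 9, 15, 21, 21, 24, 25]      # length 8 = 2^3
s, d = haar_step(x)                    # recurse on `s` for further levels
print(s, d)
```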
Data Integration
• Data integration:
  - Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
• Entity identification problem:
  - Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions: e.g., weekday, month, quarter, year
• Generated hierarchy (highest to lowest level):
  - country: 15 distinct values
  - province_or_state: 365 distinct values
  - city: 3,567 distinct values
  - street: 674,339 distinct values
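A minimal Python sketch of this heuristic: sort the attributes by their number of distinct values to order the hierarchy levels:

```python
# Order attributes into a concept hierarchy by number of distinct values.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}
# fewest distinct values -> highest level of the hierarchy
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" -> ".join(hierarchy))   # country -> province_or_state -> city -> street
```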







• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation
• Mining Association Rules
  - Apriori Algorithm (Chapter 6, Han and Kamber)