Noise & Data Reduction







• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation
• Testing Hypotheses
Remember: Central Limit Theorem
• The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity.
• If the samples are drawn from a population with mean μ and standard deviation σ, then the mean of the sampling distribution is μ and its standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{N}$, which decreases as N increases.
• These statements hold irrespective of the shape of the original distribution.

Z Test

$$ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2} \quad \text{(population standard deviation)} $$
t Test

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{N}}, \qquad s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2} \quad \text{(sample standard deviation)} $$

• Used when the population standard deviation is unknown and the sample is small
• μ: population mean, $\bar{x}$: sample mean
p Values
• The probability of obtaining a particular sample statistic given the null hypothesis is called the p value.
• Commonly we reject H0 when this probability is low, say p < .05; the null hypothesis is then rejected but might still be true.
• We find the probabilities by looking them up in tables, or statistics packages provide them.
• By convention, one usually does not reject the null hypothesis unless p < 0.05 (statistically significant).
Example
• Five cars are parked; the mean price of the cars is 20,270 € and the sample standard deviation is 5,811 €.
• The mean price of cars in town (the population) is 12,000 €.
• H0 hypothesis: the parked cars are as expensive as the cars in town.

$$ t = \frac{20{,}270 - 12{,}000}{5{,}811 / \sqrt{5}} \approx 3.18 $$

• With N - 1 = 4 degrees of freedom, t = 3.18 corresponds to a probability below 0.025, so reject H0.
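A minimal Python sketch of this one-sample t test, computed directly from the summary statistics quoted above:

```python
import math

# One-sample t test from the slide's summary statistics.
x_bar = 20270.0   # sample mean (EUR)
mu0 = 12000.0     # hypothesized population mean (EUR)
s = 5811.0        # sample standard deviation (EUR)
n = 5             # sample size

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t_stat:.2f}, df = {n - 1}")   # t ≈ 3.18, df = 4
```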
Paired Sample t Test
• Given a set of paired observations (from two normal populations):

A    B    δ = A - B
x1   y1   x1 - y1
x2   y2   x2 - y2
x3   y3   x3 - y3
x4   y4   x4 - y4
x5   y5   x5 - y5

• Calculate the mean $\bar{x}_\delta$ and the standard deviation s of the differences δ.
• H0: μ_δ = 0 (no difference), or H0: μ_δ = k (the difference is a constant).

$$ t = \frac{\bar{x}_\delta - \mu_\delta}{\hat{\sigma}}, \qquad \hat{\sigma} = \frac{s}{\sqrt{N}} $$
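A minimal sketch of the paired test in Python; the before/after measurements are assumed, not from the slides:

```python
import math

# Paired-sample t test on assumed paired measurements A and B.
a = [12.0, 15.5, 11.2, 14.8, 13.1]
b = [10.9, 14.2, 11.5, 13.0, 12.4]

d = [x - y for x, y in zip(a, b)]        # differences delta = A - B
n = len(d)
mean_d = sum(d) / n
s = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))

t_stat = (mean_d - 0.0) / (s / math.sqrt(n))   # H0: mean difference is 0
print(f"t = {t_stat:.2f}, df = {n - 1}")
```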
Confidence Intervals (σ known)
• Standard error from the population standard deviation σ:

$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} $$

• The 95 percent confidence interval for a normal distribution is about the mean:

$$ \bar{x} \pm 1.96\,\sigma_{\bar{x}} $$

Confidence Interval (σ unknown)
• Standard error estimated from the sample standard deviation s:

$$ \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{N}} $$

• The 95 percent confidence interval for the t distribution ($t_{0.025}$ from a table) is:

$$ \bar{x} \pm t_{0.025}\,\hat{\sigma}_{\bar{x}} $$
Previous Example:
[Figure omitted]
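As an illustration of the interval above, a minimal Python sketch computing a 95% t-based confidence interval; the sample values and the use of SciPy for the critical value are assumptions, not part of the slides:

```python
import math
from scipy import stats   # assumed dependency for the t critical value

x = [20.1, 18.7, 22.3, 19.5, 21.0]           # assumed sample
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
se = s / math.sqrt(n)                         # estimated standard error

t_crit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95% -> t_{0.025}
print(f"95% CI: [{mean - t_crit * se:.2f}, {mean + t_crit * se:.2f}]")
```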
Overview: Data Transformation
• Reduce noise
• Reduce data
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
• Min-max normalization to [new_minA, new_maxA]:

$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 $$

• Z-score normalization (μ_A: mean, σ_A: standard deviation):

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

  Ex. Let μ = 54,000, σ = 16,000. Then

$$ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 $$

• Normalization by decimal scaling: divide by a power of 10 so that all scaled values fall in (-1, 1).
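A minimal Python sketch of the two normalizations, reproducing the income example above:

```python
# Min-max and z-score normalization applied to the income example.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(z_score(73_600, 54_000, 16_000))             # 1.225
```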
How to Handle Noisy Data?
(How to Reduce Features?)
• Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then one can smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  - smooth by fitting the data to regression functions
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
Data Reduction Strategies
• A data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction: remove unimportant attributes
  - Data compression
  - Numerosity reduction: fit data into models
  - Discretization and concept hierarchy generation
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method, but outliers may dominate the presentation
  - Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (the min and max of each bin are identified; each bin value is replaced by the closest boundary value):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
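A minimal Python sketch of equal-depth binning with smoothing by bin means, using the price data above:

```python
# Equal-depth binning and smoothing by bin means (price data from the slide).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

smoothed = []
for b in bins:
    mean = round(sum(b) / len(b))
    smoothed.extend([mean] * len(b))

print(smoothed)   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```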
Cluster Analysis
Regression
[Figure: data points fitted by the regression line y = x + 1]
Heuristic Feature Selection Methods
• There are 2^d - 1 possible feature subsets of d features
• Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests
  - Best step-wise feature selection (see the sketch after this list):
    • The best single feature is picked first
    • Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    • Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    • Use feature elimination and backtracking
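A minimal sketch of best step-wise (greedy forward) feature selection; the `score` function (e.g., cross-validated accuracy) and the feature names are assumptions for illustration:

```python
# Greedy forward selection: repeatedly add the feature that most improves the score.
def forward_selection(all_features, score, k):
    selected = []
    while len(selected) < k:
        best_f, best_s = None, float("-inf")
        for f in all_features:
            if f in selected:
                continue
            s = score(selected + [f])       # evaluate the candidate subset
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
    return selected
```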
Sampling: with or without Replacement
[Figure: drawing a sample from the raw data]
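A minimal Python sketch contrasting sampling without and with replacement (the data are assumed):

```python
import random

# Simple random sampling from assumed raw data.
data = list(range(100))

without_repl = random.sample(data, k=10)    # without replacement: no item appears twice
with_repl = random.choices(data, k=10)      # with replacement: items may repeat
print(without_repl, with_repl)
```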
From Covariance Matrix to PCA (Principal Component Analysis) and Dimension Reduction
Feature space
[Figure: original feature axes X1, X2 and rotated axes Y1, Y2]
• A sample of n patterns $x^{(1)}, x^{(2)}, \dots, x^{(k)}, \dots, x^{(n)}$, each a d-dimensional feature vector

$$ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} $$

• Distance between two feature vectors x and y:

$$ \|x - y\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} $$
Scaling
• A well-known scaling method consists of subtracting the mean and dividing by the standard deviation:

$$ y_i = \frac{x_i - m_i}{s_i} $$

  where $m_i$ is the sample mean and $s_i$ the sample standard deviation of feature i.
• According to the scaled metric, the squared norm of the scaled feature vector is

$$ \|y\|_s^2 = \sum_{i=1}^{n} \frac{(x_i - m_i)^2}{s_i^2} $$

  - shrinking large-variance features ($s_i > 1$)
  - stretching low-variance features ($s_i < 1$)
• Fails to preserve distances when a general linear transformation is applied!

Covariance
• Measures the tendency of two features $x_i$ and $x_j$ to vary in the same direction
• The covariance between features $x_i$ and $x_j$, estimated over n patterns:

$$ c_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{n - 1} $$

$$ C = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1d} \\ c_{21} & c_{22} & \dots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \dots & c_{dd} \end{pmatrix} $$
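A minimal numpy sketch for estimating the covariance matrix C from n patterns (rows) with d features (columns); the data are randomly generated for illustration:

```python
import numpy as np

# Estimate the d x d covariance matrix from an n x d data matrix.
X = np.random.default_rng(0).normal(size=(100, 3))   # assumed data: n = 100, d = 3
C = np.cov(X, rowvar=False)                          # uses the (n - 1) denominator, as above
print(C.shape)                                       # (3, 3)
```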
Correlation
• Covariances are symmetric: $c_{ij} = c_{ji}$
• Covariance is related to correlation:

$$ r_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{(n-1)\, s_i s_j} = \frac{c_{ij}}{s_i s_j} \in [-1, 1] $$
Karhunen-Loève Transformation
• The covariance matrix C (a d × d matrix) is symmetric and positive definite

$$ U^{T} C U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d), \qquad |\lambda I - C| = 0 $$

• There are d eigenvalues and eigenvectors:

$$ C u_i = \lambda_i u_i $$

  where $\lambda_i$ is the i-th eigenvalue of C and $u_i$, the i-th column of U, is the i-th eigenvector.
• Eigenvectors are always orthogonal; U is an orthonormal matrix: $U U^T = U^T U = I$
• U defines the K-L transformation; the transformed features are given by the linear transformation

$$ y = U^{T} x $$

• The K-L transformation rotates the feature space into alignment with uncorrelated features
Example

$$ C = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad |\lambda I - C| = 0 \;\Rightarrow\; \lambda^2 - 3\lambda + 1 = 0 $$

$$ \lambda_1 = 2.618, \qquad \lambda_2 = 0.382 $$

For $\lambda_1$:

$$ \begin{pmatrix} 0.618 & -1 \\ -1 & 1.618 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0 $$

$$ u^{(1)} = [1 \;\; 0.618], \qquad u^{(2)} = [-1 \;\; 1.618] $$
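A quick numpy check of this example (a sketch; `numpy.linalg.eigh` normalizes the eigenvectors to unit length, so they are scaled versions of u(1) and u(2) above):

```python
import numpy as np

# Eigen-decomposition of the example covariance matrix.
C = np.array([[2.0, 1.0],
              [1.0, 1.0]])
eigvals, U = np.linalg.eigh(C)   # eigenvalues in ascending order, eigenvectors as columns
print(eigvals)                   # ≈ [0.382, 2.618]
print(U.T @ C @ U)               # ≈ diag(0.382, 2.618)
```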
PCA (Principal Components Analysis)
• The new features y are uncorrelated; their covariance matrix is the diagonal matrix Λ
• Each eigenvector $u_i$ is associated with the variance given by $\lambda_i$
• Uncorrelated features with higher variance (represented by $\lambda_i$) contain more information
• Idea: retain only the significant eigenvectors $u_i$
Dimension Reduction
• How many eigenvalues (and corresponding eigenvectors) to retain?
• Kaiser criterion
  - Discard eigenvectors whose eigenvalues are below 1
• Problems
  - Principal components are linear transformations of the original features
  - It is difficult to attach any semantic meaning to principal components
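A minimal numpy sketch of PCA-based dimension reduction that keeps the top-k principal components (random data used for illustration):

```python
import numpy as np

# Project the data onto the k eigenvectors with the largest eigenvalues.
def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # d x d covariance matrix
    eigvals, U = np.linalg.eigh(C)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    return Xc @ U[:, top]                      # transformed, reduced features

X = np.random.default_rng(1).normal(size=(200, 5))
print(pca_reduce(X, 2).shape)                  # (200, 2)
```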
Fourier Analysis
• It is always possible to analyze "complex" periodic waveforms into a set of sinusoidal waveforms
• Any periodic waveform can be approximated by adding together a number of sinusoidal waveforms
• Fourier analysis tells us what particular set of sinusoids go together to make up a particular complex waveform
Spectrum
• In the Fourier analysis of a complex waveform, the amplitude of each sinusoidal component depends on the shape of the particular complex wave
• Amplitude of a wave: the maximum or minimum deviation from the zero line
• T: duration of a period
• f = 1/T
Noise Reduction or Dimension Reduction
• It is difficult to identify the frequency components by looking at the original signal, so convert it to the frequency domain
• For dimension reduction: store only a fraction of the frequencies (those with high amplitude)
• For noise reduction:
  - remove high frequencies (fast changes): smoothing
  - remove low frequencies (slow changes): removes global trends
• Apply the inverse discrete Fourier transform to go back to the original domain
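A minimal numpy sketch of frequency-domain smoothing: transform, zero out the high-frequency coefficients, and apply the inverse DFT (the signal and the cut-off index are assumptions):

```python
import numpy as np

# Noise reduction: keep only low-frequency components of a noisy sine wave.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.default_rng(2).normal(size=t.size)

spectrum = np.fft.rfft(signal)                 # discrete Fourier transform (real input)
spectrum[20:] = 0                              # drop the high-frequency (fast-change) part
smoothed = np.fft.irfft(spectrum, n=t.size)    # inverse DFT back to the time domain
```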
Dimensionality Reduction: Wavelet Transformation
(Wavelet families: Haar-2, Daubechies-4)
• Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
• Method:
  - The length L must be an integer power of 2 (pad with 0's when necessary)
  - Each transform has 2 functions: smoothing and difference
  - The functions are applied to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
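A minimal sketch of one level of the Haar wavelet transform (the smoothing and difference functions); applying it recursively to the smoothed half gives the multi-resolution analysis:

```python
import math

# One level of the Haar DWT: pairwise smoothing (averages) and difference (details).
def haar_step(data):
    # the length must be even; pad with zeros beforehand if necessary
    pairs = list(zip(data[0::2], data[1::2]))
    smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return smooth, detail

x = [4, 8, 9, 15, 21, 21, 24, 25]      # length 8 = 2^3
s, d = haar_step(x)                    # recurse on `s` for further levels
print(s, d)
```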
Data Integration
• Data integration:
  - Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
• Entity identification problem:
  - Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions: e.g., weekday, month, quarter, year
• Generated hierarchy (highest to lowest level):
  - country: 15 distinct values
  - province_or_state: 365 distinct values
  - city: 3,567 distinct values
  - street: 674,339 distinct values
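A minimal Python sketch of this heuristic: sort the attributes by their number of distinct values to order the hierarchy levels:

```python
# Order attributes into a concept hierarchy by number of distinct values.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}
# fewest distinct values -> highest level of the hierarchy
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" -> ".join(hierarchy))   # country -> province_or_state -> city -> street
```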







• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation
• Mining Association Rules
  - Apriori Algorithm (Chapter 6, Han and Kamber)