Statistics, Probability and Noise
Claudia Feregrino-Uribe & Alicia Morales-Reyes
Original material: Rene Cumplido
Autumn 2015, CCC-INAOE
Contents
Signal and graph terminology
Mean and standard deviation
Signal versus underlying process
The Histogram, Pmf and Pdf
The Normal Distribution
Digital noise generation
Precision and accuracy
"The Scientist and Engineer's Guide to Digital Signal
Processing, copyright ©1997-1998 by Steven W. Smith."
Introduction
Statistics and probability are used in DSP to characterize signals and the
processes that generate them.
For example, a primary use of DSP is to reduce interference, noise, and
other undesirable components in acquired data. These may be:
    An inherent part of the signal being measured
    Imperfections in the data acquisition system
    Introduced as an unavoidable byproduct of some DSP operation
Statistics and probability allow these disruptive features to be measured
and classified.
    This aids in developing strategies to remove the offending components.
Signal and Graph Terminology
The vertical axis may represent voltage, light intensity, sound pressure,
etc.
    Since we don't know what it represents in this particular case, we will label it:
    amplitude.
    This parameter also goes by several other names: the y-axis, the dependent
    variable, the range, and the ordinate.
Signal and Graph Terminology
The horizontal axis represents the other parameter of the signal, going by such names as:
the x-axis, the independent variable, the domain, and the abscissa.
Time is the most common parameter to appear on the horizontal axis of acquired signals.
Other parameters are used in specific applications.
    E.g., rock density at equally spaced distances along the surface of the earth.
In general, label the horizontal axis: sample number.
    If this were a continuous signal, the label would be, e.g., time, distance, x, etc.
Signal and Graph Terminology
The two parameters that form a signal are generally not interchangeable.
The parameter on the y-axis (the dependent variable) is said to be a
function of the parameter on the x-axis (the independent variable).
The independent variable describes how or when each sample is taken,
while the dependent variable is the actual measurement.
Given a specific value on the x-axis, we can find the corresponding value on
the y-axis, but usually not the other way around.
Signal and Graph Terminology
Domain is a very widely used term in DSP.
A signal that uses time as the independent variable is said to be in the time
domain.
Another common signal in DSP uses frequency as the independent variable,
resulting in the term frequency domain.
Signals that use distance as the independent parameter are said to be in the
spatial domain.
What if the x-axis is labeled with something generic, like sample number?
    Such signals are commonly referred to as being in the time domain.
Signal and Graph Terminology
Although the signals in previous figures are discrete, they are displayed in
this figure as continuous lines.
    There are too many samples to be distinguishable if they were displayed as
    individual markers.
In graphs that portray shorter signals (fewer than 100 samples), the individual
markers are usually shown.
Continuous lines may or may not be drawn to connect the markers.
    A continuous line could imply what is happening between samples, or
    simply be an aid to help the reader's eye follow a trend in noisy data.
Examine the labeling of the horizontal axis to find if you are working with a
discrete or continuous signal.
Signal and Graph Terminology
Sampling notation:
The variable, N, is widely used in DSP to represent the total
number of samples in a signal.
Each sample is assigned a sample number or index.
    These are the numbers that appear along the horizontal axis.
Two notations for assigning sample numbers are commonly
used.
    Sample indexes run from 1 to N (e.g., 1 to 512): common in mathematics
    Sample indexes run from 0 to N - 1 (e.g., 0 to 511): common in DSP
Mean
The mean, indicated by µ, is the statistician's jargon for the
average value of a signal.
    Add all of the samples together, and divide by N.
Mathematically:
$$\mu = \frac{1}{N} \sum_{i=0}^{N-1} x_i$$
Standard Deviation
The standard deviation is a measure of how far the signal fluctuates from
the mean.
The standard deviation is obtained by averaging the squares of the
differences between each sample and the mean.
The square root is taken to compensate for the initial squaring. In
equation form:
$$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} (x_i - \mu)^2}$$
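The mean and standard deviation formulas can be sketched in a few lines of Python. This is only an illustration: the function names and the eight sample values are made up, and the standard deviation divides by N - 1 as in the equation on the slide.

```python
import math

def mean(signal):
    """Average value: the sum of the samples divided by N."""
    return sum(signal) / len(signal)

def standard_deviation(signal):
    """Square root of the summed squared deviations divided by N - 1."""
    mu = mean(signal)
    squared_deviations = sum((x - mu) ** 2 for x in signal)
    return math.sqrt(squared_deviations / (len(signal) - 1))

# Hypothetical 8-sample signal, chosen only for illustration.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print("mean:", mean(samples))
print("standard deviation:", standard_deviation(samples))
```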
Common waveforms
Ratio of the peak-to-peak amplitude to the standard deviation for several
common waveforms
Running statistics
It is often desirable to recalculate the mean and standard
deviation as new samples are acquired and added to the signal.
This type of calculation is called running statistics.
Only three parameters need to be tracked:
    N, the total number of samples
    sum, the sum of these samples
    sum of squares, the sum of the squares of the samples
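A possible sketch of running statistics in Python, assuming the usual identity sigma^2 = (sum of squares - sum^2 / N) / (N - 1). The class and attribute names are hypothetical. Only the three tracked quantities are stored, so each new sample costs a constant amount of work.

```python
import math

class RunningStatistics:
    """Tracks N, the sum, and the sum of squares of the samples seen so far."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_squares = 0.0

    def add(self, x):
        """Fold one new sample into the three running quantities."""
        self.n += 1
        self.total += x
        self.total_squares += x * x

    def mean(self):
        return self.total / self.n

    def standard_deviation(self):
        # sigma^2 = (sum of squares - sum^2 / N) / (N - 1); needs N >= 2.
        variance = (self.total_squares - self.total ** 2 / self.n) / (self.n - 1)
        # Round-off can push a tiny variance slightly below zero.
        return math.sqrt(max(variance, 0.0))

stats = RunningStatistics()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up samples
    stats.add(sample)
print(stats.mean(), stats.standard_deviation())
```

Subtracting two large, nearly equal numbers can lose precision when the mean is much larger than the standard deviation, so this form is best suited to well-scaled data.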
Signal-to-Noise Ratio and Coefficient of Variation
In some situations, the mean describes what is being measured, while the
standard deviation represents noise and other interference.
    The standard deviation is not important in itself, but only in comparison to the
    mean.
This gives rise to the term signal-to-noise ratio (SNR):
    The mean divided by the standard deviation.
Another term used is the coefficient of variation (CV):
    The standard deviation divided by the mean, multiplied by 100 percent.
    E.g., a signal with a CV of 2% has an SNR of 50.
Better data means a higher value for the SNR and a lower value for the CV.
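The two definitions are reciprocal views of the same ratio, which a short Python sketch makes explicit. The function names and the example values (mean 100, standard deviation 2) are assumptions chosen to reproduce the CV of 2% and SNR of 50 quoted on the slide.

```python
def snr(mean, std_dev):
    """Signal-to-noise ratio: the mean divided by the standard deviation."""
    return mean / std_dev

def cv_percent(mean, std_dev):
    """Coefficient of variation: std. dev. divided by the mean, times 100%."""
    return std_dev / mean * 100.0

# Hypothetical measurement: mean of 100 with a standard deviation of 2.
print("SNR:", snr(100.0, 2.0))
print("CV (%):", cv_percent(100.0, 2.0))
```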
Signal versus Underlying Process
Statistics is the science of interpreting numerical data, such as
acquired signals.
Probability is used in DSP to understand the processes that
generate signals.
In DSP it is important to distinguish the acquired signal from
the underlying process.
Signal versus Underlying Process
E.g., a signal created by flipping a coin 1000 times.
    Heads -> one, tails -> zero
The process that created this signal has a mean of exactly 0.5.
    50% heads, 50% tails
The actual 1000-point signal will not necessarily have a mean of exactly 0.5.
    Random chance will make the number of ones and zeros slightly different
    each time the signal is generated.
The probabilities of the underlying process are constant, but the statistics
of the acquired signal change each time the experiment is repeated.
This random irregularity found in actual data is called by such names as:
statistical variation, statistical fluctuation, and statistical noise.
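The coin-flip experiment is easy to simulate. The Python sketch below repeats it five times, with heads as one and tails as zero. The function name and the seed are arbitrary choices made so the run is repeatable. The process mean is fixed at 0.5, while the mean of each 1000-point signal comes out slightly different.

```python
import random

def coin_flip_signal(n_flips, rng):
    """Heads -> 1, tails -> 0, each with probability 0.5."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n_flips)]

rng = random.Random(2015)  # arbitrary seed, used only for repeatability
for trial in range(5):
    signal = coin_flip_signal(1000, rng)
    print("trial", trial, "signal mean:", sum(signal) / len(signal))
```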
The Histogram, Pmf and Pdf
The histogram displays the number of samples there are in a signal that have
each of the possible values.
A histogram is represented by H_i:
    i is an index that runs from 0 to M-1
    M is the number of possible values each sample can take
    E.g., H_50 is the number of samples that have a value of 50.
The sum of all values in the histogram is equal to the number of points in the
signal.
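A minimal Python sketch of building a histogram, assuming the samples are integers from 0 to M-1. The function name and the eight sample values are made up for illustration.

```python
def histogram(signal, num_values):
    """Count how many samples take each of the values 0 .. num_values - 1."""
    counts = [0] * num_values
    for sample in signal:
        counts[sample] += 1
    return counts

signal = [3, 1, 3, 0, 2, 3, 1, 3]  # hypothetical samples with M = 4
h = histogram(signal, 4)
print(h)        # [1, 2, 1, 4]
print(sum(h))   # 8, the number of points in the signal
```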
The Histogram, Pmf and Pdf
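The mean and standard deviation can be calculated from the histogram:
    Mean:
    $$\mu = \frac{1}{N} \sum_{i=0}^{M-1} i\,H_i$$
    Standard deviation:
    $$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=0}^{M-1} (i - \mu)^2 H_i}$$
This is an efficient way to calculate the mean and std. dev. of very large
data sets.
    E.g., images
    The statistics are calculated per group of samples sharing the same value,
    rather than per individual sample.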
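As a sketch of why the histogram route is cheap, the Python below computes the mean and standard deviation from the M histogram bins instead of the N individual samples. It assumes sample values 0 to M-1 and the N - 1 divisor used earlier in the deck. The function name and the histogram are hypothetical.

```python
import math

def stats_from_histogram(h):
    """Mean and standard deviation from histogram counts h[0] .. h[M-1]."""
    n = sum(h)  # total number of samples
    mu = sum(i * count for i, count in enumerate(h)) / n
    variance = sum((i - mu) ** 2 * count for i, count in enumerate(h)) / (n - 1)
    return mu, math.sqrt(variance)

h = [1, 2, 1, 4]  # hypothetical histogram of an 8-sample signal
mu, sigma = stats_from_histogram(h)
print(mu, sigma)
```

The loops run M times regardless of N, which is what makes this attractive for an image with millions of pixels but only a few hundred possible values.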
Probability mass function (pmf)
Important: the acquired signal is a noisy version of the underlying process.
    The histogram is formed from an acquired signal and calculated using a
    finite number of samples.
    The corresponding curve for the underlying process is the
    probability mass function (pmf).
The pmf is what would be obtained with an infinite number of samples.
The pmf can be estimated (inferred) from the histogram, or it may be deduced
by some mathematical technique.
    Normalization: divide each histogram value by the total number of samples.
The pmf is important because it describes the probability that a certain value
will be generated.
    Discrete data only
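Estimating the pmf from a histogram amounts to a single normalization step, sketched here in Python. The function name and the histogram values are illustrative only, and the result is an estimate of the pmf, not the pmf itself.

```python
def estimate_pmf(h):
    """Divide each histogram count by the total number of samples."""
    n = sum(h)
    return [count / n for count in h]

h = [1, 2, 1, 4]            # hypothetical histogram of an 8-sample signal
pmf = estimate_pmf(h)
print(pmf)                  # [0.125, 0.25, 0.125, 0.5]
print(sum(pmf))             # 1.0
```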
Probability density function (pdf)
The probability density function (pdf), also called the probability
distribution function, plays the same role for continuous signals.
The pdf's vertical axis is in units of probability density, rather than just
probability.
    E.g., a pdf value of 0.03 at 120.5 does not mean that a voltage of
    120.5 millivolts will occur 3% of the time.
    The probability of exactly 120.5 millivolts occurring is vanishingly small,
    because there is an infinite number of values the signal can take around it:
    120.49997, 120.49998, 120.49999, etc.
How can we calculate a probability?
    Multiply the probability density by a range of values.
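A worked example may help here. A probability is obtained by integrating the density over a range of values. Assuming the density stays close to 0.03 per millivolt over a narrow interval (an assumption made for illustration), the integral reduces to density times width:

```latex
\[
P(a \le x \le b) = \int_a^b p(x)\,dx
\]
\[
P(120.4 \le x \le 120.5) \approx 0.03 \times (120.5 - 120.4) = 0.003
\]
```

So the signal would spend roughly 0.3% of its time between 120.4 and 120.5 millivolts, even though the probability of any single exact value is essentially zero.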
To remember
The histogram, pmf, and pdf are very similar concepts. Try not
to confuse them.
The total area under the pdf curve, the integral from -∞ to
+∞, will always be equal to one.
The sum of all of the pmf values is equal to one.
The sum of all of the histogram values is equal to N.
Examples of continuous waveforms and their pdfs
The Normal Distribution
Signals formed from random processes usually have a bell-shaped
pdf.
This is called a normal distribution, a Gauss distribution, or a
Gaussian,
    after the German mathematician Karl F. Gauss (1777-1855).
The basic shape of the curve can be generated by:
$$y(x) = e^{-x^2}$$
The Normal Distribution
This raw curve can be converted into the complete Gaussian
by adding an adjustable mean, µ, and standard deviation, σ.
The equation must be normalized so that the total area under
the curve is equal to one.
General form of the normal distribution:
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$
The Normal Distribution
Cumulative distribution function (cdf)
Integrating the pdf gives the probability that a signal will
be within a certain range of values.
The integral of the pdf is called the cumulative distribution function (cdf),
Φ(x).
The Gaussian's integral has no simple closed form, so it is calculated by
numerical integration:
    The continuous Gaussian curve is sampled very finely, from
    -10σ to +10σ.
    The samples of this discrete signal are added to simulate integration.
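The numerical integration described above can be sketched in Python for the standard normal distribution (mean 0, standard deviation 1). The step size, the midpoint sampling, and the function names are choices made for this illustration. The result is compared against the error function from the standard library as a sanity check.

```python
import math

def gaussian_pdf(x):
    """Normal pdf with mean 0 and standard deviation 1."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def cdf_numerical(x, step=0.001, lower=-10.0):
    """Approximate Phi(x) by summing finely spaced pdf samples from -10 sigma."""
    n_steps = int(round((x - lower) / step))
    total = 0.0
    for k in range(n_steps):
        midpoint = lower + (k + 0.5) * step  # sample at the middle of each slice
        total += gaussian_pdf(midpoint) * step
    return total

def cdf_reference(x):
    """Phi(x) written in terms of the error function, for comparison."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, cdf_numerical(x), cdf_reference(x))
```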
Cumulative distribution function (cdf)
Φ(x), the cumulative distribution function of the normal distribution
(mean = 0, standard deviation = 1).
Digital noise generation
Random noise is an important topic in both electronics and
DSP.
    For example, it limits how small a signal an instrument can
    measure, the distance a radio system can communicate, and
    how much radiation is required to produce an x-ray image.
A common need in DSP is to generate signals that resemble
various types of random noise.
    This is required to test the performance of algorithms that must work in
    the presence of noise.
Random number generator
The heart of digital noise generation is the random number generator.
Most programming languages have this as a standard function.
Each random number has a value between zero and one, with an equal
probability of being anywhere between these two extremes.
    The mean of the underlying process that generated this signal is 0.5,
    the standard deviation is 1/√12 = 0.29, and
    the distribution is uniform between zero and one.
Digital noise generation (Gaussian)
There are two methods for generating normally distributed noise using a
random number generator.
First, consider a signal obtained by adding two random numbers to form each
sample, i.e., X = RND + RND.
    Since each of the random numbers can run from zero to one, the sum
    can run from zero to two.
    The mean is now one, and the standard deviation is 1/√6.
    The pdf has changed from a uniform distribution to a triangular
    distribution.
    The signal spends more of its time around a value of one, with less time
    spent near zero or two.
Digital noise generation (Gaussian) 2
Digital noise generation (Gaussian) (3)
Taking this idea a step further, twelve random numbers are added to produce
each sample.
    The mean is now six.
    The standard deviation is one.
    The pdf has virtually become a Gaussian.
This procedure can be used to create a normally distributed noise signal
with an arbitrary mean and standard deviation.
For each sample in the signal:
    1) add twelve random numbers
    2) subtract six to make the mean equal to zero
    3) multiply by the standard deviation desired
    4) add the desired mean
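The four steps above can be sketched in Python as follows. The function name, the seed, and the target mean and standard deviation are assumptions for illustration, and `random.Random.random()` stands in for the RND function. The output is only approximately Gaussian: no sample can fall more than six standard deviations from the mean.

```python
import random
import statistics

def gaussian_noise_sum12(n_samples, mean, std_dev, rng):
    """Approximately normal noise built from sums of twelve uniform numbers."""
    signal = []
    for _ in range(n_samples):
        x = sum(rng.random() for _ in range(12))  # 1) add twelve random numbers
        x -= 6.0                                  # 2) subtract six: mean is zero
        x *= std_dev                              # 3) scale to the desired std. dev.
        x += mean                                 # 4) shift to the desired mean
        signal.append(x)
    return signal

rng = random.Random(0)  # arbitrary seed for repeatability
noise = gaussian_noise_sum12(20000, mean=3.0, std_dev=2.0, rng=rng)
print(statistics.mean(noise), statistics.stdev(noise))
```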
Digital noise generation (Gaussian) (4)
Central Limit Theorem
The mathematical basis for this algorithm is contained in the Central Limit
Theorem, one of the most important concepts in probability.
In its simplest form, the Central Limit Theorem states that a sum of random
numbers becomes normally distributed as more and more of the random
numbers are added together.
The Central Limit Theorem does not require the individual random
numbers to be from any particular distribution,
    or even that the random numbers be from the same distribution.
The Central Limit Theorem provides the reason why normally distributed
signals are seen so widely in nature.
    Whenever many different random forces are interacting, the resulting pdf
    approaches a Gaussian.
Digital noise generation – 2nd method
A random number generator is invoked twice, to obtain R1 and R2.
A normally distributed random number, X, can then be found from:
$$X = (-2 \ln R_1)^{1/2} \cos(2\pi R_2)$$
Just as before, this approach can generate normally distributed random
signals with an arbitrary mean and standard deviation.
    Take each number generated by this equation, multiply it by the desired
    standard deviation, and add the desired mean.
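A minimal Python sketch of this second method, assuming the equation meant here is the standard Box-Muller form X = sqrt(-2 ln R1) cos(2π R2). The function name and the seed are hypothetical. R1 is drawn from (0, 1] rather than [0, 1) so that the logarithm is never taken of zero.

```python
import math
import random
import statistics

def gaussian_noise_two_uniforms(n_samples, mean, std_dev, rng):
    """Normally distributed noise from pairs of uniform random numbers."""
    signal = []
    for _ in range(n_samples):
        r1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        r2 = rng.random()
        x = math.sqrt(-2.0 * math.log(r1)) * math.cos(2.0 * math.pi * r2)
        signal.append(x * std_dev + mean)  # scale, then shift
    return signal

rng = random.Random(0)  # arbitrary seed for repeatability
noise = gaussian_noise_two_uniforms(20000, mean=3.0, std_dev=2.0, rng=rng)
print(statistics.mean(noise), statistics.stdev(noise))
```

Unlike the sum of twelve uniform numbers, this construction has no hard limit at six standard deviations, at the cost of a logarithm, a square root, and a cosine per sample.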
Random number generators
Random number generators operate by starting with a seed, a number
between zero and one.
When the random number generator is invoked, the seed is passed
through a fixed algorithm, resulting in a new number between zero and
one.
    This new number is reported as the random number.
    It is then internally stored to be used as the seed the next time the random
    number generator is called.
The algorithm that transforms the seed into the new random number is
often of the form:
$$R = (aS + b) \bmod c$$
where S is the seed, R is the new random number, and a, b, and c are
appropriately chosen constants.
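One common algorithm of this kind is the linear congruential generator, R = (aS + b) mod c. The Python sketch below is an illustration under stated assumptions: the class name is hypothetical, and the constants are the widely published Numerical Recipes values rather than anything given in the slides. It keeps the state as an integer and scales it into the range zero to one when reporting it.

```python
class LinearCongruentialGenerator:
    """Toy generator: new_state = (A * state + B) mod C, reported as state / C."""

    # Assumed constants (the widely published Numerical Recipes values).
    A = 1664525
    B = 1013904223
    C = 2 ** 32

    def __init__(self, seed=1):
        self.state = seed % self.C

    def random(self):
        """Advance the stored seed and return it scaled into [0, 1)."""
        self.state = (self.A * self.state + self.B) % self.C
        return self.state / self.C

gen = LinearCongruentialGenerator(seed=42)
print([gen.random() for _ in range(3)])
```

Because the algorithm is fixed, the same seed always reproduces the same sequence, which is why such numbers are called pseudo-random. A generator this simple is fine for illustration but not for cryptographic use.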
Precision and Accuracy
Precision and accuracy are terms used to describe systems and methods
that measure, estimate, or predict.
In all these cases, we wish to know the value of some parameter.
    This is called the true value, or simply, truth.
The method provides a measured value, which you want to be as close to the
true value as possible.
Precision and accuracy are ways of describing the error that can exist
between these two values.
Precision and Accuracy
Consider an oceanographer measuring water depth using a sonar system.
The mean occurs at the center of the distribution.
    It is the best estimate of the depth based on all of the measured data.
The standard deviation defines the distribution's width.
    It describes how much variation occurs between successive measurements.
Good accuracy, poor precision
    Poor repeatability
Precision and Accuracy
Precision is a measure of random noise.
Accuracy is a measure of calibration.
Poor accuracy results from systematic errors.
    Bad calibration
    E.g., converting time to distance: how?
When deciding which name to call the problem, ask yourself two questions.
    First: Will averaging successive readings provide a better measurement?
        If yes, call the error precision.
        If no, call it accuracy.
    Second: Will calibration correct the error?
        If yes, call it accuracy.
        If no, call it precision.
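The two questions can be explored with a small simulation. The Python sketch below models sonar readings as the true depth plus a fixed calibration offset plus Gaussian noise. All of the numbers and names are hypothetical. Averaging many readings shrinks the random part of the error but leaves the offset untouched.

```python
import random
import statistics

def simulate_depth_readings(true_depth, bias, noise_std, n_readings, rng):
    """Each reading = truth + systematic offset (accuracy) + random noise (precision)."""
    return [true_depth + bias + rng.gauss(0.0, noise_std)
            for _ in range(n_readings)]

rng = random.Random(7)   # arbitrary seed for repeatability
true_depth = 1000.0      # hypothetical depth in meters
readings = simulate_depth_readings(true_depth, bias=5.0, noise_std=20.0,
                                   n_readings=10000, rng=rng)

# Averaging removes most of the random error, so what is left is the bias.
print("error of the average:", statistics.mean(readings) - true_depth)
# The spread of individual readings reflects the precision.
print("std. dev. of readings:", statistics.stdev(readings))
```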