ESS 265 Spring Quarter 2009
Data Formats (CDF, ASCII, FLATFILES)
Error Analysis
Probability Distributions
Binning and Histograms
Examples using Kp and Dst indices
1
Lecture 1
March 30, 2009
Formats of Data Files
• Time series data are stored in a variety of formats. These include:
  – ASCII (American Standard Code for Information Interchange) and binary tables – the most common forms of data format for time series data. Data in other formats are frequently converted to tables.
  – Common Data Format (CDF) – developed by the National Space Science Data Center (NSSDC) for all types of data. Used for the International Solar Terrestrial Physics program. Requires NSSDC-provided software.
  – Flexible Image Transport System (FITS) – the only format allowed by the astronomy community, used heavily for images. Has been tried for time series data without success.
  – Hierarchical Data Format (HDF) – developed by the National Partnership for Advanced Computational Infrastructure (NPACI). Frequently used for results from simulations. NPACI provides software.
  – Standard Format Data Units (SFDU) – used by all space-faring nations to label raw telemetry data. An international standard that is rarely used for processed scientific data.
• Binary storage is the most common form of data compression, with a savings of about a factor of 3 over ASCII data.
• zip and gzip lossless compression are often used on ASCII data for fast transfer.
2
Tables, Flat Files and Relations
• Tables are the simplest way to represent time series data.
• A table is defined as "a compact arrangement of related facts, figures, values in an orderly sequence usually in rows and columns" – McPherron.
• If all records in a file are identical and are simply a series of rows in a table, the file is called a flat file.
• In some formats the file may have a variable sequence of records of different types, in which one must read each record in sequence to determine what records are coming next. Such files can be exceedingly difficult to read.
• The dependent variable y is shown as a function of several independent variables x1, x2, …, xm. If y is the Dst index and x1 is time then the table would contain a time series. A flat file is also called a relation. The table displays information about the connection between the various quantities contained in the table. A column of a table is normally a sequence of samples of a single variable. In contrast, a row of the table is called a tuple, a set of simultaneous measurements of a set of variables. Tuple is an abstraction of the sequence: single, double, triple, quadruple, quintuple, …, n-tuple. A complex number is a 2-tuple, or pair; a quaternion is a 4-tuple, or quadruple. Note that a time series is a specific type of table or relation in which the order of values is important.
3
A Relation
• Assume n sets of observations of a dependent variable y which is a function of m independent variables x1, x2, …, xm.
• The relation can be represented by a flat file.
• Each column is a variable and each row is a tuple.
• Model the relation with a regression equation that combines the m variables.
4
Tables and Metadata
• The simplest way to store a table in a computer is as an ASCII file containing a sequence of identical records. Such files are easy to read since every record has the same format. They are also simple to view since they may be opened and edited in any text editor. A more compact version of the same flat file would be in binary format. While such files are still flat, they cannot be viewed or edited without first converting to ASCII format.
• Time is usually represented in seconds or milliseconds since a certain date. UCLA practice has time in seconds since 1966-Jan-01, ignoring leap seconds. IDL time is in seconds since 1970, which is the same as UNIX time.
• One must know the format of the data record. This includes the number of columns, the widths of the columns, how the values are represented, the names of the columns, the units of the variables, etc. Such data about the data are called metadata.
• Binary data tables also are used.
  – Much data at UCLA is in the form of binary flat files.
  – Lower flat files contain embedded metadata (header information).
    • Lower.ffd contains binary data; Lower.ffh contains ASCII headers.
  – Upper flat files are completely flat, with detached metadata.
    • Upper.DAT contains binary data, Upper.DES contains a description of the data, Upper.ABS contains an abstract of the data, and Upper.HED contains header information.
5
UCLA Lower Flatfile Header (Metadata) Example

DATA  = SDT.Export.BZGSE.UnNamed.ffh
CDATE = Wed Jun 5 10:48:14 1996
RECL  = 12          (12 bytes per record)
NCOLS = 2           (2 columns)
NROWS = 3826        (3826 rows)
OPSYS = SUN/UNIX
#    NAME   UNITS  SOURCE          TYPE  LOC
001  UT     SECS   UNIVERSAL TIME  T     0
002  BZGSE  nT     BMag_Angles     R     8
#########
SDT EXPORT FLAT FILE ABSTRACT
FileName: SDT.Export.BZGSE.UnNamed
Format: UCLA Flatfile
Date/Time: Wed Jun 5 10:48:14 1996
SDT Version: 2.3
Comment: test_comment
#########
Name: BMag_Angles
Time: 1995/10/18/00:00:00
Points: 3826
Components: 7
Component Depths: 1 1 1 1 1 1 1
#########
FLAT FILE MAKER: SDT Export Flatfile
INPUT FROM: Geotail Minute Survey
6
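The header above is all a reader needs to decode the companion binary .ffd file. Below is a minimal sketch of such a reader in Python; it assumes, based on RECL = 12 and the TYPE/LOC columns, that each record is an 8-byte time (type T at offset 0) followed by a 4-byte float (type R at offset 8), and that the byte order is big-endian as on the SUN/UNIX system that wrote the file. The actual UCLA flat file library may differ in detail.

```python
import struct

def read_lower_flatfile(path, nrows=3826, recl=12):
    """Sketch of a reader for a UCLA 'lower' binary flat file.
    Assumes each record is an 8-byte double time (T, offset 0) plus a
    4-byte float value (R, offset 8), big-endian; these layout details
    are inferred from the header above, not from the UCLA format spec."""
    times, values = [], []
    with open(path, "rb") as f:
        for _ in range(nrows):
            record = f.read(recl)
            if len(record) < recl:
                break                      # short read: stop at end of file
            t, bz = struct.unpack(">df", record)
            times.append(t)                # seconds (UCLA epoch 1966-Jan-01)
            values.append(bz)              # BZGSE in nT
    return times, values

# Hypothetical usage (data file name based on the header's DATA entry):
# times, bzgse = read_lower_flatfile("SDT.Export.BZGSE.UnNamed.ffd")
```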
An ASCII Flat File – Galileo Magnetometer Data during the G8 Flyby

1997-05-07T15:36:55.133  -8.36  -25.04  -85.24  89.23  -1.57  -3.68  0.65
1997-05-07T15:36:55.467  -8.38  -25.16  -85.22  89.25  -1.57  -3.68  0.65
1997-05-07T15:36:55.800  -8.41  -25.09  -85.24  89.25  -1.57  -3.67  0.65
1997-05-07T15:36:56.133  -8.44  -25.08  -85.27  89.28  -1.57  -3.67  0.65
1997-05-07T15:36:56.467  -8.50  -25.16  -85.19  89.23  -1.57  -3.67  0.65
1997-05-07T15:36:56.800  -8.47  -25.18  -85.18  89.23  -1.57  -3.67  0.65
1997-05-07T15:36:57.133  -8.60  -25.20  -85.18  89.25  -1.57  -3.67  0.65
1997-05-07T15:36:57.467  -8.47  -25.04  -85.12  89.13  -1.57  -3.67  0.65
1997-05-07T15:36:57.800  -8.44  -25.04  -85.17  89.17  -1.57  -3.67  0.65
1997-05-07T15:36:58.133  -8.41  -25.30  -85.06  89.14  -1.57  -3.67  0.65
1997-05-07T15:36:58.467  -8.39  -25.27  -85.00  89.08  -1.57  -3.67  0.65
1997-05-07T15:36:58.800  -8.37  -25.01  -85.09  89.09  -1.57  -3.66  0.65
1997-05-07T15:36:59.133  -8.37  -24.98  -85.12  89.11  -1.57  -3.66  0.65
1997-05-07T15:36:59.467  -8.35  -24.93  -85.24  89.20  -1.57  -3.66  0.65
1997-05-07T15:36:59.800  -8.36  -24.71  -85.26  89.16  -1.57  -3.66  0.65
1997-05-07T15:37:00.133  -8.31  -24.78  -85.30  89.21  -1.57  -3.66  0.65
7
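A file like this can be read with a few lines of Python. The sketch below assumes only what the listing shows: an ISO 8601 time tag followed by whitespace-separated numeric columns (the slide does not label them, though the fourth numeric column is consistent with the field magnitude).

```python
from datetime import datetime

def parse_ascii_flatfile(path):
    """Parse a whitespace-delimited ASCII flat file whose first field is an
    ISO 8601 time tag and whose remaining fields are numeric columns."""
    times, rows = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue                   # skip blank lines
            times.append(datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%f"))
            rows.append([float(v) for v in fields[1:]])
    return times, rows

# Hypothetical usage:
# times, data = parse_ascii_flatfile("galileo_g8_mag.dat")
```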
ERROR ANALYSIS: Some Nomenclature
• Systematic errors – reproducible errors that result from calibration errors or bias on the part of the observer. Sometimes data can be corrected for these errors, but in other cases we must estimate them and combine them with errors from statistical fluctuations.
• Accuracy – otherwise called "absolute accuracy", a measure of how close an observation comes to the true value; how well we compensate for systematic errors. E.g., magnetometer accuracy is how far the measurement is from the absolute value of the B-field in nT, and is of order 1 nT for fluxgates (including long-term drifts) and 0.01 nT for vector helium magnetometers. Relative inter-spacecraft accuracy is the systematic difference in measurement between two nearby spacecraft.
• Precision – a measure of how well a result was determined, how reproducible it is; how well we overcome random errors.
• Uncertainty – refers to the difference between a result and the true value. Often we don't know the "true" value, so we must estimate the error. Repeated measurements of the same thing will differ, and we can only talk about the discrepancy between these measurements – this is the uncertainty.
• Probable error – a measure of the magnitude of the error we estimate. For two identical measurements it is a measure of the probable discrepancy.
8
ERROR ANALYSIS: Some Definitions
• Parent population – the set of data points from which the experimental data are assumed to be a random sample.
• Parent distribution – the probability distribution P(x) determining the choice of sample data from the parent population. Usually normalized to 1.
• Expectation value:
  $\langle f(x) \rangle = \lim_{N \to \infty} \left[ \frac{1}{N} \sum_i f(x_i) \right] = \sum_j f(x_j) P(x_j)$
• Median – $m_{1/2}$ is defined such that $P(x_i \le m_{1/2}) = P(x_i \ge m_{1/2}) = 1/2$.
• Most probable value – $m_{max}$ is defined such that $P(m_{max}) \ge P(x \ne m_{max})$.
• Mean – $m = \langle x \rangle$.
• Average deviation – $\alpha = \langle |x_i - m| \rangle$.
• Variance – $\sigma^2 = \langle (x_i - m)^2 \rangle = \langle x^2 \rangle - m^2$.
• Standard deviation – $\sigma = \sqrt{\sigma^2}$.
• Sample mean – $\bar{x} = \frac{1}{N} \sum_i x_i$.
• Sample variance – the best estimate of the parent variance: $s^2 = \frac{1}{N-1} \sum_i (x_i - \bar{x})^2$.
9
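A minimal sketch of these sample statistics in Python, using a hypothetical set of repeated measurements; the standard library's statistics module uses the same N−1 denominator for the sample variance as the definition above.

```python
import statistics

# Hypothetical repeated measurements of the same quantity
x = [9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2]

xbar = statistics.mean(x)       # sample mean: (1/N) sum of x_i
s2 = statistics.variance(x)     # sample variance: sum of (x_i - xbar)^2 / (N-1)
s = statistics.stdev(x)         # sample standard deviation: sqrt(s2)
avg_dev = sum(abs(xi - xbar) for xi in x) / len(x)   # average deviation

print(xbar, s2, s, avg_dev)
```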
ERROR ANALYSIS: Useful Probability Distributions: The Binomial Distribution
• Measures the probability of observing x successes in n tries when the probability of success in each try is p (not to be confused with the bimodal distribution):
  $P_B(x; n, p) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}$
• The mean is given by
  $m = \langle x \rangle = \sum_{x=0}^{n} x \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} = np$
• For a binomial distribution the average number of successes approaches the mean value given by the product of the probability of success in each try times the number of tries.
• The variance is given by
  $\sigma^2 = \langle (x-m)^2 \rangle = \sum_{x=0}^{n} (x-m)^2 \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} = np(1-p)$
• For the case of a coin toss p = 1/2, the distribution is symmetric about the mean, and the median and most probable value are equal to the mean. The variance is half of the mean.
• In probability theory a random variable x has a binomial distribution B(n, p), where n is the number of tries. B(n, p) can be approximated by the normal distribution when n is large, and it converges towards the Poisson distribution when the number of trials n goes to infinity while the product m = np remains fixed.
10
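A quick numerical check of the mean and variance formulas, as a sketch in plain Python (math.comb requires Python 3.8+); the parameters n = 10, p = 0.5 are just an illustrative coin-toss example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P_B(x; n, p) = C(n, x) p^x (1-p)^(n-x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5                       # ten coin tosses
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean)**2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(mean, n * p)                   # both 5.0
print(var, n * p * (1 - p))          # both 2.5 (half the mean, as stated above)
```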
ERROR ANALYSIS: Useful Probability Distributions: The Poisson Distribution
• A Poisson distribution occurs when p << 1 and m = np is constant.
• It frequently is useful for counting experiments such as particle detectors.
• It describes the probability of observing x events per unit time out of n possible events, each of which has a probability p of occurring:
  $P_P(x; m) = \lim_{p \to 0} P_B(x; n, p) = \frac{m^x}{x!} e^{-m}$
• The mean of the Poisson distribution must be the parameter m in the above equation:
  $\langle x \rangle = \sum_{x=0}^{\infty} x \frac{m^x}{x!} e^{-m} = m e^{-m} \sum_{x=1}^{\infty} \frac{m^{x-1}}{(x-1)!} = m e^{-m} \sum_{y=0}^{\infty} \frac{m^y}{y!} = m$
• The variance is
  $\sigma^2 = \langle (x-m)^2 \rangle = \sum_{x=0}^{\infty} (x-m)^2 \frac{m^x}{x!} e^{-m} = m$
• The standard deviation is the square root of the mean.
11
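The same kind of numerical check works for the Poisson case. This sketch truncates the infinite sums at x = 50, which is more than enough for the illustrative value m = 3 used here.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P_P(x; m) = (m^x / x!) e^(-m)"""
    return m**x / factorial(x) * exp(-m)

m = 3.0
xs = range(50)                       # truncated sum; the tail is negligible for m = 3
mean = sum(x * poisson_pmf(x, m) for x in xs)
var = sum((x - mean)**2 * poisson_pmf(x, m) for x in xs)

print(mean, var)                     # both ~ 3.0: the variance equals the mean
```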
ERROR ANALYSIS: Useful Probability Distributions: The Gaussian Distribution
• The Gaussian distribution results from the case where the number of possible different observations n is infinitely large while the probability of success remains large enough that np >> 1.
• It works for many physical systems. It is also called the normal distribution:
  $P_G(x; m, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x-m}{\sigma} \right)^2 \right]$
• The Gaussian distribution is a continuous function describing the probability that a random observation x will occur from a parent distribution with mean m and standard deviation σ.
• The density is defined so that the probability $dP_G(x; m, \sigma)$ that a random observation will fall in an interval dx about x is $dP_G(x; m, \sigma) = P_G(x; m, \sigma)\, dx$.
• The width of a Gaussian is usually expressed as the full width at half maximum (FWHM); it is given by 2.354σ.
• The probable error (P.E.) is defined so that half the observations of an experiment are expected to fall within m ± P.E. (the probability that a deviation is smaller than the P.E. is 1/2). P.E. = 0.6745σ.
12
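Both constants can be checked numerically. A sketch, assuming SciPy is available: the FWHM follows from solving exp(−x²/2σ²) = 1/2, and the probable error is the 75th percentile of the standard normal.

```python
from math import sqrt, log
from scipy.stats import norm

sigma = 1.0
fwhm = 2.0 * sqrt(2.0 * log(2.0)) * sigma   # 2.3548...sigma: full width at half maximum
pe = norm.ppf(0.75) * sigma                 # 0.6745 sigma: half of all samples fall in m +/- P.E.

print(fwhm, pe)
```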
Propagation of Errors
• In general we do not know the actual errors in the determinations of parameters. Instead we use some estimate (e.g. σ) of the error in each parameter.
• Assume that x = f(u, v, …) and that $\bar{x} = f(\bar{u}, \bar{v}, \ldots)$.
• The uncertainty in x can be found by considering the spread in the $x_i$ resulting from the spread in the individual measurements $u_i, v_i, \ldots$.
• The variance is given by
  $\sigma_x^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i (x_i - \bar{x})^2$
• Expand $x_i - \bar{x}$ to first order:
  $x_i - \bar{x} \approx (u_i - \bar{u}) \left( \frac{\partial x}{\partial u} \right) + (v_i - \bar{v}) \left( \frac{\partial x}{\partial v} \right) + \ldots$
  so that
  $\sigma_x^2 \approx \lim_{N \to \infty} \frac{1}{N} \sum_i \left[ (u_i - \bar{u})^2 \left( \frac{\partial x}{\partial u} \right)^2 + (v_i - \bar{v})^2 \left( \frac{\partial x}{\partial v} \right)^2 + 2 (u_i - \bar{u})(v_i - \bar{v}) \left( \frac{\partial x}{\partial u} \right) \left( \frac{\partial x}{\partial v} \right) + \ldots \right]$
• The first two terms can be expressed in terms of the variances $\sigma_u^2$ and $\sigma_v^2$.
• The third term is related to the covariance:
  $\sigma_{uv}^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i (u_i - \bar{u})(v_i - \bar{v})$
14
Propagation of Errors 2
• The variance of x is given by
  $\sigma_x^2 = \sigma_u^2 \left( \frac{\partial x}{\partial u} \right)^2 + \sigma_v^2 \left( \frac{\partial x}{\partial v} \right)^2 + 2 \sigma_{uv}^2 \left( \frac{\partial x}{\partial u} \right) \left( \frac{\partial x}{\partial v} \right) + \ldots$
• If u and v are uncorrelated then $\sigma_{uv}^2 = 0$.
• Specific combinations:
  $x = au + bv: \quad \sigma_x^2 = a^2 \sigma_u^2 + b^2 \sigma_v^2 + 2ab\, \sigma_{uv}^2$
  $x = \pm auv: \quad \frac{\sigma_x^2}{x^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} + 2 \frac{\sigma_{uv}^2}{uv}$
  $x = \pm \frac{au}{v}: \quad \frac{\sigma_x^2}{x^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} - 2 \frac{\sigma_{uv}^2}{uv}$
  $x = a u^{\pm b}: \quad \frac{\sigma_x}{x} = \pm b \frac{\sigma_u}{u}$
  $x = a e^{\pm bu}: \quad \frac{\sigma_x}{x} = \pm b\, \sigma_u$
  $x = a \ln(\pm bu): \quad \sigma_x = a \frac{\sigma_u}{u}$
15
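These closed-form rules are easy to verify by Monte Carlo. A sketch for the linear case x = au + bv with uncorrelated Gaussian u and v; all parameter values are arbitrary illustrations.

```python
import random

a, b = 2.0, -1.0                     # hypothetical coefficients
sigma_u, sigma_v = 0.3, 0.5          # hypothetical measurement errors
N = 200_000

xs = [a * random.gauss(10.0, sigma_u) + b * random.gauss(5.0, sigma_v)
      for _ in range(N)]

mean_x = sum(xs) / N
var_x = sum((x - mean_x) ** 2 for x in xs) / N

print(var_x)                                   # ~ 0.61, within sampling noise
print(a**2 * sigma_u**2 + b**2 * sigma_v**2)   # 0.61 from the propagation formula
```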
DISCRETE DISTRIBUTIONS: Measures of Central Tendency: Mean, Median and Mode
• There are several common quantitative measures of the tendency for a variable to cluster around a central value, including the mean, median, and mode.
  – The mean of a set of Ntot observations of a discrete variable $x_i$ is defined as
    $\bar{x} = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} x_i$
  – The median of a probability distribution function (pdf) p(x) is the value $x_{med}$ for which larger and smaller values are equally probable. For discrete values, sort the samples $x_i$ into ascending order; if Ntot is odd, find the value of $x_i$ that has equal numbers of points above and below it. If Ntot is even this is not possible, so instead take the average of the two central values of the sorted distribution.
  – The mode is defined as the value of $x_i$ corresponding to the maximum of the pdf. For a quantized variable like the Kp index this corresponds to the discrete value of Kp that occurs most frequently. More generally it is taken to be the value at the center of the bin containing the largest number of values. For continuous variables the definition depends on the width of the bins used in determining the histogram: if the bins are too narrow there will be large fluctuations in the estimated pdf from bin to bin, and if the bins are too wide the location of the mode will be poorly resolved.
16
More on the Mode
• It is not necessary to create a histogram to obtain the mode of a distribution [Press et al., 1986, page 462]. It can be calculated directly from the data in the following manner (see the sketch after this list):
• Sort the data in ascending order.
• Choose a window width of J samples (J ≥ 3).
• For every i = 1, 2, …, Ntot − J estimate the pdf by using the formula
  $p\left( x_j = \frac{x_i + x_{i+J}}{2} \right) = \frac{J}{N_{tot} \left( x_{i+J} - x_i \right)}$
• Take as the mode the value of $(x_i + x_{i+J})/2$ corresponding to the largest estimate of the pdf.
• A section in Press et al. (1986) describes a complex procedure for choosing the most appropriate value of J.
17
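A minimal sketch of this windowed mode estimator in Python; the choice J = 10 is an arbitrary illustration rather than the adaptively chosen value Press et al. describe.

```python
def mode_estimate(data, J=10):
    """Estimate the mode directly from samples, following the windowed
    estimator of Press et al. [1986]: over each window of J sorted samples
    the pdf is roughly J / (Ntot * window width); the mode is the window
    midpoint where this estimate is largest."""
    x = sorted(data)
    n = len(x)
    best_p, best_mode = -1.0, None
    for i in range(n - J):
        width = x[i + J] - x[i]
        if width <= 0:                     # J identical values: a pdf spike
            return (x[i] + x[i + J]) / 2
        p = J / (n * width)
        if p > best_p:
            best_p, best_mode = p, (x[i] + x[i + J]) / 2
    return best_mode

# Hypothetical usage:
# import random
# samples = [random.gauss(2.0, 1.0) for _ in range(10_000)]
# print(mode_estimate(samples))            # ~ 2.0
```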
The Probability Distribution Function 1
• Probability is the statistical concept that describes the likelihood of the occurrence of a specific event. It is estimated as the ratio of the number of ways the specific event might occur to the total number of all possible occurrences, i.e. P(x) = N(x)/Ntot. Suppose we have a random variable X with values lying on the x axis. The probability density p(x) for X is related to probability through an integral:
  $P(a \le X \le b) = \int_a^b p(x)\, dx$
• Suppose we have a sample set of Ntot observations of the variable X. The probability distribution function (pdf) for this variable at the point $x_i$ is defined as
  $p(x_i) = \lim_{\Delta x \to 0,\ N_{tot} \to \infty} \frac{N[x_i, x_i + \Delta x]}{N_{tot}\, \Delta x}$
• Here Δx is the interval (or bin) of x over which occurrences of different values of X are accumulated, N[x_i, x_i+Δx] is the number of events found in the bin between x_i and x_i+Δx, and Ntot is the total number of samples in the set of observations of X.
18
The Probability Distribution Function 2
• Usually the sample set is not large enough to allow the limit to be achieved, so the pdf is approximated over a set of equal-width bins defined by the bin edges {x_i} = {x0, x0+Δx, x0+2Δx, x0+3Δx, …, x0+mΔx}.
• Normally x0 and x0+mΔx are chosen so that all points in the sample set fall between these two limits.
• A plot of the quantity N[x_i, x_i+Δx] calculated for all values of x with a fixed Δx is called a frequency histogram. The plot is called a probability histogram when the frequency of occurrence in each bin is normalized by the total number of occurrences, Ntot. The sum of all values of a probability histogram is 1.0.
• If the bin width is changed the occurrence probabilities will also change. To compensate for this the probability histogram is additionally normalized by the width of the bin to obtain the probability density function, which we refer to as the probability distribution function. The sum of all values of the probability density distribution equals 1/Δx. The bin width Δx is usually fixed, but in cases where some bins have very low occurrence probability it may be necessary to increase Δx as a function of x.
19
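The chain from counts to probability histogram to pdf is one line of arithmetic per step. A sketch with NumPy on a hypothetical Gaussian sample:

```python
import numpy as np

x = np.random.normal(0.0, 1.0, size=10_000)   # hypothetical sample set
dx = 0.25
edges = np.arange(-5.0, 5.0 + dx, dx)         # equal-width bin edges {x0, x0+dx, ...}

counts, _ = np.histogram(x, bins=edges)       # frequency histogram N[x_i, x_i+dx]
prob = counts / counts.sum()                  # probability histogram: sums to 1.0
pdf = prob / dx                               # pdf estimate: sums to 1/dx

print(prob.sum())                             # 1.0
print(pdf.sum() * dx)                         # 1.0, i.e. the pdf values sum to 1/dx
```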
Kp and Dst
• The Kp index is a measure of the strength of geomagnetic variations with period shorter than 3 hours, caused mainly by magnetospheric substorms. The index is roughly proportional to the logarithm of the range of deviation of the most disturbed horizontal component of the magnetic field from a quiet day in a 3-hr interval.
• Kp is available continuously from the beginning of 1932. It is dimensionless and quantized in multiples of 1/3. Its range is finite and limited to the interval [0, 9]. In the following section Kp is one of the time series we use to illustrate some commonly used statistical techniques.
• The Dst (disturbance storm time) index is a measure of the strength of the ring current created by the drifts of charged particles in the earth's magnetic field.
• A rapid decrease in Dst is an indication that the ring current is growing and that a magnetic storm is in progress. Ideally Dst is linearly proportional to the total energy of the drifting particles. The Sym-H index is a higher-resolution equivalent.
20
A Histogram of Kp
• The Kp index is what is called a "categorical variable": it can take on only a limited number of discrete values. By definition Kp ranges from 0.0, meaning geomagnetic activity is very quiet, to 9.0, meaning it is extremely disturbed. In this limited range it can assume 28 values corresponding to bins of width 1/3. Kp has no units because the numbers refer to classes of activity.
• The values of 0 and 9 are quite rare, since most of the time activity is slightly disturbed. A useful way to visualize the distribution of values assumed by the Kp index is to create a histogram.
• A histogram consists of a set of equal-width bins that span the dynamic range of a variable.
• If the number of occurrences in each bin is normalized by the total number of samples of Kp, one obtains the probability of occurrence of a given value.
• If in addition we divide by the width of the bin we obtain the probability density function (pdf), discussed on a later page.
21
Measures of Dispersion
• It is obvious from the Kp histogram that values of this variable are spread around a central value. Three standard measures of this dispersion are the mean absolute deviation, the standard deviation, and the interquartile range. The mean absolute deviation (mad) is defined by the formula
  $mad = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} |x_i - \bar{x}|$
• The standard deviation (root mean square deviation) is given by
  $\sigma = \sqrt{ \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} (x_i - \bar{x})^2 }$
• The upper and lower quartiles are defined in the same way as the median except that the values 1/4 and 3/4 are used instead of 1/2. The interquartile range (iqr) is the difference between the upper and lower quartiles (Q3 and Q1).
• For variables with a Gaussian pdf, 68% of all data values will lie within ±1 standard deviation of the mean. Similarly, by definition 50% of the data values fall within the interquartile range. Note that the standard deviation is more sensitive to values far from the mean than is the average absolute deviation.
22
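All three dispersion measures are one-liners with NumPy. A sketch (the Gaussian test sample is only an illustration):

```python
import numpy as np

def dispersion(x):
    """Mean absolute deviation, rms standard deviation, and interquartile range."""
    x = np.asarray(x, dtype=float)
    mad = np.mean(np.abs(x - x.mean()))        # mean absolute deviation
    std = x.std()                              # rms deviation from the mean
    q1, q3 = np.percentile(x, [25, 75])        # lower and upper quartiles
    return mad, std, q3 - q1

# For a Gaussian sample ~68% of values lie within +/-1 std of the mean,
# and exactly 50% lie within the interquartile range by construction.
print(dispersion(np.random.normal(0.0, 1.0, size=100_000)))
```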
Measures of Asymmetry and Shape
• The standard measure of asymmetry of a pdf is called skewness. It is defined by the third moment of the probability distribution. For discrete data the definition reduces to
  $skewness = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3$
• Because of the standard deviation in the denominator, skewness is a dimensionless quantity.
• Probability distribution functions can have wide variations in shape, from completely flat to very sharply peaked about a single value. A measure of this characteristic is kurtosis, defined as
  $kurtosis = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4 - 3$
• The factor 3 is chosen so that the kurtosis of a variable with a Gaussian distribution is zero. Negative kurtosis indicates a flat distribution with little clustering relative to a Gaussian, while positive kurtosis indicates a sharply peaked distribution.
23
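A sketch computing both shape measures together with the Gaussian significance scales sqrt(6/Ntot) and sqrt(24/Ntot) quoted on the next two slides:

```python
import numpy as np

def shape_measures(x):
    """Skewness, excess kurtosis, and their expected standard deviations
    for a Gaussian variable with the same number of points."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std()         # standardized deviations
    skew = np.mean(z**3)                 # third moment: asymmetry
    kurt = np.mean(z**4) - 3.0           # fourth moment; -3 so a Gaussian reads 0
    return skew, kurt, np.sqrt(6.0 / n), np.sqrt(24.0 / n)
```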
Statistical Properties of Kp and Dst
• The center of the Kp distribution is ~2.
• Dispersion about the central value is about 1.
• Skewness for Kp is +0.744, indicating that the pdf is skewed in the direction of positive values.
• If the pdf were Gaussian, the standard deviation of the skewness would depend only on the total number of points used in calculating the pdf: skewstd ~ sqrt(6/Ntot).
• For Kp this value is 0.0055, indicating a highly significant departure from a symmetric distribution.
• The corresponding values for Dst are -2.737 and 0.0040, indicating very significant asymmetry towards negative values.
24
Quantity             Kp           Dst
Ntot                 198,696      376,128
min                  0.0          -589
max                  9            92
mean                 2.317        -16.49
median               2.000        -12
mode                 1.3 to 1.7   -10 to 0
Ave. deviation       1.7173       17.04
Standard deviation   1.463        24.86
Lower quartile       1.333        -26
Upper quartile       3.333        -1
skewness             0.744        -2.737
skewstd              0.0055       0.0040
kurtosis             3.511        22.009
kurtstd              0.011        0.0080
The Shape of the Kp and Dst Distributions
• Negative kurtosis indicates a flat distribution with little clustering relative to a Gaussian, while positive kurtosis indicates a sharply peaked distribution.
  – For Gaussian variables the standard deviation of the kurtosis also depends only on the total number of points used in calculating the pdf and is approximately kurtstd ~ sqrt(24/Ntot).
  – Both distributions exhibit positive kurtosis, the Dst pdf to a greater extent than the Kp distribution. Thus the distributions for both indices are more sharply peaked than a Gaussian distribution would be.
25