Statistical Characterization of Data
What are the model-independent characteristics of a set of data?
When a set of data tends to cluster about a particular value, they are said to have a
central tendency. There are certain important measures of central tendency. They
include:
The first moment, the mean, defined as:
\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \]
The second moment, the variance, which is a measure of the spread about the central
value, is defined as:
\[ s^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2 \]
The square root of the variance is known as the standard deviation, s. There are higher
moments, which can be illustrated in the following diagram:
Skewness is a measure of the asymmetry of the distribution. It is defined as:
\[ \text{skewness} = \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{s} \right]^3 \]
To be significant, the skewness should be greater than $(15/N)^{1/2}$.
Kurtosis
Kurtosis is a measure of the flatness of the distribution. Formally, kurtosis is defined as:
\[ \text{kurtosis} = \left\{ \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{s} \right]^4 \right\} - 3 \]
To be significant, the kurtosis should be greater than $(96/N)^{1/2}$.
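As a rough illustration, the four moments defined above can be computed directly from their definitions. This is a minimal sketch: the function name `moments` and the sample values are made up for the example, and the standard deviation s uses the N - 1 denominator as in the variance definition.

```python
import math

def moments(xs):
    """Return (mean, variance, skewness, kurtosis) following the definitions above."""
    n = len(xs)
    mean = sum(xs) / n
    # Sample variance with the N - 1 denominator.
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    s = math.sqrt(var)
    # Third and fourth moments of the standardized deviations, divided by N.
    skew = sum(((x - mean) / s) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / s) ** 4 for x in xs) / n - 3.0
    return mean, var, skew, kurt

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
mean, var, skew, kurt = moments(data)
```

The skewness and kurtosis returned here can then be compared with the thresholds quoted above before reading anything into them.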
Median
The median is the value of x for which larger and smaller values of x are equally
probable. Formally, with the values sorted in ascending order, for even N the median is
$0.5\,(x_{N/2} + x_{N/2+1})$. For odd N, the median is $x_{(N+1)/2}$.
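The even and odd cases can be sketched as follows. The function name `median` is chosen for the example, and the index arithmetic assumes the formulas above use 1-based positions in the sorted data, while Python lists are 0-based.

```python
def median(xs):
    """Median of a sample; sorts a copy so the input list is left unchanged."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even N: average x_{N/2} and x_{N/2+1} (1-based), i.e. indices mid-1 and mid.
        return 0.5 * (ordered[mid - 1] + ordered[mid])
    # Odd N: x_{(N+1)/2} (1-based), i.e. index mid.
    return ordered[mid]
```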
t-Test
The t-test is designed to answer the question: Do two distributions have the same mean?
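If we define $s_D$ as the standard error of the difference of two means, then
\[ s_D = \sqrt{ \frac{\sum_{i \in A} (x_i - \bar{x}_A)^2 + \sum_{i \in B} (x_i - \bar{x}_B)^2}{N_A + N_B - 2} \left( \frac{1}{N_A} + \frac{1}{N_B} \right) } \]
\[ t = \frac{\bar{x}_A - \bar{x}_B}{s_D} \]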
The value of t is compared with the expectations of Student's t distribution for
$N_A + N_B - 2$ degrees of freedom (where one assumes the distributions have equal variances).
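A minimal sketch of the calculation, assuming two plain lists of measurements: the standard error of the difference of the means is built from the pooled sums of squares, and t follows from it. The function name `t_statistic` is hypothetical, and turning t into a significance level is left out because it needs the Student's t distribution.

```python
import math

def t_statistic(a, b):
    """Student's t for two samples, assuming both have the same variance."""
    na, nb = len(a), len(b)
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    # Sums of squared deviations about each sample's own mean.
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    # Standard error of the difference of the two means.
    s_d = math.sqrt((ss_a + ss_b) / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (mean_a - mean_b) / s_d
```

For example, `t_statistic([1, 2, 3], [2, 3, 4])` gives a negative t because the first sample has the smaller mean.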
F-test
This statistic is designed to answer the question: Do two distributions have different
variances? F is defined as the ratio of (variance of A)/(variance of B). The value of F is
compared with “expectations”.
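The ratio itself is simple to compute. The sketch below assumes sample variances with the N - 1 denominator and keeps A in the numerator as in the definition above; the function name `f_statistic` is made up for the example.

```python
def f_statistic(a, b):
    """Ratio (variance of A) / (variance of B) using sample variances."""
    def sample_variance(xs):
        n = len(xs)
        mean = sum(xs) / n
        return sum((x - mean) ** 2 for x in xs) / (n - 1)

    return sample_variance(a) / sample_variance(b)
```

Values of F far from 1 in either direction suggest the variances differ.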
χ² Test (Chi-squared test)
The χ² test is designed to answer the question of whether two distributions are different.
Formally χ² is defined as
\[ \chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i} \]
where $N_i$ is the number of events in the $i$th bin in distribution 1 while $n_i$ is the number of
events in the $i$th bin in distribution 2. If we define the number of degrees of freedom, $\nu$,
as the number of data points minus the number of parameters determined from the points, we
can also define the reduced χ² as $\chi^2/\nu$. Roughly, the reduced chi-squared should be
about 1. The next page shows the expected values of F and χ².
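The sum over bins and the reduced value can be sketched as follows. The function names and the `n_params` argument are assumptions for the example, and bins with a zero count in the second distribution are not handled.

```python
def chi_squared(counts_1, counts_2):
    """Sum over bins of (N_i - n_i)^2 / n_i; every n_i must be nonzero."""
    if len(counts_1) != len(counts_2):
        raise ValueError("both distributions must use the same bins")
    return sum((N - n) ** 2 / n for N, n in zip(counts_1, counts_2))

def reduced_chi_squared(counts_1, counts_2, n_params=0):
    """Chi-squared divided by the degrees of freedom (bins minus fitted parameters)."""
    dof = len(counts_1) - n_params
    return chi_squared(counts_1, counts_2) / dof
```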
Distributions
Binomial Distribution
Describes the probability of observing x successes out of n tries when the probability of
success in each try is p:
\[ P_B(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} \]
\[ \mu = np \]
\[ \sigma^2 = np(1-p) \]
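As a quick numerical check of the mean and variance, the probabilities can be summed directly. This is only a sketch with arbitrarily chosen n and p; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x successes in n tries with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # illustrative values only
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(probs)                                      # should be 1
mean = sum(x * P for x, P in enumerate(probs))          # should be n*p
variance = sum((x - mean) ** 2 * P for x, P in enumerate(probs))  # should be n*p*(1-p)
```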
Poisson Distribution
A Poisson distribution is the limiting case of the binomial
distribution for $\mu \ll n$ because $p \ll 1$; it is appropriate for
describing small samples from a large population.
\[ P_P(x; \mu) = \frac{\mu^x}{x!}\, e^{-\mu} \]
\[ \sigma^2 = \mu \]
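The limiting behaviour can be seen numerically by comparing the two distributions for a large number of tries and a small success probability. The values below are arbitrary choices for the sketch.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability of observing x events when the mean is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n tries."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

mu = 2.0
n = 10_000      # many tries
p = mu / n      # small probability per try, so that n*p = mu
# Largest pointwise difference between the two distributions over x = 0..19.
max_diff = max(abs(poisson_pmf(x, mu) - binomial_pmf(x, n, p)) for x in range(20))
```

Making n larger while holding n*p fixed shrinks `max_diff` further.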
Gaussian (Normal) Distribution
A Gaussian distribution is a limiting case of the binomial distribution for large n and
finite p; it is appropriate for smooth symmetric distributions.
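\[ P_G(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \]
The half-width, $\Gamma$ (the full width at half maximum), is $2.354\sigma$ while the probable
error is $0.6745\sigma$.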
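Both numerical factors can be checked with a few lines. The sketch below assumes the half-width means the full width at half maximum, and that the probable error is the half-range about the mean containing half of the probability; the values of mu and sigma are arbitrary.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

mu, sigma = 0.0, 1.5  # illustrative values only
# Full width at half maximum: 2*sqrt(2 ln 2)*sigma, about 2.354*sigma.
fwhm = 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma
peak = gaussian_pdf(mu, mu, sigma)
half = gaussian_pdf(mu + fwhm / 2.0, mu, sigma)  # density half a width from the peak
# Probability of falling within +/- 0.6745*sigma of the mean.
inside = math.erf(0.6745 / math.sqrt(2.0))
```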
Lorentzian distribution
The distribution relation is
\[ P_L(x; \mu, \Gamma) = \frac{1}{\pi} \frac{\Gamma/2}{(x - \mu)^2 + (\Gamma/2)^2} \]
Propagation of Errors
Suppose we wish to determine x where
\[ x = f(u, v, \ldots) \]
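Expanding to first order about the mean values, the deviation of each result from the mean is
\[ x_i - \bar{x} \simeq (u_i - \bar{u}) \left( \frac{\partial x}{\partial u} \right) + (v_i - \bar{v}) \left( \frac{\partial x}{\partial v} \right) + \cdots \]
We can express the variance $\sigma_x^2$ in terms of the variances of u, v, etc. as
\[ \sigma_x^2 = \sigma_u^2 \left( \frac{\partial x}{\partial u} \right)^2 + \sigma_v^2 \left( \frac{\partial x}{\partial v} \right)^2 + \cdots \]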
where we have neglected any correlation between u and v, etc. Using these ideas we can
make a handy little table of relations that allow one to calculate the standard deviation
associated with quantities after performing various arithmetic operations. The table is as
follows:
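Function                 Standard Deviation
$x = a + b$              $\sigma_x = (\sigma_a^2 + \sigma_b^2)^{1/2}$
$x = ab$                 $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$
$x = a/b$                $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$
$x = a u^{\pm b}$        $\sigma_x = \pm b\,\sigma_u\,x/u$
$x = a e^{\pm bu}$       $\sigma_x = \pm b\,\sigma_u\,x$
$x = a^{\pm bu}$         $\sigma_x = \pm (b \ln a)\,\sigma_u\,x$
$x = a \ln(\pm bu)$      $\sigma_x = a\,\sigma_u/u$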
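When a function is not in the table, the general variance formula can be applied numerically. The sketch below is one possible approach, not part of the original notes: it estimates each partial derivative with a central difference and, like the formula above, neglects correlations. The function name `propagate`, the step size, and the test values are assumptions. The product row of the table serves as a cross-check.

```python
import math

def propagate(f, values, sigmas, rel_step=1e-6):
    """Standard deviation of f(*values) for uncorrelated inputs.

    Partial derivatives are estimated by central differences.
    """
    var = 0.0
    for k, (value, sigma) in enumerate(zip(values, sigmas)):
        h = rel_step * (abs(value) or 1.0)  # step scaled to the size of the input
        up = list(values)
        down = list(values)
        up[k] = value + h
        down[k] = value - h
        partial = (f(*up) - f(*down)) / (2.0 * h)
        var += (sigma * partial) ** 2
    return math.sqrt(var)

a, sa = 10.0, 0.3  # illustrative values only
b, sb = 4.0, 0.2
numeric = propagate(lambda a, b: a * b, [a, b], [sa, sb])
# Table entry for x = ab.
table = (a * b) * math.sqrt((sa / a) ** 2 + (sb / b) ** 2)
```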
Weighted Means
All this stuff is nice but a tad unrealistic. Suppose you have a group of numbers that you
are averaging. If one of them is very uncertain, you might not want that number to count
the same as the rest in computing the average. The same comment applies to all the
statistical measures you might compute from the data. Thus we need to understand the
use of weights in computing various statistics. Consider a set of points $x_i$ with
uncertainties $\sigma_i$. The weighted average of the group is given as
\[ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \]
where the weighting factors are taken as
\[ w_i = \frac{1}{\sigma_i^2} \]
The variance of the weighted mean is then given as
\[ \sigma_m^2 = \frac{1}{\sum \left( 1/\sigma_i^2 \right)} \]
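A short sketch of the weighted average and its uncertainty, using inverse-variance weights. The function name `weighted_mean` and the sample numbers are made up for the example.

```python
import math

def weighted_mean(xs, sigmas):
    """Return (weighted mean, standard deviation of the weighted mean)."""
    weights = [1.0 / s ** 2 for s in sigmas]  # weight = 1 / sigma^2
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    sigma_mean = math.sqrt(1.0 / total)
    return mean, sigma_mean

# The second point is twice as uncertain, so it counts a quarter as much.
mean, sigma_mean = weighted_mean([10.0, 12.0], [1.0, 2.0])
```

With these numbers the result sits much closer to 10 than to 12, which is the behaviour the text asks for.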
Smoothing
Generally smoothing is not something you wish to do with data because of the
possibility of distorting or altering it. However, smoothing is a useful tool for allowing
visual recognition of trends in data. It is important that smoothing be carried out in an
impartial and well recognized manner. One very well recognized method of smoothing
data is to apply a Savitzky-Golay filter to the data. This procedure involves, in effect,
fitting the data to extract a smooth tendency from it. The formal definition of the
Savitzky-Golay filters is given in the following table.
Thus, to apply a 5-point smooth to a data set (a common choice), we say:
Y(I)=(-3*Y(I-2)+12*Y(I-1)+17*Y(I)+12*Y(I+1)-3*Y(I+2))/35.
We apply this procedure stepwise for each point in the array. The effect of this
smoothing upon data is shown in the next two figures where smoothing the data reveals
some peak structures not obvious in the original data.
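The 5-point smooth above can be sketched in Python as follows. Two choices here are assumptions rather than part of the notes: the smoothed values are written to a separate list so that each point is computed from the original neighbours only, and the two points at each end, which lack a full set of neighbours, are left unchanged.

```python
def smooth5(y):
    """Apply the 5-point smooth with weights (-3, 12, 17, 12, -3) / 35."""
    n = len(y)
    out = list(y)  # copy, so smoothed values are never fed back into the sum
    for i in range(2, n - 2):
        out[i] = (-3 * y[i - 2] + 12 * y[i - 1] + 17 * y[i]
                  + 12 * y[i + 1] - 3 * y[i + 2]) / 35.0
    return out
```

Calling `smooth5` on data that already follow a parabola returns the same values, which is a quick way to confirm the weights are entered correctly.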