Download Statistical Characterization of Data

Statistical Characterization of Data

What are the model-independent characteristics of a set of data? When a set of data tends to cluster about a particular value, it is said to have a central tendency. There are certain important measures of central tendency. They include:

The first moment, the mean, defined as:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The second moment, the variance, which is a measure of the spread about the central value, is defined as:

$$\sigma^2 = \frac{1}{N-1}\sum_{j=1}^{N} (x_j - \bar{x})^2$$

The square root of the variance is known as the standard deviation, $\sigma$. There are higher moments, which can be illustrated in the following diagram:

[Diagram: higher moments of a distribution]

Skewness

Skewness is a measure of the asymmetry of the distribution. It is defined as:

$$\text{skewness} = \frac{1}{N}\sum_{j=1}^{N}\left[\frac{x_j - \bar{x}}{\sigma}\right]^3$$

To be significant, the skewness should be greater in magnitude than $(15/N)^{1/2}$.

Kurtosis

Kurtosis is a measure of the flatness of the distribution. Formally, kurtosis is defined as:

$$\text{kurtosis} = \left\{\frac{1}{N}\sum_{j=1}^{N}\left[\frac{x_j - \bar{x}}{\sigma}\right]^4\right\} - 3$$

To be significant, the kurtosis should be greater in magnitude than $(96/N)^{1/2}$.

Median

The median is the value of x for which larger and smaller values of x are equally probable. Formally, with the $x_i$ sorted in ascending order, the median for even N is $0.5\,(x_{N/2} + x_{N/2+1})$. For odd N, the median is $x_{(N+1)/2}$.

t-Test

The t-test is designed to answer the question: Do two distributions have the same mean? If we define $S_D$ as the standard error of the difference of two means, then

$$S_D = \sqrt{\frac{\sum_{A}(x_i - \bar{x}_A)^2 + \sum_{B}(x_i - \bar{x}_B)^2}{N_A + N_B - 2}\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}$$

$$t = \frac{\bar{x}_A - \bar{x}_B}{S_D}$$

t is compared with expectations based upon assumed statistics (where one assumes the distributions have equal variances).

F-test

This statistic is designed to answer the question: Do two distributions have different variances? F is defined as the ratio of the two variances:

$$F = \frac{\sigma_A^2}{\sigma_B^2}$$

The value of F is compared with "expectations".

$\chi^2$ Test (Chi-squared test)

The $\chi^2$ test is designed to answer the question of whether two distributions are different.
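Before turning to the formal definition of chi-squared, the moment, median, and t definitions given earlier in this section can be sketched numerically. The following is a minimal illustration in plain Python, not a reference implementation; the function names (`moments`, `median`, `t_statistic`) are hypothetical and chosen only for this example.

```python
import math

def moments(xs):
    """Return (mean, variance, skewness, kurtosis) as defined in the text."""
    n = len(xs)
    mean = sum(xs) / n
    # Variance uses N - 1 in the denominator, as in the definition above.
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    sd = math.sqrt(var)
    # Skewness and kurtosis average over N, with deviations scaled by sd.
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n - 3.0
    return mean, var, skew, kurt

def median(xs):
    """Median of the sorted data (middle value, or mean of the two middle values)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def t_statistic(a, b):
    """t for two samples, assuming equal variances (pooled sum of squares)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s_d = math.sqrt(ss / (len(a) + len(b) - 2) * (1.0 / len(a) + 1.0 / len(b)))
    return (mean_a - mean_b) / s_d

if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(moments(data), median(data))
    print(t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The significance thresholds quoted above can then be checked by comparing the magnitude of the returned skewness with `math.sqrt(15 / N)` and of the kurtosis with `math.sqrt(96 / N)`.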
Formally, $\chi^2$ is defined as

$$\chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i}$$

where $N_i$ is the number of events in the ith bin in distribution 1 while $n_i$ is the number of events in the ith bin in distribution 2. If we define the number of degrees of freedom, $\nu$, as the number of data points minus the number of parameters determined from the points, we can also define a quantity, the reduced $\chi^2$, as $\chi^2/\nu$. Roughly, the reduced chi-squared should be about 1. The next page shows the expected values of F and $\chi^2$.

Distributions

Binomial Distribution

Describes the probability of observing x successes out of n tries when the probability for success in each try is p:

$$P_B(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$$

$$\mu = np, \qquad \sigma^2 = np(1-p)$$

Poisson Distribution

A Poisson distribution is the limiting case of the binomial distribution for $\mu \ll n$ because $p \ll 1$; it is appropriate for describing small samples from a large population.

$$P_P(x, \mu) = \frac{\mu^x}{x!}\, e^{-\mu}$$

$$\sigma^2 = \mu$$

Gaussian (Normal) Distribution

A Gaussian distribution is a limiting case of the binomial distribution for large n and finite p; it is appropriate for smooth symmetric distributions.

$$P_G(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

The half-width, $\Gamma$ (the full width at half maximum), is $2.354\sigma$ while the probable error is $0.6745\sigma$.

Lorentzian Distribution

The distribution relation is

$$P_L(x, \mu, \Gamma) = \frac{1}{\pi}\,\frac{\Gamma/2}{(x-\mu)^2 + (\Gamma/2)^2}$$

Propagation of Errors

Suppose we wish to determine x where

$$x = f(u, v, \ldots)$$

$$x_i - \bar{x} \simeq (u_i - \bar{u})\left(\frac{\partial x}{\partial u}\right) + (v_i - \bar{v})\left(\frac{\partial x}{\partial v}\right) + \cdots$$

We can express the variance $\sigma_x^2$ in terms of the variances of u, v, etc. as

$$\sigma_x^2 = \sigma_u^2\left(\frac{\partial x}{\partial u}\right)^2 + \sigma_v^2\left(\frac{\partial x}{\partial v}\right)^2 + \cdots$$

where we have neglected any correlation between u and v, etc. Using these ideas we can make a handy little table of relations that allow one to calculate the standard deviation associated with quantities after performing various arithmetic operations.
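The propagation formula above can be checked numerically for any function of uncorrelated inputs. The sketch below estimates the partial derivatives by central differences and adds the contributions in quadrature; the helper name `propagate` and the step size `rel_step` are assumptions made for this illustration, and the approach ignores correlations just as the formula does.

```python
import math

def propagate(f, values, sigmas, rel_step=1e-6):
    """Estimate sigma_x for x = f(*values) from uncorrelated input sigmas.

    Implements sigma_x^2 = sum_k sigma_k^2 * (df/du_k)^2, with each partial
    derivative approximated by a central finite difference.
    """
    var = 0.0
    for k, (v, s) in enumerate(zip(values, sigmas)):
        h = rel_step * (abs(v) or 1.0)   # step scaled to the size of the input
        up = list(values)
        dn = list(values)
        up[k] = v + h
        dn[k] = v - h
        deriv = (f(*up) - f(*dn)) / (2.0 * h)
        var += (s * deriv) ** 2
    return math.sqrt(var)

if __name__ == "__main__":
    # Product x = a*b with a = 3.0 +/- 0.1 and b = 4.0 +/- 0.2
    print(propagate(lambda a, b: a * b, [3.0, 4.0], [0.1, 0.2]))
    # Sum x = a + b with the same inputs
    print(propagate(lambda a, b: a + b, [3.0, 4.0], [0.1, 0.2]))
```

For the product, the result should agree with the closed-form relation $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$; for the sum, with $(\sigma_a^2 + \sigma_b^2)^{1/2}$.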
The table of relations is as follows:

| Function | Standard Deviation |
| --- | --- |
| $x = a + b$ | $\sigma_x = (\sigma_a^2 + \sigma_b^2)^{1/2}$ |
| $x = ab$ | $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$ |
| $x = a/b$ | $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$ |
| $x = a u^{\pm b}$ | $\sigma_x = \pm b\,\sigma_u\, x/u$ |
| $x = a e^{\pm bu}$ | $\sigma_x = \pm b\,\sigma_u\, x$ |
| $x = a^{\pm bu}$ | $\sigma_x = \pm (b \ln a)\,\sigma_u\, x$ |
| $x = a \ln(\pm bu)$ | $\sigma_x = a\,\sigma_u/u$ |

Weighted Means

All this stuff is nice but a tad unrealistic. Suppose you have a group of numbers that you are averaging. If one of them is very uncertain, you might not want that number to count the same as the rest in computing the average. The same comment applies to all the statistical measures you might compute from the data. Thus we need to understand the use of weights in computing various statistics. Consider a set of points $x_i$ with uncertainties $\sigma_i$. The weighted average of the group is given as

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where the weighting factors are taken as

$$w_i = \frac{1}{\sigma_i^2}$$

The variance of the weighted mean is then given as

$$\sigma_\mu^2 = \frac{1}{\sum \left(1/\sigma_i^2\right)}$$

Smoothing

Generally, smoothing is not something you wish to do with data because of the possibility of distorting or altering it. However, smoothing is a useful tool for allowing visual recognition of trends in data. It is important that smoothing be carried out in an impartial and well recognized manner. One very well recognized method of smoothing data is to apply a Savitzky-Golay filter to the data. This procedure involves, in effect, fitting the data to extract a smooth tendency from it. The formal definition of the Savitzky-Golay filters is given in the following table. Thus, to apply a 5-point smooth to a data set (a common choice), we say:

Y(I)=(-3*Y(I-2)+12*Y(I-1)+17*Y(I)+12*Y(I+1)-3*Y(I+2))/35

We apply this procedure stepwise for each point in the array. The effect of this smoothing upon data is shown in the next two figures, where smoothing the data reveals some peak structures not obvious in the original data.
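As a rough sketch of the weighted mean and the 5-point Savitzky-Golay smooth described above, the following plain-Python functions apply the formulas directly. The names `weighted_mean` and `smooth5` are hypothetical, and leaving the first two and last two points unsmoothed is an assumption of this example, since the text does not say how the ends of the array are handled.

```python
def weighted_mean(xs, sigmas):
    """Return (weighted mean, variance of the weighted mean) with w_i = 1/sigma_i^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    return mean, 1.0 / total

def smooth5(y):
    """Apply the 5-point Savitzky-Golay smooth with coefficients (-3, 12, 17, 12, -3)/35.

    Results are written to a copy so that every point is computed from the
    original, unsmoothed neighbours. Endpoints are left unchanged (assumption).
    """
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = (-3 * y[i - 2] + 12 * y[i - 1] + 17 * y[i]
                  + 12 * y[i + 1] - 3 * y[i + 2]) / 35.0
    return out

if __name__ == "__main__":
    print(weighted_mean([10.0, 20.0], [1.0, 2.0]))
    print(smooth5([0.0, 0.0, 1.0, 0.0, 0.0]))
```

Writing into a separate output list is a deliberate choice here: updating the array in place would feed already-smoothed values of Y(I-2) and Y(I-1) into later points, which is not what the filter coefficients assume.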
