GG 313 Lecture 5
Elementary Statistics

Class Web Site is UP: http://www.soest.hawaii.edu/~fred/GG313/

CRUISE: Please email me your phone number if you think you may come. I will need this to let you know of last-minute changes.

Homework 2
1) What’s the contrast in density? What’s the uncertainty?
2) Trick is to set it up correctly. All independent.
[Slide equations propagating the uncertainties of Value1 and Value3; the symbols are not recoverable from this transcript.]
3rd question?

Statistics
We often want to quantify the characteristics of data sets using a small number of parameters (statistics), often just a representative value.

The POPULATION consists of all the possible observations of a phenomenon. A population can be finite or infinite. Any subset of a population is called a SAMPLE. For example, 10 coin tosses are a sample of the population of all possible coin tosses, which is infinite.

Often our aim is to discover characteristics of a population by sampling it. Political polls sample a small fraction of the voter population to determine what the total population thinks.

A geo-example: You want to sample the population of dinosaurs living in the late Mesozoic in Wyoming. You look at existing fossils and collect fossils yourself to determine the characteristics of the dinosaur population. Because some dinosaurs may have lived in habitats where fossil formation was not likely, the sample is probably not representative.

We often want to know the central value of a population and of a sample, but there are several different “centers”.

The three M’s - Mean, Median, and Mode
The best known estimate of central location is the arithmetic MEAN. The mean of the sample is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The population mean (µ) is identical in form, but all members of the population are included. More often than not, we really don’t know µ, and are trying to determine it.
Characteristics of the mean:
• It always exists and can be calculated
• It is unique
• It is stable, not fluctuating much from sample to sample
• Means from different samples can be combined to form new statistics
• Every value is used to obtain the mean

Sample MEDIAN
In cases where there are a few wild data points, the sample mean may yield a poor estimate of the central value. A statistic that tends to ignore wild data is the sample median:

$$\tilde{x} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases}$$

where the $x_i$ are arranged in ascending order. The median is the middle value of the sorted data. It always exists and it is unique. Since the median depends only on which value sits in the middle of the sorted sample, and not on how large or small the other values are, it is not sensitive to wild values.

Consider the sample: 1, 2, 2, 5, 5, 7, 99. It has a mean of 17.3 and a median of 5. The great difference between the mean and median in this case is because of the one wild value (99).

Mode: The mode of a sample is the most frequent value. In the sample above, the mode is non-unique - both 2 and 5 occur twice. It may not exist at all if no value occurs more than once. We can usually make a sample with at least one mode by grouping the values into bins of almost-identical values. The mode is often denoted as $\hat{x}$.

In some cases, some elements of a sample are considered more important than others. Some measurements may be more precise, or better documented, than others. Large earthquakes may have better statistics than small ones, for example. We may want large earthquakes to count more than small ones when determining the location of an earthquake swarm. For situations such as this we can use a weighted mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where $w_i$ is the weight of the ith measurement.

VARIATIONS
While the central value of a sample is important, the variations around that value are often equally important. When we talked about exploratory data analysis we talked about the largest and smallest values and the “hinges”, or values bounding the middle half of the values.
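The three M’s and the weighted mean can be checked numerically on the sample 1, 2, 2, 5, 5, 7, 99 used above. This is a minimal sketch using Python’s standard `statistics` module. The helper `weighted_mean` and the weights are hypothetical, chosen only to show how down-weighting the wild value pulls the result back toward the bulk of the data.

```python
import statistics

sample = [1, 2, 2, 5, 5, 7, 99]

mean = statistics.mean(sample)        # arithmetic mean, uses every value
median = statistics.median(sample)    # middle value of the sorted data
modes = statistics.multimode(sample)  # all most-frequent values


def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical weights: trust the wild value (99) ten times less than the rest.
weights = [1, 1, 1, 1, 1, 1, 0.1]
wmean = weighted_mean(sample, weights)

print(f"mean = {mean:.1f}, median = {median}, modes = {modes}")
print(f"weighted mean = {wmean:.2f}")
```

With equal weights the weighted mean reduces to the ordinary mean, so the choice of weights is where the judgment about data quality enters.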
We could look at the deviations of each point from the mean:

$$\Delta x_i = x_i - \bar{x}$$

And take the average of the deviations:

$$\overline{\Delta x} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)$$

But this value is always zero, since the positive and negative deviations from the mean cancel exactly. You could take the absolute value of the deviations before summing, but this isn’t often done.

The most common measure of deviation for a population is defined by:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

where $\sigma^2$ is called the variance of the population, and

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$

is called the standard deviation of the population.

Usually, we’ll be working with samples of the population, not the population itself, and the definitions of variance and standard deviation are almost the same: the sum is divided by n-1, rather than N:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}$$

While this change isn’t very important in most cases, if n is small, the difference between n and n-1 can be significant.

Why the change? In going from the population to the sample, we’ve lost one degree of freedom. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. The number of values that can vary in the variance is n-1, since the mean uses up one of them.

BUT...? Let’s look more closely at the equation for the variance of a sample. What value of $\bar{x}$ minimizes $s^2$? We already know the answer, but can we prove it? Let:

$$f(\bar{x}) = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Then we want the value of $\bar{x}$ for which $df/d\bar{x} = 0$, ensuring either a maximum or a minimum in $f$. If $d^2f/d\bar{x}^2 > 0$, the value must be a minimum. Differentiating:

$$\frac{df}{d\bar{x}} = -\frac{2}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right) = 0$$

$$\sum_{i=1}^{n}\left(x_i - \bar{x}\right) = 0, \quad\text{or}\quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The 2nd derivative of the function is

$$\frac{d^2f}{d\bar{x}^2} = \frac{2}{n-1}\sum_{i=1}^{n} 1 = \frac{2n}{n-1}$$

which is > 0, so the result is a minimum.

So, the value of $\bar{x}$ that minimizes the variance is the mean that was defined earlier. This value, the mean, minimizes the least-squares misfit to the data samples, and is often called the L2 estimate of the central value.
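The result above can be checked numerically. The sketch below evaluates $f$ at the mean and at a few nearby trial centers, and compares the n-1 and N forms of the variance. The sample values and the trial offsets are arbitrary choices made for illustration only.

```python
import statistics

sample = [1, 2, 2, 5, 5, 7, 99]
n = len(sample)
xbar = statistics.mean(sample)


def f(center):
    """Sum of squared deviations about `center`, divided by n - 1."""
    return sum((x - center) ** 2 for x in sample) / (n - 1)


s2 = f(xbar)                                # sample variance (divide by n - 1)
sigma2 = statistics.pvariance(sample)       # population form (divide by N)

# Trial centers around the mean; the offsets are arbitrary.
candidates = [xbar + offset for offset in (-1.0, -0.1, 0.0, 0.1, 1.0)]
best = min(candidates, key=f)

print(f"s^2 (n-1) = {s2:.2f}, sigma^2 (N) = {sigma2:.2f}")
print(f"center with the smallest f: {best:.4f} (mean = {xbar:.4f})")
```

With only seven values the two variances differ by the factor n/(n-1) = 7/6, which is the small-n effect mentioned above.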
Similarly, we can show that the mean absolute deviation is minimized by using the median as the central estimate rather than the mean. The median is called the L1 estimate of central location.

$$\frac{d}{d\tilde{x}}\left[\frac{1}{n}\sum_{i=1}^{n}\left|x_i - \tilde{x}\right|\right] = -\frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \tilde{x}}{\left|x_i - \tilde{x}\right|} = 0$$

Look at the term in the sum: it can only take on values of ±1, or 0/0 when $x_i = \tilde{x}$. For the sum to be zero, the number of values greater than $\tilde{x}$ must equal the number of values less than $\tilde{x}$, which defines the median.

ROBUST Estimation
Estimates of central values and deviations don’t do us much good if they are sensitive to which sample we take. We are better off using estimators that are robust, that is, insensitive to variations in the sample.

We introduce the concept of “breakdown point” - the smallest fraction of points in a sample that have to be replaced by outliers to cause the estimator to lie outside reasonable values.

The mean is sensitive to even one outlier, thus its breakdown point is 1/n. The median, on the other hand, will not be thrown off until about half of the data values have been replaced by outliers, so its breakdown point is 50%.

Example of robustness of mean and median

             original    1 out    2 out    3 out    4 out    5 out    6 out
              -0.001    -0.001   -0.001   -0.001   -0.001   -0.001   -1.000
               0.044     0.044    0.044    0.044    0.044    0.044    0.044
               0.031     0.031    1.000    1.000    1.000    1.000    1.000
               0.047     0.047    0.047    0.047    0.047    0.047    0.047
              -0.020    -0.020   -0.020   -0.020    0.500    0.500    0.500
               0.020     0.020    0.020    0.020    0.020   -1.000   -1.000
              -0.034    -0.034   -0.034   -0.034   -0.034   -0.034   -0.034
              -0.076    -1.000   -1.000   -1.000   -1.000   -1.000   -1.000
              -0.002    -0.002   -0.002   -0.002   -0.002   -0.002   -0.002
               0.011     0.011    0.011    0.011    0.011    0.011    0.011
               0.000     0.000    0.000    1.000    1.000    1.000    1.000
              -0.023    -0.023   -0.023   -0.023   -0.023   -0.023   -0.023
Mean          -0.0003   -0.077    0.003    0.087    0.130    0.045   -0.038
Median        -0.0009   -0.001   -0.001    0.005    0.015    0.005    0.005
Std Dev        0.0350    0.292    0.427    0.515    0.527    0.620    0.690
3 Std Devs     0.1049    0.875    1.281    1.545    1.581    1.861    2.070

The column on the left is the original series of n=12 random numbers.
Each successive column to the right has had one more value replaced by an outlier. The mean, median, and standard deviation are calculated at the bottom.

[Figure: mean, median, and std dev plotted against the number of outliers (0 to 6).]

This is a plot of the above results. Note that the mean shows far more variation than the median as outliers are added. Thus, the median is more robust than the mean. From the graph, we also see that the standard deviation is not robust: it jumps with the first outlier and keeps growing as the number of outliers increases.

We don’t need to look far for a robust variation statistic; recall that the mean absolute deviation, the average of $|x_i - \tilde{x}|$, is minimized if $\tilde{x}$ is the median of the sample. We can thus define the median absolute deviation (MAD) as:

$$\mathrm{MAD} = 1.482\ \mathrm{median}\left|x_i - \tilde{x}\right|$$

The factor 1.482 is a scaling factor that makes the MAD equivalent to the standard deviation when the data are normally distributed.

[Figure: the same plot with the MAD added as a fourth curve.]

The MAD value has been added to the previous graph. Note that it is FAR more robust than the standard deviation until the number of outliers reaches n/2.

How do we identify outliers? We don’t want to delete real data! This is a real problem, as noted earlier. Great care must be taken when deleting data from consideration, and a search should be made for why the data are bad. A data point being statistically “off” is not sufficient reason to delete it.

We can normalize by the MAD value, defining a new variable:

$$z_i = \frac{x_i - \tilde{x}}{\mathrm{MAD}}$$

This new variable is unitless, and we can arbitrarily cut off all points where $|z_i| > 3$. This implies that data points to be deleted are more than 3 deviation units away from the median.
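The MAD and the normalized values $z_i$ can be sketched as follows, applied to the twelve numbers of the “2 out” case from the robustness example (original series with -0.076 replaced by -1.000 and 0.031 replaced by 1.000, rounded to three decimals). The function names `mad` and `robust_z` are hypothetical helpers, and because the inputs are rounded the results will differ slightly from the values on the slides. For comparison, the same 3-unit cutoff is also applied using the mean and standard deviation.

```python
import statistics


def mad(values, scale=1.482):
    """Median absolute deviation, scaled to be comparable to a std dev."""
    med = statistics.median(values)
    return scale * statistics.median([abs(x - med) for x in values])


def robust_z(values):
    """Deviation of each value from the median, in units of MAD."""
    med = statistics.median(values)
    spread = mad(values)
    if spread == 0:
        raise ValueError("MAD is zero; z values are undefined")
    return [(x - med) / spread for x in values]


# "2 out" case: two of the twelve values replaced by outliers.
two_out = [-0.001, 0.044, 1.000, 0.047, -0.020, 0.020,
           -0.034, -1.000, -0.002, 0.011, 0.000, -0.023]

z = robust_z(two_out)
flagged_robust = [i for i, zi in enumerate(z) if abs(zi) > 3]

# Same cutoff, but using the mean and standard deviation instead.
mean = statistics.mean(two_out)
std = statistics.stdev(two_out)
flagged_classic = [i for i, x in enumerate(two_out) if abs(x - mean) / std > 3]

print("flagged with median/MAD:", flagged_robust)
print("flagged with mean/std dev:", flagged_classic)
```

In this case the median/MAD rule picks out the two replaced values, while the mean/std dev rule flags nothing, because the outliers themselves inflate the standard deviation.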
           original    1 out    2 out    3 out    4 out    5 out    6 out
MAD:          0.03      0.03     0.03     0.05     0.05     0.06     0.40
|zi|:         0.02      0.02     0.02     0.12     0.33     0.10     2.52
              1.43      1.43     1.43     0.81     0.56     0.65     0.10
              1.01      1.01    31.69    20.27    19.10    16.42     2.50
              1.51      1.51     1.51     0.86     0.61     0.70     0.11
              0.61      0.61     0.61     0.50     9.40     8.17     1.24
              0.66      0.66     0.66     0.31     0.09    16.57     2.52
              1.06      1.06     1.06     0.79     0.96     0.64     0.10
              2.36     31.63    31.63    20.46    19.69    16.57     2.52
              0.02      0.02     0.02     0.13     0.33     0.10     0.02
              0.37      0.37     0.37     0.12     0.09     0.10     0.02
              0.02      0.02     0.02    20.27    19.10    16.42     2.50
              0.69      0.69     0.69     0.56     0.74     0.45     0.07

These are the zi values for the previous example. Note that with one to five outliers, each outlier shows up with a value far greater than 3, indicating that it is somehow “bad” and marking it as a candidate for deletion. With six outliers (n/2), the MAD itself has broken down (0.40) and no value exceeds 3.

What should you get from this? You now have a robust method for removal of outliers from a data set. You must also realize that the mean and standard deviation are NOT good statistics to use for removal of outliers.

Removal of outliers can be very important in many applications, and while least-squares operations, such as the mean and standard deviation, are extremely useful, they are best applied to data with NO outliers.

An excellent method for data analysis is thus provided:
1) Plot your raw data
2) Calculate the median, MAD, and zi values
3) Remove the outliers
4) Calculate the mean and standard deviation
These are sometimes called the least-trimmed squares (LTS) estimates.

Inferences about the mean
We are usually working with samples from some population. How well do the statistics of our sample compare with the statistics of the population?

In some cases, like the age of a rock, there is only one correct answer, and no standard deviation. But how well does our estimate of the age compare with the “true” age? In other cases, we would like the distribution of our sample to reflect the distribution of the population.

An important concept is presented by the Central Limit Theorem.
It states that: If n (the sample size) is large, then the distribution of sample means closely approximates a normal distribution.

The sample mean, $\bar{x}$, is an unbiased estimate of the population mean, µ. It can also be shown that the standard deviation of the means of many samples, $s_{\bar{x}}$, is related to the population standard deviation, $\sigma$, by:

$$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad\text{(N infinite)} \qquad\text{or}\qquad s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \quad\text{(N finite)}$$

Note that $s_{\bar{x}}^2$ approaches zero as n gets large.

The variance $s^2$ of many samples has a mean value of $\sigma^2$. For a normally distributed population, the spread of the sample variance is related to the population variance by:

$$\sigma_{s^2} = \sigma^2\sqrt{\frac{2}{n-1}}$$

Covariance and correlation
Earlier we noted that the sample variance is defined as:

$$s_x^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)}{n-1}$$

We often deal with pairs of properties, such as temperature vs depth, nitrogen vs oxygen, silica vs potassium, etc., and we would like to know how these parameters are related to each other. We can define a variance for each property, x and y. For y:

$$s_y^2 = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}{n-1} = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(y_i - \bar{y}\right)}{n-1}$$

We define the covariance as:

$$s_{xy}^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n-1}$$

The covariance tells us how x and y vary together. But what does it mean, particularly if the units of x and y are different? Note that the covariance can be negative! We overcome this problem by normalizing and getting rid of units:

$$r = \frac{s_{xy}^2}{s_x s_y}$$

This value, r, is the correlation coefficient.

If |r| is 1, x and y are perfectly correlated. If r=1, y is an exact linear function of x with positive slope. If r=-1, y is an exact linear function of x with negative slope (opposite in phase). If r is close to zero, x and y are uncorrelated.

Consider the correlations on the next slide. Note the circle in f). The correlation is zero despite the fact that x and y are highly related to each other. The correlation coefficient is looking for a linear correlation.

Moments
The rth moment of a sample about the mean is defined as:

$$m_r = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^r$$

Thus:
r = 1: µ, the mean (taken about the origin, since the first moment about the mean is zero)
r = 2: $\sigma^2$, the variance
r = 3: SK, skewness (symmetry)
r = 4: K, kurtosis (sharpness)
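The covariance and correlation formulas above can be tried on two simple cases: points on a straight line and points evenly spaced around a circle. This is a minimal sketch in plain Python. The helper names and the test data (the line y = 2x + 1 and a unit circle sampled at 360 angles) are assumptions made for illustration and do not come from the lecture.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance: sum((x_i - xbar)(y_i - ybar)) / (n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Correlation coefficient r = covariance / (s_x * s_y)."""
    s_x = math.sqrt(covariance(x, x))  # covariance of x with itself is s_x^2
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


# Straight line with positive slope: r should be 1 (within rounding).
x_line = [0.0, 1.0, 2.0, 3.0, 4.0]
y_line = [2.0 * x + 1.0 for x in x_line]
r_line = correlation(x_line, y_line)

# Evenly spaced points on a unit circle: strongly related, but not linearly.
n = 360
angles = [2.0 * math.pi * k / n for k in range(n)]
x_circle = [math.cos(a) for a in angles]
y_circle = [math.sin(a) for a in angles]
r_circle = correlation(x_circle, y_circle)

print(f"r for the line:   {r_line:.3f}")
print(f"r for the circle: {r_circle:.3f}")
```

The circle gives r essentially equal to zero even though y is fully determined (up to sign) by x, which is the point made about panel f): r measures only the linear part of a relationship.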