Topic 1: Descriptive Statistics
CEE 11, Spring 2001
Dr. Amelia Regan

These notes draw liberally from the class text, Probability and Statistics for Engineering and the Sciences by Jay L. Devore, Duxbury 1995 (4th edition).

definitions

A population consists of all objects of a certain type that are relevant to a particular study or analysis. For example, all students at UCI represent a population.

A sample is a subset or portion of the population. For example, the students in this class represent a sample of the population of students at UCI.

frequency distributions and histograms

A frequency is a count: the number of occurrences in the sample of a particular value, or of values that fall within a particular class. Classes must be mutually exclusive (no overlap allowed) and collectively exhaustive (the full range of the data must be covered). A histogram is a bar chart of the frequency distribution.

guidelines for forming class intervals

Use intervals of equal length with midpoints at convenient round numbers. For large data sets use more intervals; for small data sets use a small number of intervals.

Example

30 students are asked to submit their weights, with these results:

Men (18 in sample):   140 170 145 157 160 130 190 185 155 190 165 155 130 155 150 148 150 140
Women (12 in sample): 140 118 120 122 130 115 138 102 121 115 125 150

Example

We might break the sample into classes and construct the following frequency table:

class            100-<120  120-<140  140-<160  160-<180  180-<200
frequency            4         8        11         3         4
rel. freq.         0.133     0.267     0.367     0.100     0.133
class midpoint      110       130       150       170       190

From the table we can easily construct a histogram for the sample.

[Figure: Frequency Histogram for Weight Data. Horizontal axis: classes (100-<120 through 180-<200). Vertical axis: number of observations.]

mean

The mean of a sample or data set is simply the arithmetic average of the values in the set, obtained by summing the values and dividing by the number of values.
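As a minimal sketch of the two ideas introduced so far, the Python below sorts a sample into half-open classes (so the classes are mutually exclusive) and computes the arithmetic mean by summing and dividing. To keep the output short it uses only the 12 women's weights as listed in the example. The function names `frequency_table` and `mean` are illustrative choices, not part of the notes.

```python
def frequency_table(values, lower, width, n_classes):
    """Count how many values fall in each half-open class [lo, lo + width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)      # index of the class containing v
        if not 0 <= k < n_classes:
            # the classes must be collectively exhaustive
            raise ValueError(f"{v} falls outside the classes")
        counts[k] += 1
    return counts


def mean(values):
    """Arithmetic average: sum of the values divided by their number."""
    return sum(values) / len(values)


# Women's weights as listed in the example above
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

counts = frequency_table(women, lower=100, width=20, n_classes=5)
for k, f in enumerate(counts):
    lo = 100 + 20 * k
    print(f"{lo}-<{lo + 20}  freq={f}  rel.freq={f / len(women):.3f}  midpoint={lo + 10}")

print(f"mean = {mean(women):.2f}")
```

Using half-open intervals means a value such as 120 lands in exactly one class (120-<140), which is what the "no overlap" rule requires.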
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean of the sample of weights is 144.63 pounds.

mean of a frequency distribution

When we summarize a data set in a frequency distribution, we are approximating the data set by "rounding" each value in a given class to the class mark (the class midpoint).

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i = \sum_{i=1}^{k} p_i x_i$$

where $k$ is the number of classes, $x_i$ is the class mark of the $i$th class, $f_i$ is the frequency of the $i$th class, and $p_i = f_i/n$ is the proportion of observations in the $i$th class.

The mean of the weight data obtained in this way is 146.67 pounds.

median

The median is the value that is roughly in the middle of the data set. If $n$ is odd, the median is the single value in the middle, namely the value with rank $(n+1)/2$. If $n$ is even, there is not a single value in the middle, so the median is defined to be the average of the two middle values, namely the values with ranks $n/2$ and $n/2 + 1$.

The median for our example is (140 + 145)/2 = 142.5 lbs.

mode

The mode of a data set is the value that appears most often. The modal values for our sample are 130 and 140; the mode need not be a single value.

If data are broken into classes, the modal class is the class with the most members. The modal class for our sample is 140-<160.

range

The range or spread of a data set is the difference between its largest and smallest values. The weight data run from 102 to 190 lbs, so the range is 190 - 102 = 88 lbs.

variance

The variance of a population is the average of the squared deviations from the mean. The variance of a sample is approximately the average of the squared deviations from the mean (note that we divide the sum of the squared deviations by $n-1$ rather than $n$).

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad\qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

standard deviation

The standard deviation is the square root of the variance. It is useful because it is in the same units as the mean (and the original data), so it provides better insight into the relative variability of a sample.
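The verbal definitions of variance and standard deviation can be sketched directly in Python. The eight-value sample below is made up for illustration and does not come from the notes, and the helper names are hypothetical. The results are cross-checked against Python's standard `statistics` module.

```python
import math
import statistics


def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow
sample = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(sample)   # sum of squares 32, divided by 8
samp_var = sample_variance(sample)      # sum of squares 32, divided by 7
pop_sd = math.sqrt(pop_var)             # standard deviation = sqrt(variance)
samp_sd = math.sqrt(samp_var)

print(pop_var, samp_var, pop_sd, samp_sd)

# Cross-check against the standard library
assert math.isclose(pop_var, statistics.pvariance(sample))
assert math.isclose(samp_var, statistics.variance(sample))
assert math.isclose(samp_sd, statistics.stdev(sample))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample grows.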
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \qquad\qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The variance and standard deviation of the weight data are 559.14 lbs$^2$ and 23.64 lbs.

coefficient of variation

The coefficient of variation is the standard deviation divided by the mean:

$$\text{c.v.} = \frac{S}{\bar{x}}$$

It is used to compare the relative variability of more than one data set. For the weight data the coefficient of variation is 0.163.

shortcut formula for the variance

It's sometimes more convenient to use the following formula for the variance:

$$S^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

Class exercise

The National Weather Service maintains and publishes historical weather data for 100 US cities. The average annual rainfall in inches for the cities in the database beginning with A is listed below. Calculate the mean, median, range, variance and standard deviation for the following data.

City            Rainfall (in.)
Albany              35.74
Albuquerque          8.12
Anchorage           15.20
Asheville           47.71
Atlanta             48.61
Atlantic City       41.93
Austin              31.50

properties of $S^2$

Let $x_1, x_2, \ldots, x_n$ be a sample and $c$ be any nonzero constant.

If $y_1 = x_1 + c,\ y_2 = x_2 + c,\ \ldots,\ y_n = x_n + c$, then $S_y^2 = S_x^2$.

If $y_1 = cx_1,\ y_2 = cx_2,\ \ldots,\ y_n = cx_n$, then $S_y^2 = c^2 S_x^2$ and $S_y = |c|\,S_x$.

In other words, if we add a constant to a sample we do not change the variance; if we multiply by a constant we multiply the variance by the square of the constant.

related properties of the sample mean

Let $x_1, x_2, \ldots, x_n$ be a sample and $c$ be any nonzero constant.

If $y_1 = x_1 + c,\ y_2 = x_2 + c,\ \ldots,\ y_n = x_n + c$, then $\bar{y} = \bar{x} + c$.

If $y_1 = cx_1,\ y_2 = cx_2,\ \ldots,\ y_n = cx_n$, then $\bar{y} = c\bar{x}$.

In other words, if we add a constant to the sample or multiply it by a constant, the mean is shifted or scaled by the same constant.

Class exercise

Without using your calculators, calculate the mean and variance of the following data.

x_i | 35  40  45  50  55
----+--------------------
f_i | 13  11  14  13  12

Hint: shift the observations "to the left" by subtracting a constant and then divide by another constant.
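The hint can be sketched in Python as a check on hand work. The sketch assumes each value $x_i$ stands for $f_i$ repeated observations, and it picks the shift $a = 45$ and divisor $b = 5$ as one convenient choice (other constants work too). The names `grouped_mean` and `grouped_sample_variance` are illustrative, not from the notes. With $u_i = (x_i - a)/b$, the properties above give $\bar{x} = a + b\,\bar{u}$ and $S_x^2 = b^2 S_u^2$.

```python
import math


def grouped_mean(xs, fs):
    """Mean of data given as values xs with frequencies fs."""
    n = sum(fs)
    return sum(f * x for x, f in zip(xs, fs)) / n


def grouped_sample_variance(xs, fs):
    """Sample variance (divide by n - 1) of data given as values and frequencies."""
    n = sum(fs)
    m = grouped_mean(xs, fs)
    return sum(f * (x - m) ** 2 for x, f in zip(xs, fs)) / (n - 1)


values = [35, 40, 45, 50, 55]
freqs = [13, 11, 14, 13, 12]

# Assumed coding constants: subtract a, then divide by b
a, b = 45, 5
coded = [(x - a) / b for x in values]    # small integers centred on zero

mean_u = grouped_mean(coded, freqs)
var_u = grouped_sample_variance(coded, freqs)

# Undo the coding using the properties of the mean and of S^2
mean_x = a + b * mean_u
var_x = b ** 2 * var_u

# The coded route must agree with the direct computation
assert math.isclose(mean_x, grouped_mean(values, freqs))
assert math.isclose(var_x, grouped_sample_variance(values, freqs))

print(mean_x, var_x)
```

The coded values are small integers, which is what makes the arithmetic manageable without a calculator.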