251descr 8/16/01 (Open this document in 'Outline' view!)

ECONOMICS 251 COURSE OUTLINE

A. Introduction
   1. Definitions
      Define Statistics, Descriptive and Analytic Statistics, Induction and Deduction.
   2. Uses of Statistics

B. Sources and Types of Data
   1. Data
      Define data sets, observation, unit of observation. Qualitative and quantitative data. Nominal, ordinal, interval and ratio data. Discrete vs continuous data.
      a. Qualitative Data
         (i) Nominal Data: There is no natural number scale. Numbers are only used to define categories, so that no operations like addition or multiplication are valid.
         (ii) Ordinal Data: Numbers are used only to order things (e.g. first, second, third). Differences between ranks do not always have the same meaning. Most mathematical operations are still not valid.
      b. Quantitative Data
         (i) Interval Data: Differences between values have a consistent meaning but, as with Celsius temperature, there is no natural origin, so that, although addition and subtraction can be used, multiplication and division have no real meaning.
         (ii) Ratio Data: There is a meaningful origin, so that multiplication and division are valid.
   2. Sources
      Define primary and secondary sources, internal and external data.
   3. Cross Section and Time Series Data
      a. Cross Section Data
      b. Time Series Data
         (i) Indices
         (ii) Real Values
         (iii) Rates of change
         (iv) Logarithms

C. Presentation of Data
   1. Classification
      Define collectively exhaustive and mutually exclusive classes. These are not the same thing. Collectively exhaustive means that every item you are considering has a place in a class. Mutually exclusive means that if an item belongs in any given class, it does not belong in another class as well.
   2. Tables
      Define parts of tables. See 251pttbl.
   3. Charts and Graphs
      Define parts of graphs.

D. Frequency Distributions and Populations.
   1. Definitions
      Meaning of Population, Frame, Census, Sample, Grouped Data, Frequency, Example of Frequency Distribution, Relative Frequency.
      Width of a class interval: $w = \dfrac{\text{largest} - \text{smallest}}{\text{number of classes}}$ (Always round this result up!)
   2. Graphs of the Frequency Distribution.
      a. The Histogram
      b. The Frequency Polygon
      c. The Cumulative Frequency Distribution (Ogive).
      d. Relative Frequencies.
      e. Smoothed Histograms

E. Sampling and Descriptive Statistics.
   1. Sampling to Learn About a Population.
      Infinite and finite populations, target and sampled populations, the Stability of Mass Data.
   2. The Meaning of Random Sampling.
      A simple random sample of $n$ items taken from a population of $N$ items must be selected in such a way that all combinations of $n$ items are equally likely.
   3. Descriptive Statistics.
      a. Measures of Central Tendency. (Where's the middle of the data?)
      b. Measures of Dispersion. (How spread out are the data?)
      c. Measures of Asymmetry etc. (What else can I say about the shape?)

F. Measures of Central Tendency.
   1. The Arithmetic Mean of Ungrouped Data.
      a. The Population Mean. $\mu = \dfrac{\sum x}{N}$
      b. The Sample Mean. $\bar{x} = \dfrac{\sum x}{n}$
   2. The Arithmetic Mean of Grouped Data.
      For grouped data generally substitute $\sum f$ for $\sum$. For $x$ substitute the midpoint of the group. This is defined for our purposes as the arithmetic mean of the lower limit of the group in question and the lower limit of the next group. In other words, if we have the group 10 to 10.99, followed by 11 to 11.99, the midpoint of the first group is 10.50, not 10.495.
   3. The Weighted Arithmetic Mean.
      $\mu = \dfrac{\sum wx}{\sum w}$, $\bar{x} = \dfrac{\sum wx}{\sum w}$
   4. The Median of Ungrouped Data.
      Defined simply as the middle point when the data are in order. If there are two middle points, take their arithmetic mean. In continuous data half the points will lie above the median and half below it.
   5. The Median of Grouped Data.
      $x_{1-p} = L_p + w\left[\dfrac{pn - F}{f_p}\right]$. See the formula for fractiles below and remember that the median is the .5 fractile. $\text{position} = \frac{1}{2}(n+1)$.
   6. The Mode
      Simply the most common point, not very useful in discrete ungrouped data. For grouped data it is defined as the midpoint of the group with the largest frequency.
   7. Other Means.
      a. The Geometric Mean. $x_g = \left(x_1 \cdot x_2 \cdot x_3 \cdots x_n\right)^{1/n} = \left(\prod x\right)^{1/n}$ or $\ln x_g = \dfrac{1}{n}\sum \ln(x)$
      b. The Harmonic Mean. $\dfrac{1}{x_h} = \dfrac{1}{n}\sum \dfrac{1}{x}$
      c. The Root-Mean-Square. $x_{rms} = \sqrt{\dfrac{1}{n}\sum x^2}$ or $x_{rms}^2 = \dfrac{1}{n}\sum x^2$
      d. What Formulas for Means Have in Common. $f(\bar{x}) = \dfrac{1}{n}\sum f(x)$
   8. Measures of Position.
      Percentiles, deciles, quintiles, quartiles and fractiles. The two formulas below are two-step formulas. The first step is multiplying $n+1$ (or $N+1$) by $p$. $p$ represents the fractile of the data wanted. For example, if we want the 91st percentile, $p$ is .91. Note that the number you have found is called $x_{1-p} = x_{1-.91} = x_{.09}$ (i.e. 9% from the top!). If we want the third quartile, $Q3 = x_{.25}$, $p$ is $\frac{3}{4}$ or 0.75. If we want the first quartile, $Q1 = x_{.75}$, $p$ is $\frac{1}{4}$ or 0.25. Of course, for the median $p = .5$. $N$ or $n$ represents the number of items in the population or sample, not the number of groups.
      a. Finding a Fractile of Grouped Data.
         To use this formula, we must first compute the cumulative distribution of the groups and determine in which group the desired fractile is located with the calculation $\text{position} = p(n+1)$. Once we have found the group that this is in, let $f_p$ be the frequency of the chosen group, and let $F$ be the cumulative frequency up to but not including the chosen group. The formula here is
         $x_{1-p} = L_p + w\left[\dfrac{pn - F}{f_p}\right]$.
         In this formula, $w$ is the class interval (the interval between the lower limit of the chosen group and the lower limit of the next group) and $L_p$ is the lower limit of the chosen group.
         Suppose that in the example below we must find the first quartile. Since the first quartile is the .25 fractile, $p$ is .25. To locate the group use $\text{position} = p(n+1) = 0.25(16) = 4$.

         Profit Rate    f    F
         9-10.99%       3    3
         11-12.99%      3    6
         13-14.99%      5   11
         15-16.99%      3   14
         17-18.99%      1   15
         Total         15

         Using the cumulative distribution ($F$) column, we find the fourth item in the sample. Since 4 is above 3 and below 6 in the $F$ column, we pick the group 11-12.99%. $n$ is 15, and for the group we have picked, $w = 13 - 11 = 2$, $L_p = 11$, $F = 3$, and $f_p = 3$. If we put these numbers into the formula, we find that
         $x_{1-.25} = x_{.75} = 11 + 2\left[\dfrac{.25(15) - 3}{3}\right] = 11.5$.
         Note: Sometimes $\dfrac{pn - F}{f_p}$ is negative. In this case choose the group before the one you would ordinarily have chosen. Example: If you want the 19th percentile of the data above, $\text{position} = p(n+1) = .19(16) = 3.04$, which would normally take us into 11-12.99. But $\dfrac{pn - F}{f_p} = \dfrac{.19(15) - 3}{3} = -0.05$, so use the group 9-10.99 instead. But see c below.
      b. Finding a Fractile of Ungrouped Data.
         This time when we compute $\text{position} = p(n+1)$, we divide it into an integer part, $a$, and a fractional part, $.b$. For example, if $n = 10$ and we wish to find the first quartile, $p = 0.25$, so that $p(n+1) = 0.25(11) = 2.75$. Then $a = 2$ and $.b = .75$. Now find $x_a$ and $x_{a+1}$, in this case $x_2$ and $x_3$, and use the formula
         $x_{1-p} = x_a + .b\,(x_{a+1} - x_a)$.
         For example, if our sample consists of the 10 numbers 1, 5, 7, 9, 9, 11, 13, 14, 17, 19, then $x_a = x_2 = 5$ and $x_{a+1} = x_3 = 7$, so that $x_{1-p} = x_{.75} = 5 + 0.75(7 - 5) = 6.5$.
      c. Experimental formula (Don't read this!)
         Because of problems with the grouped data formula above, I intend to experiment with a new pair of formulas:
         $\text{position} = 1 + p(n-1) = a.b$ (the position formula can be used with both grouped and ungrouped data) and
         $x_{1-p} = L_p + w\left[\dfrac{p(n-1) + 0.5 - F}{f_p}\right]$.
         Example: Using the data in 8a, $n = 15$.
         First quartile: $\text{position} = 1 + p(n-1) = 1 + .25(14) = 4.5$. This is in group 11-12.99.
         $x_{1-.25} = x_{.75} = 11 + 2\left[\dfrac{.25(14) + 0.5 - 3}{3}\right] = 11.67$.
         Median: $\text{position} = 1 + p(n-1) = 1 + .5(14) = 8$ (Same as with the old formula). This is in group 13-14.99.
         $x_{1-.5} = x_{.5} = 13 + 2\left[\dfrac{.5(14) + 0.5 - 6}{5}\right] = 13.6$.
         Third quartile: $\text{position} = 1 + p(n-1) = 1 + .75(14) = 11.5$. This is in group 15-16.99.
         $x_{1-.75} = x_{.25} = 15 + 2\left[\dfrac{.75(14) + 0.5 - 11}{3}\right] = 15$.
         Seventy-fourth percentile: $\text{position} = 1 + p(n-1) = 1 + .74(14) = 11.36$. This is in group 13-14.99. Why? The cumulative frequency through 13-14.99 is 11. This means that numbers up to $x_{11}$ are in 13-14.99 or lower groups and that $x_{12}$ and numbers above it are in 15-16.99 and higher groups. Thus we set the boundary at 11.5.
         $x_{1-.74} = x_{.26} = 13 + 2\left[\dfrac{.74(14) + 0.5 - 6}{5}\right] = 14.94$.
         Nineteenth percentile: $\text{position} = 1 + p(n-1) = 1 + .19(14) = 3.66$. This is in group 11-12.99.
         $x_{1-.19} = x_{.81} = 11 + 2\left[\dfrac{.19(14) + 0.5 - 3}{3}\right] = 11.11$.

G. Measures of Dispersion and Asymmetry.
   1. Range
      $\text{Range} = \text{highest number} - \text{lowest number}$, or $\text{highest midpoint} - \text{lowest midpoint}$.
      Interquartile Range: $IQR = Q3 - Q1$.
   2. The Variance and Standard Deviation of Ungrouped Data.
      a. The Population Variance - Definitional and Computational Formulas.
         Definitional: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
         Computational: $\sigma^2 = \dfrac{\sum x^2}{N} - \mu^2$
         Standard Deviation $= \sqrt{\text{variance}}$
      b. The Sample Variance.
         Definitional: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n-1}$
         Computational: $s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n-1}$
         The computational formula is one of the most important formulas you will learn. Note that $\sum x^2$ is not the same as $\left(\sum x\right)^2$. For example, if $x$ is 2, 3, 5, then $\sum x^2$ is $2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38$, not $(2 + 3 + 5)^2 = 10^2 = 100$.
         Example: Use $x = 2, 3, 5$.

         Computational Method
           x     x^2
           2      4
           3      9
           5     25
          10     38

         Definitional Method
           x     x - xbar    (x - xbar)^2
           2    -1.33333      1.77778
           3    -0.33333      0.11111
           5     1.66667      2.77778
          10     0.00001      4.66667

         From this we find $\sum x = 10$, $\sum x^2 = 38$, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{10}{3} = 3.33333$ and $\sum (x - \bar{x})^2 = 4.66667$. Note that $\sum (x - \bar{x})$ should be zero, but is not because of rounding.
         Now, if we use the computational method, we can use $s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n-1} = \dfrac{38 - 3(3.33333)^2}{2} = \dfrac{38 - 33.33333}{2} = \dfrac{4.6667}{2} = 2.3333$. (Some texts prefer $s^2 = \dfrac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n-1} = \dfrac{38 - \frac{1}{3}(10)^2}{3-1} = \dfrac{4.66666667}{2} = 2.33333$, which gives us a little more accuracy for a little more work.)
         If we use the definitional method, $s^2 = \dfrac{\sum (x - \bar{x})^2}{n-1} = \dfrac{4.66667}{2} = 2.33333$, but note that we had to do three subtractions instead of one.
      c. The Coefficient of Variation. $C = \dfrac{\text{std. deviation}}{\text{mean}}$
      d. Chebyshef's Inequality and the Empirical Rule
         Chebyshef Inequality: $P\left(|x - \mu| \ge k\sigma\right) \le \dfrac{1}{k^2}$
         Empirical rule: (For Symmetrical Unimodal distributions only) about 68% of the data lie within one standard deviation of the mean, about 95% within two, and almost all within three.
   3. The Variance and Standard Deviation of Grouped Data.
      For grouped data generally substitute $\sum f$ for $\sum$.
   4. Skewness and Kurtosis.
      Population skewness, the 3rd k-statistic, coefficients of skewness; population kurtosis, the 4th k-statistic, the coefficient of excess; leptokurtic, platykurtic and mesokurtic distributions.
      The usual measurement of skewness is often called the third moment about the mean. (The population variance is the second.) The formula for population skewness is
      $\mu_3 = \dfrac{\sum (x - \mu)^3}{N}$.
      The corresponding sample statistic is the third k-statistic,
      $k_3 = \dfrac{n}{(n-1)(n-2)} \sum (x - \bar{x})^3$.
      The corresponding computational formulas are
      $\mu_3 = \dfrac{1}{N}\left[\sum x^3 - 3\mu \sum x^2 + 2N\mu^3\right]$ and $k_3 = \dfrac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right]$.
      To make grouped data formulas, put an $f$ to the right of the $\sum$ sign. Positive values of these formulas imply skewness to the right, negative values to the left. Note that multiplying all the values of $x$ by two would multiply the values of these coefficients by eight, but would not change the shape of the distribution. If we want to compare shapes, we need measurements that will not change if we multiply all values by a constant. Such a measure would be called the coefficient of relative skewness, with the formulas $\gamma_1 = \dfrac{\mu_3}{\sigma^3}$ and $g_1 = \dfrac{k_3}{s^3}$. Note that for the Normal distribution $\mu_3 = 0$. Another measure of skewness is Pearson's measure of skewness, $SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}}$; the median is sometimes used instead of the mode in this formula.

      Example:

      Profit Rate    f    x (midpoint)    fx     fx^2     fx^3
      9-10.99        3        10          30     300      3000
      11-12.99       3        12          36     432      5184
      13-14.99       5        14          70     980     13720
      15-16.99       3        16          48     768     12288
      17-18.99       1        18          18     324      5832
      Total         15                   202    2804     40024

      So $n = \sum f = 15$, $\sum fx = 202$, $\sum fx^2 = 2804$, $\sum fx^3 = 40024$, so that $\bar{x} = \dfrac{\sum fx}{n} = \dfrac{202}{15} = 13.467$ and $s^2 = \dfrac{\sum fx^2 - n\bar{x}^2}{n-1} = \dfrac{2804 - 15(13.467)^2}{14} = \dfrac{83.733}{14} = 5.981$, which means $s = \sqrt{5.981} = 2.446$. $C = \dfrac{s}{\bar{x}} = \dfrac{2.446}{13.467} = 0.182$.
      To measure skewness, use one of the following three results.
      $k_3 = \dfrac{n}{(n-1)(n-2)}\left[\sum fx^3 - 3\bar{x} \sum fx^2 + 2n\bar{x}^3\right] = \dfrac{15}{(14)(13)}\left[40024 - 3(13.467)(2804) + 2(15)(13.467)^3\right] = \dfrac{15(8.249)}{(14)(13)} = 0.680$,
      or Relative Skewness $g_1 = \dfrac{k_3}{s^3} = \dfrac{0.680}{(2.446)^3} = 0.046$,
      or Pearson's Measure of Skewness $SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \dfrac{3(13.467 - 14)}{2.446} = -0.654$.
      Note that, in this case, Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.
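The grouped-data skewness example can be re-run mechanically. The following is a minimal sketch in Python, not part of the original outline: the function name `grouped_skewness` and the dictionary keys are hypothetical, and the mode is taken as the midpoint of the group with the largest frequency, as in F.6.

```python
from math import sqrt

# Midpoints and frequencies of the profit-rate groups in the example.
midpoints = [10, 12, 14, 16, 18]
freqs = [3, 3, 5, 3, 1]


def grouped_skewness(x, f):
    """Mean, variance and three skewness measures for grouped data.

    Uses the computational formulas from the outline, with sum(f * ...)
    in place of the plain sums.
    """
    n = sum(f)
    sum_fx = sum(fi * xi for xi, fi in zip(x, f))
    sum_fx2 = sum(fi * xi ** 2 for xi, fi in zip(x, f))
    sum_fx3 = sum(fi * xi ** 3 for xi, fi in zip(x, f))

    mean = sum_fx / n
    variance = (sum_fx2 - n * mean ** 2) / (n - 1)
    sd = sqrt(variance)

    # Third k-statistic, computational form.
    k3 = n / ((n - 1) * (n - 2)) * (sum_fx3 - 3 * mean * sum_fx2 + 2 * n * mean ** 3)
    g1 = k3 / sd ** 3  # coefficient of relative skewness

    # Mode of grouped data: midpoint of the group with the largest frequency.
    mode = x[f.index(max(f))]
    pearson_sk = 3 * (mean - mode) / sd

    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "cv": sd / mean,
        "k3": k3,
        "g1": g1,
        "pearson_sk": pearson_sk,
    }


stats = grouped_skewness(midpoints, freqs)
for name, value in stats.items():
    print(f"{name:>10}: {value:.3f}")
```

Because the mean (about 13.47) lies below the modal midpoint (14), Pearson's measure comes out negative while $k_3$ and $g_1$ are positive, which is the disagreement in direction noted in the example.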
      The measures of kurtosis are, for populations,
      $\mu_4 = \dfrac{\sum (x - \mu)^4}{N} = \dfrac{1}{N}\left[\sum x^4 - 4\mu \sum x^3 + 6\mu^2 \sum x^2 - 3N\mu^4\right]$
      and, for samples,
      $k_4 = \dfrac{n^2}{(n-1)(n-2)(n-3)}\left[\dfrac{(n+1)\sum (x - \bar{x})^4}{n} - \dfrac{3(n-1)^3 s^4}{n^2}\right]$.
      $k_4$ can be considered an estimate of $\mu_4 - 3\sigma^4$. To get a measurement of shape use the coefficient of excess $\gamma_2 = \dfrac{\mu_4}{\sigma^4} - 3$ or $g_2 = \dfrac{k_4}{s^4}$. Since the Normal distribution has $\mu_4 = 3\sigma^4$, the coefficient of excess is zero for the Normal distribution. Kurtosis has traditionally been considered a measure of the peakedness of a distribution relative to the Normal distribution, though there are some exceptions to this interpretation. If the coefficient of excess is positive, we may call a distribution leptokurtic or sharp-peaked. If the coefficient of excess is negative, the distribution can be called platykurtic or flat-peaked. If the coefficient of excess is close to zero, we call the distribution mesokurtic, middle-peaked. A symmetric, mesokurtic distribution is essentially Normal.

      Example (using definitional formulas; $d = x - \bar{x}$):

      Profit Rate    f    x (midpoint)   fx      d        fd        fd^2      fd^3       fd^4
      9-10.99        3        10         30   -3.467   -10.400    36.053   -124.985   433.281
      11-12.99       3        12         36   -1.467    -4.400     6.453     -9.465    13.882
      13-14.99       5        14         70    0.533     2.667     1.422      0.759     0.405
      15-16.99       3        16         48    2.533     7.600    19.253     48.775   123.564
      17-18.99       1        18         18    4.533     4.533    20.551     93.164   422.348
      Total         15                  202              0.000    83.733      8.249   993.479

      So $n = \sum f = 15$, $\sum fx = 202$, $\sum f(x - \bar{x}) = 0$, $\sum f(x - \bar{x})^2 = 83.733$, $\sum f(x - \bar{x})^3 = 8.249$, $\sum f(x - \bar{x})^4 = 993.479$, so that $\bar{x} = \dfrac{\sum fx}{n} = \dfrac{202}{15} = 13.467$ and $s^2 = \dfrac{\sum f(x - \bar{x})^2}{n-1} = \dfrac{83.733}{14} = 5.981$, which means $s = \sqrt{5.981} = 2.446$. $C = \dfrac{s}{\bar{x}} = \dfrac{2.446}{13.467} = 0.182$.
      To measure skewness, use one of the following three results.
      $k_3 = \dfrac{n}{(n-1)(n-2)} \sum f(x - \bar{x})^3 = \dfrac{15(8.249)}{(14)(13)} = 0.680$,
      or Relative Skewness $g_1 = \dfrac{k_3}{s^3} = \dfrac{0.680}{(2.446)^3} = 0.046$,
      or Pearson's Measure of Skewness $SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \dfrac{3(13.467 - 14)}{2.446} = -0.654$.
      Note that, in this case, Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.
      For kurtosis,
      $k_4 = \dfrac{n^2}{(n-1)(n-2)(n-3)}\left[\dfrac{(n+1)\sum f(x - \bar{x})^4}{n} - \dfrac{3(n-1)^3 s^4}{n^2}\right] = \dfrac{15^2}{(14)(13)(12)}\left[\dfrac{16(993.479)}{15} - \dfrac{3(14)^3 (5.981)^2}{15^2}\right] = -25.659$.
      So $g_2 = \dfrac{k_4}{s^4} = \dfrac{-25.659}{(5.981)^2} = -0.717$. The negative sign implies that the distribution is platykurtic.
   5. Review
      a. Grouped Data.
      b. Ungrouped Data.

Appendix: Explanation of Sample Formulas (Not for student consumption)

   1. The Sample Variance.
      The Sample Variance is defined as $s^2 = \dfrac{1}{n-1}\sum (x - \bar{x})^2 = \dfrac{1}{n-1}\left[\sum x^2 - n\bar{x}^2\right]$. If $s^2$ has an expected value of $\sigma^2$, it must be true that $E\left[\sum (x - \bar{x})^2\right] = E\left[\sum x^2 - n\bar{x}^2\right] = (n-1)\sigma^2$.
      We can assume, without loss of generality, that $E(x) = 0$. Under these conditions, the Variance is defined as $\sigma^2 = E\left[x - E(x)\right]^2 = E\left(x^2\right)$. Thus $E\left(\sum x^2\right) = n\sigma^2$.
      An expression like $\bar{x}^2 = \left(\dfrac{1}{n}\sum x\right)^2$ has terms like $\dfrac{1}{n^2} x_1 x_2$. Because of the independence assumption on the sample, all these terms have expected values of zero except for terms with two identical subscripts, and $E\left(\bar{x}^2\right) = E\left[\dfrac{1}{n^2}\sum x^2\right] = \dfrac{1}{n^2}\, n\sigma^2 = \dfrac{\sigma^2}{n}$. Thus
      $E\left[\sum x^2 - n\bar{x}^2\right] = E\left(\sum x^2\right) - nE\left(\bar{x}^2\right) = n\sigma^2 - n\dfrac{\sigma^2}{n} = (n-1)\sigma^2$.
   2. The Third k Statistic.
      The third k statistic is $k_3 = \dfrac{n}{(n-1)(n-2)}\sum (x - \bar{x})^3 = \dfrac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right]$. If $k_3$ has an expected value of $\mu_3$, it must be true that $E\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right] = \dfrac{(n-1)(n-2)}{n}\mu_3$.
      We can assume, without loss of generality, that $E(x) = 0$. Under these conditions, the skewness is defined as $\mu_3 = E\left[x - E(x)\right]^3 = E\left(x^3\right)$. Thus $E\left(\sum x^3\right) = n\mu_3$.
      An expression like $\bar{x}^3 = \left(\dfrac{1}{n}\sum x\right)^3$ has terms like $\dfrac{1}{n^3} x_1 x_2 x_3$. Because of the independence assumption on the sample, all these terms have expected values of zero except for terms with three identical subscripts, and $E\left(\bar{x}^3\right) = E\left[\dfrac{1}{n^3}\sum x^3\right] = \dfrac{1}{n^3}\, n\mu_3 = \dfrac{\mu_3}{n^2}$. By the same reasoning $E\left(\bar{x}\sum x^2\right) = E\left[\dfrac{1}{n}\sum x^3\right] = \dfrac{1}{n}\, n\mu_3 = \mu_3$. Thus
      $E\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right] = E\left(\sum x^3\right) - 3E\left(\bar{x}\sum x^2\right) + 2nE\left(\bar{x}^3\right) = n\mu_3 - 3\mu_3 + \dfrac{2n}{n^2}\mu_3 = \dfrac{n^2 - 3n + 2}{n}\mu_3 = \dfrac{(n-1)(n-2)}{n}\mu_3$.
   3. And now, for considerable extra credit, what can you say about the expected value of $k_4$?
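Items 1 and 2 of the appendix can be checked exactly on a small case. The sketch below is an illustration and not part of the original notes: it assumes independent draws with replacement from a hypothetical three-value population (2, 3, 5), enumerates every equally likely sample of size 3, and averages $s^2$ and $k_3$ using exact rational arithmetic. All function names are made up for this example.

```python
from fractions import Fraction
from itertools import product


def sample_variance(xs):
    """s^2 with divisor n - 1 (definitional formula)."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def third_k_statistic(xs):
    """k3 = n / ((n - 1)(n - 2)) * sum of cubed deviations; needs n >= 3."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return Fraction(n, (n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in xs)


def central_moment(values, r):
    """Population r-th moment about the mean (divisor N)."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** r for v in values) / len(values)


def exact_expectation(statistic, values, n):
    """Average a statistic over all len(values)**n equally likely samples."""
    samples = list(product(values, repeat=n))
    return sum(statistic(s) for s in samples) / len(samples)


population = [2, 3, 5]  # hypothetical population, each value equally likely
n = 3

sigma2 = central_moment(population, 2)
mu3 = central_moment(population, 3)
mean_s2 = exact_expectation(sample_variance, population, n)
mean_k3 = exact_expectation(third_k_statistic, population, n)

print("sigma^2 =", sigma2, " E[s^2] =", mean_s2)
print("mu_3    =", mu3, " E[k3]  =", mean_k3)
```

Exact fractions are used instead of floating point so that the two expectations can be compared with `==` rather than with a tolerance. The enumeration grows as the population size raised to the power $n$, so this approach only suits very small cases.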