Economics 173
Business Statistics
Lectures 1 & 2
Summer, 2001
Professor J. Petry

Introduction
• The purpose of statistics is to pull information out of data
  – “without data, ours is just another opinion”
  – “without statistics, we are just another person on data overload”
• Because of its broad usage across disciplines, statistics is probably the most useful course irrespective of major.
  – More data, properly analyzed, allows for better decisions in personal as well as professional lives
  – Applicable in nearly all areas of business as well as the social sciences
  – Greatly enhances credibility

Statistics as “Tool Chest”
• Different types of data allow different types of analysis
• Quantitative data – values are real numbers; arithmetic calculations are valid
• Qualitative data – categorical data; values are arbitrary names of possible categories; calculations involve how many observations fall in each category
• Ranked data – categorical data; values must represent the ranked order of responses; calculations are based on an ordering process
• Time series data – data collected across different points in time
• Cross-sectional data – data collected at a certain point in time

Statistics as “Tool Chest”
• Different objectives call for alternative tool usage
  – Describe a single population
  – Compare two populations
  – Compare two or more populations
  – Analyze relationship between two variables
  – Analyze relationship among two or more variables
• By the conclusion of Econ 172 & 173, you will have about 35 separate tools to select from, depending upon your data type and objective

Problem objective?
• Describe a single population
• Compare two populations
• Compare two or more populations
• Analyze relationships between two variables
• Analyze relationships among two or more variables

Describe a single population
Data type?
• Quantitative – type of descriptive measurement?
  – Central location: t-test and estimator of $\mu$
  – Variability: $\chi^2$-test and estimator of $\sigma^2$
• Qualitative – number of categories?
  – Two: z-test and estimator of $p$
  – Two or more: $\chi^2$ goodness-of-fit test

Compare two populations
Data type?
• Quantitative – type of descriptive measurement?
  – Central location: experimental design? (continued in the next chart)
  – Variability: F-test and estimator of $\sigma_1^2/\sigma_2^2$
• Ranked – experimental design?
  – Independent samples: Wilcoxon rank sum test
  – Matched pairs: sign test
• Qualitative – number of categories?
  – Two: z-test and estimator of $p_1 - p_2$
  – Two or more: $\chi^2$-test of a contingency table

Compare two populations (continued): quantitative data, central location
Experimental design?
• Independent samples – population distribution?
  – Normal – population variances?
      Equal: t-test and estimator of $\mu_1 - \mu_2$ (equal variances)
      Unequal: t-test and estimator of $\mu_1 - \mu_2$ (unequal variances)
  – Nonnormal: Wilcoxon rank sum test
• Matched pairs – distribution of differences?
  – Normal: t-test and estimator of $\mu_D$
  – Nonnormal: Wilcoxon signed rank sum test

Compare two or more populations
Data type?
• Quantitative – experimental design?
  – Independent samples – population distribution?
      Normal: ANOVA (independent samples)
      Nonnormal: Kruskal-Wallis test
  – Blocks – population distribution?
      Normal: ANOVA (randomized blocks)
      Nonnormal: Friedman test
• Ranked – experimental design?
  – Independent samples: Kruskal-Wallis test
  – Blocks: Friedman test
• Qualitative: $\chi^2$-test of a contingency table

Analyze relationship between two variables
Data type?
• Quantitative – population distribution?
  – Error is normal, or x and y are bivariate normal: simple linear regression and correlation
  – x and y are not bivariate normal: Spearman rank correlation
• Ranked: Spearman rank correlation
• Qualitative: $\chi^2$-test of a contingency table

Analyze relationship among two or more variables
Data type?
• Quantitative: multiple regression
• Qualitative: not covered
• Ranked: not covered

Numerical Descriptive Measures
• Measures of central location
  – arithmetic mean, median, mode, (geometric mean)
• Measures of variability
  – range, variance, standard deviation, coefficient of variation
• Measures of association
  – covariance, coefficient of correlation

Measures of Central Location
Arithmetic mean
– This is the most popular and useful measure of central location

  Mean = (Sum of the measurements) / (Number of measurements)

  Sample mean (n is the sample size): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
  Population mean (N is the population size): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

• Example
  The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by
  $\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = \frac{27}{6} = 4.5$
• Example
  Calculate the mean of 212, -46, 52, -14, 66.
  $\bar{x} = \frac{212 - 46 + 52 - 14 + 66}{5} = \frac{270}{5} = 54$

The median
– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.
• Example 4.4
  Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
  – Odd number of observations: first, sort the salaries; then, locate the value in the middle.
    26, 26, 28, 29, 30, 32, 60 gives a median of 29.
  Suppose one employee's salary of $31,000 was added to the group recorded before. Find the median salary.
  – Even number of observations: first, sort the salaries; then, locate the values in the middle. There are two middle values!
    26, 26, 28, 29, 30, 31, 32, 60 gives a median of (29 + 30)/2 = 29.5.

The mode
– The mode of a set of measurements is the value that occurs most frequently.
– A set of data may have one mode (or modal class), or two or more modes.
[Figure: histogram with the modal class marked]
– Example
  The manager of a men's store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
  • What is the modal value? 34
  This information seems valuable (for example, for the design of a new display in the store), much more than “the mean is 33.2 in.”
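The three measures of central location can be checked with a few lines of Python. This is only a sketch using the standard-library `statistics` module on the example data above; the variable names are illustrative and not part of the lecture.

```python
from statistics import mean, median, multimode

# Data taken from the examples above
sample = [7, 3, 9, -2, 4, 6]                          # six measurements
salaries = [28, 60, 26, 32, 30, 26, 29]               # in 1000s
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]     # in inches

sample_mean = mean(sample)              # sum of measurements / number of measurements
median_odd = median(salaries)           # odd count: the single middle value
median_even = median(salaries + [31])   # even count: average of the two middle values
modes = multimode(waists)               # list of the most frequent value(s)

print(sample_mean, median_odd, median_even, modes)
```

`multimode` returns a list because a data set may have more than one mode, which matches the remark above about two or more modes.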
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide
• If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”) with the mode, median, and mean marked]
[Figure: a negatively skewed distribution (“skewed to the left”) with the mean, median, and mode marked]

• Example
  A professor of statistics wants to report the results of a midterm exam, taken by 100 students. He calculates the mean, median, and mode using Excel. Describe the information Excel provides.

  Excel results

  Marks
  Mean                  73.98
  Standard Error        2.1502163
  Median                81
  Mode                  84
  Standard Deviation    21.502163
  Sample Variance       462.34303
  Kurtosis              0.3936606
  Skewness              -1.073098
  Range                 89
  Minimum               11
  Maximum               100
  Sum                   7398
  Count                 100

  – The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.
  – The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%.
  – The mode must be used when data is qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

Measures of variability (Looking beyond the average)
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered: How typical is the average value of all the measurements in the data set? Or: How spread out are the measurements about the average value?

Observe two hypothetical data sets
• Low variability data set: the average value provides a good representation of the values in the data set.
• High variability data set: the same average value does not provide as good a representation of the values in the data set as before.

The range
– The range of a set of measurements is the difference between the largest and smallest measurements.
– Its major advantage is the ease with which it can be computed.
– Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.
[Figure: the range, from the smallest measurement to the largest measurement. But how do all the measurements spread out? The range cannot assist in answering this question.]

The variance
– This measure of dispersion reflects the values of all the measurements.
– The variance of a population of N measurements $x_1, x_2, \ldots, x_N$ having a mean $\mu$ is defined as
  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
– The variance of a sample of n measurements $x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is defined as
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Consider two small populations:
  Population A: 8, 9, 10, 11, 12
  Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are much more dispersed than those in A. Thus, a measure of dispersion is needed that agrees with this observation.
Let us start by calculating the sum of deviations.

  A:  8 - 10 = -2,  9 - 10 = -1,  10 - 10 = 0,  11 - 10 = +1,  12 - 10 = +2    Sum = 0
  B:  4 - 10 = -6,  7 - 10 = -3,  10 - 10 = 0,  13 - 10 = +3,  16 - 10 = +6    Sum = 0

The sum of deviations is zero in both cases; therefore, another measure is needed. The sum of squared deviations is used in calculating the variance.
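As a quick numerical check of populations A and B, the sketch below computes the deviations from the mean, their sum, and the sum of their squares. The helper name `deviations` is an illustrative choice, not something from the lecture.

```python
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def deviations(values):
    """Return each value minus the mean of the list."""
    mu = sum(values) / len(values)
    return [x - mu for x in values]

dev_a = deviations(pop_a)
dev_b = deviations(pop_b)

# Positive and negative deviations cancel, so the plain sum is zero for both
sum_dev_a = sum(dev_a)
sum_dev_b = sum(dev_b)

# Squaring removes the sign, so the larger spread of B shows up
sum_sq_a = sum(d * d for d in dev_a)
sum_sq_b = sum(d * d for d in dev_b)

print(sum_dev_a, sum_dev_b, sum_sq_a, sum_sq_b)
```

The plain sums cannot distinguish the two populations, while the sums of squared deviations can, which is why the variance is built on squared deviations.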
Let us calculate the variance of the two populations

  $\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$

  $\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!

Which data set has a larger dispersion?
  Data set A: the value 1 five times and the value 3 five times (mean 2)
  Data set B: the values 1 and 5 (mean 3)
Data set B is more dispersed around the mean. Let us calculate the sum of squared deviations for both data sets:

  $Sum_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$   (each term appears 5 times)
  $Sum_B = (1-3)^2 + (5-3)^2 = 8$

The sum of squared deviations is larger for A, even though B is more dispersed. However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:

  $\sigma_A^2 = Sum_A / N = 10/10 = 1$
  $\sigma_B^2 = Sum_B / N = 8/2 = 4$

– Example
  Find the mean and the variance of the following sample of measurements (in years): 3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution
  $\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$

  A shortcut formula:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
  $= \frac{1}{5}\left[3.4^2 + 2.5^2 + \cdots + 3.7^2 - \frac{(17.7)^2}{6}\right] = 1.075$ (years$^2$)

The standard deviation
– The standard deviation of a set of measurements is the square root of the variance of the measurements.
  Sample standard deviation: $s = \sqrt{s^2}$
  Population standard deviation: $\sigma = \sqrt{\sigma^2}$
– Example
  Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
  Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
  Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
– Solution
  Let's use the Excel printout that is run from the “Descriptive statistics” sub-menu.

                        Fund A    Fund B
  Mean                  16        12
  Standard Error        5.295     3.152
  Median                14.6      11.75
  Mode                  #N/A      #N/A
  Standard Deviation    16.74     9.969
  Sample Variance       280.3     99.37
  Kurtosis              -1.34     -0.46
  Skewness              0.217     0.107
  Range                 49.1      30.6
  Minimum               -6.2      -2.8
  Maximum               42.9      27.8
  Sum                   160       120
  Count                 10        10

  Fund A should be considered riskier because its standard deviation is larger.

The coefficient of variation
– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
  Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
  Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
– This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

Interpreting Standard Deviation
• The standard deviation can be used to
  – compare the variability of several distributions
  – make a statement about the general shape of a distribution.
• The empirical rule: If a sample of measurements has a mound-shaped distribution, the interval
  – $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
  – $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
  – $(\bar{x} - 3s, \bar{x} + 3s)$ contains virtually all of the measurements

– Example
  The durations of 30 long-distance telephone calls are shown next. Check the empirical rule for this set of measurements.
• Solution
  First check if the histogram has an approximate mound shape.
  [Figure: histogram of call durations; bins 2, 5, 8, 11, 14, 17, 20, More; frequencies from 0 to 10]
• Calculate the mean and the standard deviation: Mean = 10.26; Standard deviation = 4.29.
• Calculate the intervals:
  $(\bar{x} - s, \bar{x} + s) = (10.26 - 4.29, 10.26 + 4.29) = (5.97, 14.55)$
  $(\bar{x} - 2s, \bar{x} + 2s) = (1.68, 18.84)$
  $(\bar{x} - 3s, \bar{x} + 3s) = (-2.61, 23.13)$

  Interval          Empirical rule    Actual percentage
  5.97, 14.55       68%               70%
  1.68, 18.84       95%               96.7%
  -2.61, 23.13      100%              100%

Measures of Association
• Two numerical measures are presented for the description of the linear relationship between two variables depicted in the scatter diagram.
  – Covariance: is there any pattern to the way two variables move together?
  – Correlation coefficient: how strong is the linear relationship between two variables?

The covariance
  Population covariance: $COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
    $\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
  Sample covariance: $cov(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
    $\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.

• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
• If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.

The coefficient of correlation
  Population coefficient of correlation: $\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$
  Sample coefficient of correlation: $r = \frac{cov(X, Y)}{s_x s_y}$
– This coefficient answers the question: How strong is the association between X and Y?

  $\rho$ or $r$ =
    +1: strong positive linear relationship, COV(X,Y) > 0
     0: no linear relationship, COV(X,Y) = 0
    -1: strong negative linear relationship, COV(X,Y) < 0

• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight-line relationship is indicated by a coefficient close to zero.
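The sign and size of these two measures can be seen on a small made-up data set. The sketch below implements the sample covariance and the sample coefficient of correlation directly from their definitions; the function names and the toy data are assumptions for illustration only and do not come from the lecture.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample coefficient of correlation: cov(X, Y) / (s_x * s_y)."""
    s_x = sqrt(sample_cov(xs, xs))   # cov(X, X) is the sample variance of X
    s_y = sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)

# Hypothetical toy data: y moves exactly with x, or exactly against it
x = [1, 2, 3, 4, 5]
same_direction = [10, 20, 30, 40, 50]
opposite_direction = [50, 40, 30, 20, 10]

cov_pos = sample_cov(x, same_direction)      # positive
cov_neg = sample_cov(x, opposite_direction)  # negative
r_pos = sample_corr(x, same_direction)       # close to +1
r_neg = sample_corr(x, opposite_direction)   # close to -1

print(cov_pos, cov_neg, r_pos, r_neg)
```

The covariance depends on the units of the data, while the coefficient of correlation is rescaled to lie between -1 and +1, which is why it is the measure used to judge strength.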
– Example
  Compute the covariance and the coefficient of correlation to measure how advertising expenditure and sales level are related to one another.

  Advert    1   3   5   4   2   5   3   2
  Sales    30  40  40  50  35  50  35  25

  Shortcut formulas:
  $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$
  $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$

• Use the procedure below to obtain the required summations

  Month    x     y      xy     x^2    y^2
  1        1     30     30     1      900
  2        3     40     120    9      1600
  3        5     40     200    25     1600
  4        4     50     200    16     2500
  5        2     35     70     4      1225
  6        5     50     250    25     2500
  7        3     35     105    9      1225
  8        2     25     50     4      625
  Sum      25    305    1025   93     12175

  $s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{7}\left[93 - \frac{25^2}{8}\right] = 2.125$

  $s_x = \sqrt{2.125} = 1.458$

  Similarly, $s_y = 8.839$

  $cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] = \frac{1}{7}\left[1025 - \frac{25 \times 305}{8}\right] = 10.268$

  $r = \frac{cov(X, Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = .797$

• Excel printout

  Covariance matrix
                Advertsmnt    Sales
  Advertsmnt    2.125
  Sales         10.2679       78.125

  Correlation matrix
                Advertsmnt    Sales
  Advertsmnt    1
  Sales         0.7969        1

• Interpretation
  – The covariance (10.2679) indicates that advertisement expenditure and sales level are positively related
  – The coefficient of correlation (.797) indicates that there is a strong positive linear relationship between advertisement expenditure and sales level.
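The figures in the Excel printout (2.125, 10.2679, 78.125, and 0.7969) can be reproduced with the shortcut formulas. The following is a minimal Python sketch using only the advertising and sales data above; the variable names are illustrative.

```python
from math import sqrt

advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]
n = len(advert)

# Column sums from the summation table
sum_x = sum(advert)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(advert, sales))
sum_x2 = sum(x * x for x in advert)
sum_y2 = sum(y * y for y in sales)

# Shortcut formulas for the sample variances and the sample covariance
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Sample coefficient of correlation
r = cov_xy / (sqrt(var_x) * sqrt(var_y))

print(round(var_x, 3), round(cov_xy, 4), round(var_y, 3), round(r, 4))
```

The shortcut formulas need only the five column sums, so the whole calculation can be done in a single pass over the data.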