Basic Stat Handout

Review of Measures of Central Tendency, Dispersion & Association
• Graphical Excellence
• Measures of Central Tendency – Mean, Median, Mode
• Measures of Dispersion – Variance, Standard Deviation, Range
• Measures of Association – Covariance, Correlation Coefficient
• Relationship of basic stats to OLS

Graphical Excellence
• The graph presents large data sets concisely and coherently.
• The ideas and concepts to be delivered are clearly understood by the viewer.
• The graph encourages the viewer to compare variables.
• The display induces the viewer to address the substance of the data and not the form of the graph.
• There is no distortion of what the data reveal.

Things to be cautious about when observing a graph:
– Is there a missing scale on one axis?
– Do not be influenced by a graph's caption.
– Are changes presented in absolute values only, or in percent form too?

Numerical Descriptive Measures
• Measures of Central Tendency – Mean, Median, Mode
• Measures of Dispersion – Variance, Standard Deviation
• Measures of Association – Covariance, Correlation Coefficient

The arithmetic mean
– This is the most popular and useful measure of central location.
– Mean = (Sum of the measurements) / (Number of measurements)
– Sample mean, where $n$ is the sample size:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
– Population mean, where $N$ is the population size:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

• Example 1
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$$

• Example 2
Suppose the telephone bills of example 2.1 represent a population of measurements. The population mean is
$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 15.30 + \dots + 53.21}{200} = 43.59$$

• Example 3
When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees was recorded as follows:

NUMBER OF CHILDREN     0   1   2   3
NUMBER OF EMPLOYEES    3   4   7   2    (16 employees in total)

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$

The median
– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

Example 4
Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
– Odd number of observations. First, sort the salaries: 26, 26, 28, 29, 30, 32, 60. Then, locate the value in the middle: the median is 29.
Suppose one employee's salary of $31,000 was added to the group recorded before. Find the median salary.
– Even number of observations. First, sort the salaries: 26, 26, 28, 29, 30, 31, 32, 60. Then, locate the values in the middle. There are two middle values, 29 and 30, so the median is their average, 29.5.

The mode
– The mode of a set of measurements is the value that occurs most frequently.
– A set of data may have one mode (or modal class), or two or more modes.
– The modal class: for large data sets the modal class is much more relevant than a single-value mode.

– Example 5
• The manager of a men's store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
• The mode of this data set is 34 in. This information seems valuable (for example, for the design of a new display in the store), much more than "the mean is 33.2 in.".

Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.
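
The measures of central tendency in Examples 1, 3, 4 and 5 above can be reproduced in a few lines. The following is a minimal Python sketch using the standard-library `statistics` module; the variable names are illustrative and are not part of the handout.

```python
from statistics import mean, median, mode

# Example 1: sample of six measurements.
sample = [7, 3, 9, -2, 4, 6]
sample_mean = mean(sample)

# Example 3: frequency table {number of children: number of employees}.
children_freq = {0: 3, 1: 4, 2: 7, 3: 2}
n_employees = sum(children_freq.values())
# Weighted sum of the values divided by the number of employees.
freq_mean = sum(value * count for value, count in children_freq.items()) / n_employees

# Example 4: salaries in 1000s, before and after adding the 31 salary.
salaries = [28, 60, 26, 32, 30, 26, 29]
median_odd = median(salaries)           # odd count: the single middle value
median_even = median(salaries + [31])   # even count: average of the two middle values

# Example 5: waist sizes in inches.
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
waist_mode = mode(waists)

print("Example 1 mean:", sample_mean)
print("Example 3 mean:", freq_mean)
print("Example 4 medians:", median_odd, median_even)
print("Example 5 mode:", waist_mode)
```

Note that `median` sorts the data internally, so the salaries do not have to be sorted by hand first.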
[Figure: a positively skewed distribution ("skewed to the right"), with the mode, median and mean marked.]
[Figure: a negatively skewed distribution ("skewed to the left"), with the mean, median and mode marked.]

Measures of variability (Looking beyond the average)
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered: How typical is the average value of all the measurements in the data set? Or, how spread out are the measurements about the average value?

Observe two hypothetical data sets
– Low variability data set: the average value provides a good representation of the values in the data set.
– High variability data set: the same average value does not provide as good a representation of the values in the data set as before.

The range
– The range of a set of measurements is the difference between the largest and smallest measurements: Range = Largest measurement - Smallest measurement.
– Its major advantage is the ease with which it can be computed.
– Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points. But how are all the measurements spread out? The range cannot assist in answering this question.

The variance
– This measure of dispersion reflects the values of all the measurements.
– The variance of a population of $N$ measurements $x_1, x_2, \dots, x_N$ having a mean $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
– The variance of a sample of $n$ measurements $x_1, x_2, \dots, x_n$ having a mean $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are much more dispersed than those in A. Thus, a measure of dispersion is needed that agrees with this observation.
Let us start by calculating the sum of deviations.
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero in both cases; therefore, another measure is needed.
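
The comparison of populations A and B can be checked numerically. The following is a minimal Python sketch, assuming the population variance definition given above (average squared deviation from the mean); the helper names `deviations`, `population_variance` and `value_range` are illustrative, not part of the handout.

```python
population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

def deviations(values):
    """Deviation of each measurement from the mean of the data set."""
    mu = sum(values) / len(values)
    return [x - mu for x in values]

def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    return sum(d * d for d in deviations(values)) / len(values)

def value_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

for name, pop in (("A", population_a), ("B", population_b)):
    print(
        f"Population {name}: "
        f"sum of deviations = {sum(deviations(pop))}, "
        f"range = {value_range(pop)}, "
        f"variance = {population_variance(pop)}"
    )
```

The sum of deviations is zero for both populations, so it cannot distinguish them, while the average squared deviation is nine times larger for B than for A.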
The sum of deviations is zero for both populations A and B (each has a mean of 10), although the measurements in B are much more dispersed than those in A. The sum of squared deviations is therefore used in calculating the variance.

Let us calculate the variance of the two populations:
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!

Which data set has a larger dispersion?
– Data set A consists of the value 1 five times and the value 3 five times (mean 2). Data set B consists of the two values 1 and 5 (mean 3). Data set B is more dispersed around the mean.
– Let us calculate the sum of squared deviations for both data sets:
$$\text{Sum}_A = \underbrace{(1-2)^2 + \dots + (1-2)^2}_{5 \text{ times}} + \underbrace{(3-2)^2 + \dots + (3-2)^2}_{5 \text{ times}} = 10$$
$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$
– The sum of squared deviations is larger for A only because A has more observations. However, when calculated on a "per observation" basis (variance), the data set dispersions are properly ranked:
$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$
$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$

– Example 6
• Find the mean and the variance of the following sample of measurements (in years): 3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$$
A shortcut formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
$$s^2 = \frac{1}{5}\left[\left(3.4^2 + 2.5^2 + \dots + 3.7^2\right) - \frac{(17.7)^2}{6}\right] = 1.075 \text{ (years)}^2$$

The standard deviation
– The standard deviation of a set of measurements is the square root of the variance of the measurements.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

– Example 4.9
• Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

– Solution
– Let us use the Excel printout that is run from the "Descriptive statistics" sub-menu (use file Xm0410).

                      Fund A    Fund B
Mean                  16        12
Standard Error        5.295     3.152
Median                14.6      11.75
Mode                  #N/A      #N/A
Standard Deviation    16.74     9.969
Sample Variance       280.3     99.37
Kurtosis              -1.34     -0.46
Skewness              0.217     0.107
Range                 49.1      30.6
Minimum               -6.2      -2.8
Maximum               42.9      27.8
Sum                   160       120
Count                 10        10

Fund A should be considered riskier because its standard deviation is larger.

The coefficient of variation
– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
– This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a distribution.

Measures of Association
• Two numerical measures are presented for the description of the linear relationship between two variables depicted in a scatter diagram.
– Covariance: is there any pattern to the way two variables move together?
– Correlation coefficient: how strong is the linear relationship between two variables?

The covariance
Population covariance:
$$COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
where $\mu_x$ ($\mu_y$) is the population mean of the variable X (Y) and $N$ is the population size.
Sample covariance:
$$cov(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
where $\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y) and $n$ is the sample size.

• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
• If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.

The coefficient of correlation
Population coefficient of correlation:
$$\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$$
Sample coefficient of correlation:
$$r = \frac{cov(X, Y)}{s_x s_y}$$
– This coefficient answers the question: how strong is the association between X and Y?

[Figure: scale of $\rho$ (or $r$). +1: strong positive linear relationship, COV(X,Y) > 0. 0: no linear relationship, COV(X,Y) = 0. -1: strong negative linear relationship, COV(X,Y) < 0.]

• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.

– Example 7
• Compute the covariance and the coefficient of correlation to measure how advertising expenditure and sales level are related to one another.
• Base your calculation on the data provided in example 2.3.

Shortcut formulas:
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$$
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

• Use the procedure below to obtain the required summations (x = advertising, y = sales).

Month     x      y      xy     x^2    y^2
1         1      30     30     1      900
2         3      40     120    9      1600
3         5      40     200    25     1600
4         4      50     200    16     2500
5         2      35     70     4      1225
6         5      50     250    25     2500
7         3      35     105    9      1225
8         2      25     50     4      625
Sum       25     305    1025   93     12175

$$s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{7}\left[93 - \frac{25^2}{8}\right] = 2.125$$
$$s_x = \sqrt{2.125} = 1.458$$
Similarly, $s_y = 8.839$.
$$cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] = \frac{1}{7}\left[1025 - \frac{25 \times 305}{8}\right] = 10.268$$
$$r = \frac{cov(X, Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = 0.797$$

• Excel printout

Covariance matrix
              Advertsmnt    Sales
Advertsmnt    2.125
Sales         10.2679       78.125

Correlation matrix
              Advertsmnt    Sales
Advertsmnt    1
Sales         0.7969        1

• Interpretation
– The covariance (10.2679) indicates that advertisement expenditure and sales level are positively related.
– The coefficient of correlation (.797) indicates that there is a strong positive linear relationship between advertisement expenditure and sales level.

• The Least Squares Method
– We are seeking a line that best fits the data.
– We define the "best fit line" as a line for which the sum of squared differences between it and the data points is minimized:
$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual y value of point i and $\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i calculated from the equation of the line.

[Figure: scatter diagram of Y against X with candidate lines and the errors between the points and each line.]
Different lines generate different errors, and thus different sums of squares of errors.

The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squares of errors are calculated from the data:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
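
The covariance, correlation and least squares formulas above can be tied together numerically. The following is a minimal Python sketch, assuming the eight (advertising, sales) pairs of Example 7; the function names are illustrative and not part of the handout, and the fitted line is computed here as an illustration rather than taken from the slides.

```python
from math import sqrt

# Example 7 data: advertising expenditure (x) and sales level (y) for 8 months.
advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]

def mean(values):
    return sum(values) / len(values)

def sum_cross_deviations(xs, ys):
    """Sum over i of (x_i - x_bar) * (y_i - y_bar)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def sample_covariance(xs, ys):
    """Sample covariance: cross deviations divided by n - 1."""
    return sum_cross_deviations(xs, ys) / (len(xs) - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(sample_covariance(xs, xs))

def correlation(xs, ys):
    """Sample coefficient of correlation r = cov(X, Y) / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

def least_squares(xs, ys):
    """Intercept b0 and slope b1 of the line minimizing the sum of squared errors."""
    b1 = sum_cross_deviations(xs, ys) / sum_cross_deviations(xs, xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

cov_xy = sample_covariance(advert, sales)
r_xy = correlation(advert, sales)
b0, b1 = least_squares(advert, sales)

print(f"cov(X, Y) = {cov_xy:.3f}")
print(f"r         = {r_xy:.3f}")
print(f"y_hat     = {b0:.3f} + {b1:.3f} x")
```

Because the factor n - 1 cancels between numerator and denominator, the slope can also be written as b1 = cov(X, Y) / s_x^2, which is how the basic statistics above relate to OLS. For these data the sketch gives a slope of about 4.83 and an intercept of about 23.03.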