Graphical Descriptive Techniques

2.1 Introduction
Descriptive statistics involves the arrangement, summary, and presentation of data to enable meaningful interpretation and to support decision making.
Descriptive statistics methods make use of
• graphical techniques
• numerical descriptive measures.
The methods presented apply both to the entire population and to a sample drawn from it.

2.2 Types of data and information
A variable is a characteristic of a population or sample that is of interest to us, for example:
• Cereal choice
• Capital expenditure
• The waiting time for medical services
Data are the actual values of variables.
• Interval data are numerical observations.
• Nominal data are categorical observations.
• Ordinal data are ordered categorical observations.

Types of data - examples
Interval data
Age:          55, 42, ...
Income:       75000, 68000, ...
Weight gain:  +10, +5, ...

Nominal data
Person   Marital status
1        married
2        single
3        single
...      ...

Computer   Brand
1          IBM
2          Dell
3          IBM
...        ...

Types of data - examples
With nominal data, all we can do is calculate the proportion of data that falls into each category.

Brand    Count   Proportion
IBM      25      50%
Dell     11      22%
Compaq   8       16%
Other    6       12%
Total    50

Types of data – analysis
Knowing the type of data is necessary to properly select the technique to be used when analyzing data.
Type of analysis allowed for each type of data:
• Interval data – arithmetic calculations
• Nominal data – counting the number of observations in each category
• Ordinal data – computations based on an ordering process

Cross-Sectional/Time-Series Data
Cross-sectional data are collected at a certain point in time:
• Marketing survey (observe preferences by gender, age)
• Test scores in a statistics course
• Starting salaries of an MBA program's graduates
Time-series data are collected over successive points in time:
• Weekly closing price of gold
• Amount of crude oil imported monthly

2.3 Graphical Techniques for Interval Data
Example 2.1: Providing information concerning the monthly bills of new subscribers in the first month after signing on with a telephone company.
• Collect data
• Prepare a frequency distribution
• Draw a histogram

Example 2.1: Providing information
Collect data
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05, ... (there are 200 data points)
Prepare a frequency distribution
How many classes to use?
Number of observations   Number of classes
Less than 50             5-7
50 - 200                 7-9
200 - 500                9-10
500 - 1,000              10-11
1,000 - 5,000            11-13
5,000 - 50,000           13-17
More than 50,000         17-20

Class width = [Range] / [# of classes] = [Largest observation - Smallest observation] / [# of classes] = [119.63 - 0] / [8] = 14.95, rounded up to 15.

Example 2.1: Providing information
Draw a histogram
[Figure: histogram of the bills; horizontal axis Bills (0 to 120 in steps of 15), vertical axis Frequency (0 to 80)]

Bin   Frequency
15    71
30    37
45    13
60    9
75    10
90    18
105   28
120   14

Example 2.1: Providing information
What information can we extract from this histogram?
• About half of all the bills are small: 71+37=108
• A few bills are in the middle range: 13+9+10=32
• There is a relatively large number of large bills: 18+28+14=60

Relative frequency
It is often preferable to show the relative frequency (proportion) of observations falling into each class, rather than the frequency itself.
Class relative frequency = Class frequency / Total number of observations
Relative frequencies should be used when
• the population relative frequencies are studied
• comparing two or more histograms
• the numbers of observations in the samples studied are different.

Class width
It is generally best to use equal class widths, but sometimes unequal class widths are called for. Unequal class widths are used when the frequency associated with some classes is too low. Then several classes are combined to form a wider and “more populated” class. It is possible to form an open-ended class at the higher end or lower end of the histogram.

Shapes of histograms
There are four typical shape characteristics.
[Figure: a negatively skewed histogram and a positively skewed histogram]

Modal classes
A modal class is the one with the largest number of observations.
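The relative-frequency and modal-class definitions above can be sketched on the Example 2.1 frequency table. This is only an illustration: the variable names are made up for the sketch, and the class width of 15 is assumed from the bin limits shown in the table.

```python
# Frequency table from Example 2.1: upper class limit -> frequency.
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
class_width = 15  # assumption: equal-width classes, as the bin limits suggest

total = sum(frequencies)

# Class relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# The modal class is the class with the largest frequency.
modal_index = max(range(len(frequencies)), key=lambda i: frequencies[i])
modal_class = (upper_limits[modal_index] - class_width, upper_limits[modal_index])

for upper, f, r in zip(upper_limits, frequencies, relative):
    print(f"{upper - class_width:>3} to {upper:<3}  frequency={f:>2}  relative={r:.3f}")
print("modal class:", modal_class)
```

Because relative frequencies sum to 1 regardless of sample size, the same table can be compared with one built from a sample of a different size.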
[Figure: a unimodal histogram, with the modal class marked]

Modal classes
[Figure: a bimodal histogram, with the two modal classes marked]

Bell shaped histograms
• Many statistical techniques require that the population be bell shaped.
• Drawing the histogram helps verify the shape of the population in question.

Interpreting histograms
Example 2.2: Selecting an investment
An investor is considering investing in one of two investments. The returns on these investments were recorded. From the two histograms, how can the investor interpret
• the expected returns
• the spread of the returns (the risk involved with each investment)?

Example 2.2 - Histograms
[Figure: histograms of Return on investment A and Return on investment B; horizontal axis -15 to 75, vertical axis 0 to 18; the center of each histogram is marked]
Interpretation: The center of the returns of Investment A is slightly lower than that for Investment B.

Example 2.2 - Histograms
[Figure: the same two histograms; sample size = 50 for each investment]
Interpretation: The spread of returns for Investment A is less than that for Investment B.

Example 2.2 - Histograms
[Figure: the same two histograms]
Interpretation: Both histograms are slightly positively skewed. There is a possibility of large returns.

Providing information
Example 2.2: Conclusion
It seems that investment A is better, because:
• Its expected return is only slightly below that of investment B.
• The risk from investing in A is smaller.
• The possibility of having a high rate of return exists for both investments.

Interpreting histograms
Example 2.3: Comparing students’ performance
Students’ performance in two statistics classes was compared. The two classes differed in their teaching emphasis:
Class A – mathematical analysis and development of theory.
Class B – applications and computer based analysis.
The final mark for each student in each course was recorded. Draw histograms and interpret the results.

Interpreting histograms
[Figure: two histograms of final marks, Marks (Manual) and Marks (Computer); horizontal axis 50 to 100, vertical axis Frequency 0 to 40]
The mathematical emphasis creates two groups, and a larger spread.

2.5 Describing the Relationship Between Two Variables
We are interested in the relationship between two interval variables.
Example 2.7
A real estate agent wants to study the relationship between house price and house size. Twelve houses recently sold are sampled and their size and price recorded. Use a graphical technique to describe the relationship between size and price.

Size   Price
23     315
18     229
26     335
20     261
...    ...

2.5 Describing the Relationship Between Two Variables
Solution
The size (independent variable, X) affects the price (dependent variable, Y). We use Excel to create a scatter diagram.
[Figure: scatter diagram of price against size; X axis 0 to 40, Y axis 0 to 400]

Typical Patterns of Scatter Diagrams
[Figure: scatter diagram panels: Positive linear relationship; No relationship; Negative nonlinear relationship; Negative linear relationship; Nonlinear (concave) relationship]
This is a weak linear relationship. A nonlinear relationship seems to fit the data better.

2.6 Describing Time-Series Data
Data can be classified according to the time they are collected.
• Cross-sectional data are all collected at the same time.
• Time-series data are collected at successive points in time.
Time-series data are often depicted on a line chart (a plot of the variable over time).

Line Chart
Example 2.9
The total amounts of income tax paid by individuals in 1987 through 1999 are listed below.
Draw a graph of these data and describe the information produced.

Line Chart
[Figure: line chart of total income tax paid by individuals; horizontal axis years 87 to 99, vertical axis 0 to 1,200,000]
• For the first five years, total tax was relatively flat.
• From 1993 there was a rapid increase in tax revenues.
Line charts can be used to describe nominal data time series.

Numerical Descriptive Techniques

4.2 Measures of Central Location
Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread
The measure of central location reflects the locations of all the actual data points. How?
• With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left.

The Arithmetic Mean
This is the most popular and useful measure of central location.
Mean = Sum of the observations / Number of observations

The Arithmetic Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n is the sample size.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the population size.

The Arithmetic Mean
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
• Example 4.2
Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$

The Median
The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
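As a quick check of the two definitions above, the following sketch computes the mean and the median of the Example 4.1 data with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import median

# Hours on the Internet reported by the 10 adults of Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Mean = sum of the observations / number of observations.
sample_mean = sum(hours) / len(hours)

# Median = middle value of the sorted data; with an even number of
# observations, statistics.median averages the two middle values.
sample_median = median(hours)

print("sorted:", sorted(hours))
print("mean:", sample_mean)
print("median:", sample_median)
```

The single long time (33 hours) pulls the mean above the median, which is why the two measures differ for these data.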
Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the midpoint of the two middle observations, (8 + 9)/2 = 8.5.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle observation, 8.

The Mode
The mode of a set of observations is the value that occurs most frequently. A set of data may have one mode (or modal class), or two or more modes.
[Figure: a histogram with the modal class marked]
For large data sets the modal class is much more relevant than a single-value mode.

The Mode
Example 4.5
Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
Is this a good measure of central location? The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).

Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”), with the mode, median and mean marked in that order from left to right]
[Figure: a negatively skewed distribution (“skewed to the left”), with the mean, median and mode marked in that order from left to right]

The Geometric Mean
This is a measure of the average growth rate.
Let Ri denote the rate of return in period i (i = 1, 2, …, n). The geometric mean of the returns R1, R2, …, Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
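The definition above can be illustrated with a small sketch. The two-period return series is a hypothetical example chosen for illustration (it does not come from the slides), and `geometric_mean_return` is a made-up helper name. The helper solves (1 + Rg)^n = (1 + R1)(1 + R2)...(1 + Rn) for Rg, which is what "same terminal wealth" requires.

```python
import math

def geometric_mean_return(returns):
    """Constant per-period return giving the same terminal wealth."""
    growth = math.prod(1 + r for r in returns)  # terminal wealth per unit invested
    return growth ** (1 / len(returns)) - 1

# Hypothetical returns: +100% in period 1, then -50% in period 2.
returns = [1.00, -0.50]

arithmetic_mean = sum(returns) / len(returns)
geometric_mean = geometric_mean_return(returns)

print("arithmetic mean return:", arithmetic_mean)
print("geometric mean return:", geometric_mean)
```

In this hypothetical series the investment doubles and then halves, so the investor ends where they started. The geometric mean reflects that, while the arithmetic mean of the two returns does not.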
The Geometric Mean
For the given series of rates of return, the nth period return is calculated by:
$(1 + R_1)(1 + R_2)\cdots(1 + R_n)$
If the rate of return was Rg in every period, the nth period return would be calculated by:
$(1 + R_g)^n$
Rg is selected such that
$(1 + R_g)^n = (1 + R_1)(1 + R_2)\cdots(1 + R_n)$
so that
$R_g = \sqrt[n]{(1 + R_1)(1 + R_2)\cdots(1 + R_n)} - 1$

4.3 Measures of variability
Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered: How much are the observations spread out around the mean value?

4.3 Measures of variability
Observe two hypothetical data sets:
• Small variability: The average value provides a good representation of the observations in the data set.
• Larger variability: The same average value does not provide as good a representation of the observations in the data set as before.

The range
The range of a set of observations is the difference between the largest and smallest observations.
Range = Largest observation - Smallest observation
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points. But how do all the observations spread out? The range cannot assist in answering this question.

The Variance
This measure reflects the dispersion of all the observations.
The variance of a population of size N, x1, x2, …, xN, whose mean is $\mu$, is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample of n observations x1, x2, …, xn whose mean is $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Why not use the sum of deviations?
Consider two small populations:
A: 8, 9, 10, 11, 12. Deviations: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
B: 4, 7, 10, 13, 16. Deviations: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion? The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.

The Variance
Let us calculate the variance of the two populations:
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!

The Variance
Let us calculate the sum of squared deviations for both data sets.
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
[Figure: dot plots of data set A on the values 1, 2, 3 and data set B on the values 1, 3, 5]

The Variance
SumA = (1-2)^2 + … + (1-2)^2 + (3-2)^2 + … + (3-2)^2 = 10
SumB = (1-3)^2 + (5-3)^2 = 8
SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

The Variance
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked.
$\sigma_A^2$ = SumA/N = 10/5 = 2
$\sigma_B^2$ = SumB/N = 8/2 = 4

The Variance
Example 4.7
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17-14)^2 + (15-14)^2 + \dots + (13-14)^2\right] = 33.2$ jobs$^2$

Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Standard Deviation
Example 4.8
To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club. The distances were recorded.
Which 7-iron is more consistent?

Standard Deviation
Example 4.8 – solution
Excel printout, from the “Descriptive Statistics” submenu:

                     Current     Innovation
Mean                 150.5467    150.1467
Standard Error       0.668815    0.357011
Median               151         150
Mode                 150         149
Standard Deviation   5.792104    3.091808
Sample Variance      33.54847    9.559279
Kurtosis             0.12674     -0.88542
Skewness             -0.42989    0.177338
Range                28          12
Minimum              134         144
Maximum              162         156
Sum                  11291       11261
Count                75          75

The innovation club is more consistent (its standard deviation is smaller) and, because the means are close, is considered a better club.

Interpreting Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
• $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements.

Interpreting Standard Deviation
Example 4.9
A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.

Interpreting Standard Deviation
Example 4.9 – solution
The empirical rule can be applied (bell shaped histogram).
Describing the return distribution:
• Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

Chebysheff’s Theorem
The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!
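The intervals of Example 4.9 and the Chebysheff lower bound stated above can be reproduced with a few lines. This is a minimal sketch; the function names are made up for the illustration.

```python
def k_sigma_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)

def chebysheff_lower_bound(k):
    """At least this proportion lies within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebysheff's theorem is stated for k > 1")
    return 1 - 1 / k**2

# Example 4.9: mean return 10%, standard deviation 8%, bell shaped histogram.
mean_return, sd_return = 10, 8
empirical_share = {1: 0.68, 2: 0.95, 3: 0.997}  # empirical rule, approximate

for k, share in empirical_share.items():
    low, high = k_sigma_interval(mean_return, sd_return, k)
    print(f"k={k}: about {share:.1%} of returns between {low}% and {high}%")

# Chebysheff's bound needs no assumption about the shape of the distribution.
for k in (2, 3):
    print(f"k={k}: at least {chebysheff_lower_bound(k):.1%} within k standard deviations")
```

The empirical rule gives sharper statements but only for mound-shaped data. Chebysheff's bound is weaker but holds for any shape.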
k   Interval                          Chebysheff                       Empirical Rule
1   $(\bar{x} - s, \bar{x} + s)$      at least 0% $(1 - 1/1^2)$        approximately 68%
2   $(\bar{x} - 2s, \bar{x} + 2s)$    at least 75% $(1 - 1/2^2)$       approximately 95%
3   $(\bar{x} - 3s, \bar{x} + 3s)$    at least 88.9% $(1 - 1/3^2)$     approximately 99.7%

Chebysheff’s Theorem
Example 4.10
The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
Solution
• At least 75% of the salaries lie between $22,000 and $34,000 [28000 – 2(3000), 28000 + 2(3000)].
• At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 – 3(3000), 28000 + 3(3000)].

The Coefficient of Variation
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = s / \bar{x}$
Population coefficient of variation: $CV = \sigma / \mu$
This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

4.4 Measures of Relative Standing and Box Plots
Percentile
The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100 - p) percent of all the observations are greater than that value.
Example
Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

Quartiles
Commonly used percentiles:
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile

Quartiles
Example
Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Quartiles
Solution
Sort the observations: 2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
The first quartile (15 observations):
At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right hand side. The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

4.5 Measures of Linear Relationship
The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• Covariance - is there any pattern to the way two variables move together?
• Coefficient of correlation - how strong is the linear relationship between two variables?

Covariance
Population covariance: $COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
Sample covariance: $cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.

Covariance
Compare the following three sets. In each set $\bar{x} = 5$ and $\bar{y} = 20$.

Set 1
xi   yi   (x – $\bar{x}$)   (y – $\bar{y}$)   (x – $\bar{x}$)(y – $\bar{y}$)
2    13   -3                -7                21
6    20   1                 0                 0
7    27   2                 7                 14
cov(x,y) = 35/2 = 17.5

Set 2
xi   yi   (x – $\bar{x}$)   (y – $\bar{y}$)   (x – $\bar{x}$)(y – $\bar{y}$)
2    27   -3                7                 -21
6    20   1                 0                 0
7    13   2                 -7                -14
cov(x,y) = -35/2 = -17.5

Set 3
xi   yi
2    20
6    27
7    13
cov(x,y) = -7/2 = -3.5

Covariance
• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
• If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.

The coefficient of correlation
Population coefficient of correlation: $\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{cov(X, Y)}{s_x s_y}$
This coefficient answers the question: How strong is the association between X and Y?
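The sample covariance and sample coefficient of correlation defined above can be sketched directly from their formulas and applied to the three small data sets of the covariance comparison. The function and variable names are illustrative assumptions, not part of the original slides.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance: sum of (x - xbar)(y - ybar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample coefficient of correlation: r = cov(x, y) / (s_x * s_y)."""
    s_x = sqrt(sample_cov(xs, xs))  # cov(x, x) is the sample variance of x
    s_y = sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)

x = [2, 6, 7]
y_sets = {
    "set 1": [13, 20, 27],  # y rises with x
    "set 2": [27, 20, 13],  # y falls as x rises
    "set 3": [20, 27, 13],  # no clear pattern
}

for name, y in y_sets.items():
    print(f"{name}: cov = {sample_cov(x, y):.1f}, r = {sample_corr(x, y):.3f}")
```

Dividing the covariance by the two standard deviations removes the units, so r always lies between -1 and +1 and can be compared across data sets.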
The coefficient of correlation
[Figure: scale of $\rho$ or r from -1 to +1. +1: strong positive linear relationship, COV(X,Y) > 0; 0: no linear relationship, COV(X,Y) = 0; -1: strong negative linear relationship, COV(X,Y) < 0]

The coefficient of correlation
• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.

The coefficient of correlation and the covariance – Example 4.16
Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
Solution
We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y

The coefficient of correlation and the covariance – Example 4.16

Student   x       y       x^2         y^2     xy
1         599     9.6     358801      92.16   5750.4
2         689     8.8     474721      77.44   6063.2
3         584     7.4     341056      54.76   4321.6
4         631     10      398161      100     6310
...       ...     ...     ...         ...     ...
11        593     8.8     351649      77.44   5218.4
12        683     8       466489      64      5464
Total     7,587   106.4   4,817,755   957.2   67,559.2

cov(x,y) = (1/(12-1))[67,559.2 - (7587)(106.4)/12] = 26.16
s_x = {(1/(12-1))[4,817,755 - (7587)^2/12]}^.5 = 43.56
s_y = 1.12 (computed in the same way as s_x)
r = cov(x,y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362 (using the rounded values; the unrounded result is .5365)

Shortcut Formulas
$cov(x, y) = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$
$s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$

The coefficient of correlation and the covariance – Example 4.16 – Excel
Use the Covariance option in Data Analysis. If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.
Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix
Population values
        GPA     GMAT
GPA     1.15
GMAT    23.98   1739.52

Multiplying each entry by 12/(12-1) gives the sample values:
        GPA     GMAT
GPA     1.25
GMAT    26.16   1897.66

The coefficient of correlation and the covariance – Example 4.16 – Excel
Interpretation
The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.
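The hand calculation of Example 4.16 can be re-traced from the column totals with the shortcut formulas. This is only a sketch: the variable names are illustrative, and it uses the totals reported in the table rather than the twelve individual records, which are not all listed.

```python
from math import sqrt

# Column totals from the Example 4.16 table (x = GMAT, y = GPA, n = 12).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4_817_755, 957.2, 67_559.2

# Shortcut formulas: the sum of products or squares, minus a correction
# term, divided by n - 1.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x**2 / n) / (n - 1)
var_y = (sum_y2 - sum_y**2 / n) / (n - 1)

r = cov_xy / (sqrt(var_x) * sqrt(var_y))

print(f"cov(x, y) = {cov_xy:.2f}")
print(f"s_x = {sqrt(var_x):.2f}, s_y = {sqrt(var_y):.2f}")
print(f"r = {r:.4f}")
```

Carrying full precision through to the end gives r of about .5365. Rounding s_x and s_y first, as in the hand calculation, gives .5362.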