Part I: Introduction to Data Analysis

"The true foundation of theology is to ascertain the character of God. It is by the aid of Statistics that law in the social sphere can be ascertained and codified, and certain aspects of the character of God thereby revealed. The study of statistics is thus a religious service."
Florence Nightingale (1820-1910)

Statistical Thinking is understanding variation and how to deal with it. In this course we explore methods for moving as far as possible to the right on this continuum:

Ignorance --> Uncertainty --> Risk --> Certainty

Types of Data

Categorical vs. Numerical
Discrete vs. Continuous

Nominal Data are the weakest type of measurement for statistical methods. They can be numbers, but really are just names or labels (not quantities). Same as Categorical.

Ordinal Data are numbers that, by their size, rank or order observations on some basis. The intervals between these numbers, and their ratios, are meaningless.

Interval Data also rank observations according to some dimension, but the interval or distance between observations has a constant meaning. Readings on the Fahrenheit temperature scale are examples of interval data; the zero point is somewhat arbitrary, but a difference of, say, ten degrees means the same thing everywhere on the scale. We can do addition and subtraction with interval data, but not multiplication or division.

Ratio Data are the most useful type for statistical analysis. Ratio data are numbers which by their size rank observations in order of importance and between which intervals as well as ratios are meaningful. All types of arithmetic operations can be performed with ratio data.

Example 1: 1998 New York Yankees Roster

| No. | Last | First | Position | Bats | Throws | Ht. | Wt. | Born |
|---|---|---|---|---|---|---|---|---|
| 2 | Jeter | Derek | Infield | R | R | 6-3 | 185 | 6/26/74 |
| 11 | Knoblauch | Chuck | Infield | R | R | 5-9 | 170 | 7/7/68 |
| 14 | Irabu | Hideki | Pitcher | R | R | 6-4 | 240 | 5/5/69 |
| 18 | Brosius | Scott | Infield | R | R | 6-1 | 202 | 8/15/66 |
| 19 | Sojo | Luis | Infield | R | R | 5-11 | 175 | 1/3/66 |
| 20 | Posada | Jorge | Catcher | S | R | 6-2 | 205 | 8/17/71 |
| 20 | Davis | X-Chili | Outfield | S | R | 6-3 | 220 | 1/17/60 |
| 21 | O'Neill | Paul | Outfield | L | L | 6-4 | 215 | 2/25/63 |
| 22 | Bush | Homer | Infield | R | R | 5-10 | 175 | 11/12/72 |
| 24 | Martinez | Tino | Infield | L | R | 6-2 | 210 | 12/7/67 |
| 25 | Girardi | Joe | Catcher | R | R | 5-11 | 195 | 10/14/64 |
| 26 | Hernandez | Orlando | Pitcher | R | R | 6-2 | 190 | 10/11/69 |
| 26 | Spencer | Shane | Outfield | R | R | 5-11 | 210 | 2/20/72 |
| 27 | Lloyd | Graeme | Pitcher | L | L | 6-7 | 234 | 4/9/67 |
| 28 | Curtis | Chad | Outfield | R | R | 5-10 | 185 | 11/6/68 |
| 29 | Stanton | Mike | Pitcher | L | L | 6-1 | 215 | 6/2/67 |
| 31 | Raines | Tim | Outfield | S | R | 5-8 | 186 | 9/16/59 |
| 33 | Wells | David | Pitcher | L | L | 6-4 | 225 | 5/20/63 |
| 36 | Cone | David | Pitcher | L | R | 6-1 | 190 | 1/2/63 |
| 39 | Strawberry | Darryl | Outfield | L | L | 6-6 | 215 | 3/12/62 |
| 40 | Holmes | X-Darren | Pitcher | R | R | 6-0 | 202 | 4/25/66 |
| 42 | Rivera | Mariano | Pitcher | R | R | 6-2 | 168 | 11/29/69 |
| 43 | Nelson | X-Jeff | Pitcher | R | R | 6-8 | 235 | 11/17/66 |
| 46 | Pettitte | Andy | Pitcher | L | L | 6-5 | 235 | 6/15/72 |
| 51 | Williams | Bernie | Outfield | S | R | 6-2 | 205 | 9/13/68 |
| 54 | Borowski | Joe | Pitcher | R | R | 6-2 | 225 | 5/4/71 |
| 55 | Mendoza | Ramiro | Pitcher | R | R | 6-2 | 154 | 6/15/72 |
| 58 | Jerzembeck | Mike | Pitcher | R | R | 6-1 | 185 | 5/18/72 |

Example 2: The World Competitiveness Scoreboard

Source: IMD - International Institute for Management Development, Lausanne, Switzerland (http://www.imd.ch/). Entries are country rankings by year; blank cells have no value in the source.

| Country | 2000 | 1999 | 1998 | 1997 | 1996 | 1995 |
|---|---|---|---|---|---|---|
| USA | 1 | 1 | 1 | 1 | 1 | 1 |
| Singapore | 2 | 2 | 2 | 2 | 2 | 2 |
| Finland | 3 | 3 | 5 | 4 | 15 | 18 |
| Netherlands | 4 | 5 | 4 | 6 | 7 | 8 |
| Switzerland | 5 | 6 | 7 | 7 | 9 | 5 |
| Luxembourg | 6 | 4 | 9 | 12 | 8 | |
| Ireland | 7 | 11 | 11 | 15 | 22 | 22 |
| Germany | 8 | 9 | 14 | 14 | 10 | 6 |
| Sweden | 9 | 14 | 17 | 16 | 14 | 12 |
| Iceland | 10 | 17 | 19 | 21 | 25 | 25 |
| Canada | 11 | 10 | 10 | 10 | 12 | 13 |
| Denmark | 12 | 8 | 8 | 8 | 5 | 7 |
| Australia | 13 | 12 | 15 | 18 | 21 | 16 |
| Hong Kong | 14 | 7 | 3 | 3 | 3 | 3 |
| UK | 15 | 15 | 12 | 11 | 19 | 15 |
| Norway | 16 | 13 | 6 | 5 | 6 | 10 |
| Japan | 17 | 16 | 18 | 9 | 4 | 4 |
| Austria | 18 | 19 | 22 | 20 | 16 | 11 |
| France | 19 | 21 | 21 | 19 | 20 | 19 |
| Belgium | 20 | 22 | 23 | 22 | 17 | 21 |
| New Zealand | 21 | 20 | 13 | 13 | 11 | 9 |
| Taiwan | 22 | 18 | 16 | 23 | 18 | 14 |
| Israel | 23 | 24 | 25 | 26 | 24 | 24 |
| Spain | 24 | 23 | 27 | 25 | 29 | 28 |
| Malaysia | 25 | 27 | 20 | 17 | 23 | 23 |
| Chile | 26 | 25 | 26 | 24 | 13 | 20 |
| Hungary | 27 | 26 | 28 | 36 | 39 | 41 |
| Korea | 28 | 38 | 35 | 30 | 27 | 26 |
| Portugal | 29 | 28 | 29 | 32 | 36 | 32 |
| Italy | 30 | 30 | 30 | 34 | 28 | 29 |
| China | 31 | 29 | 24 | 27 | 26 | 31 |
| Greece | 32 | 31 | 36 | 37 | 40 | 40 |
| Thailand | 33 | 34 | 39 | 29 | 30 | 27 |
| Brazil | 34 | 35 | 37 | 33 | 37 | 38 |
| Slovenia | 35 | 40 | | | | |
| Mexico | 36 | 36 | 34 | 40 | 42 | 42 |
| Czech Rep | 37 | 41 | 38 | 35 | 34 | 39 |
| South Africa | 38 | 42 | 42 | 44 | 44 | 43 |
| Philippines | 39 | 32 | 32 | 31 | 31 | 36 |
| Poland | 40 | 44 | 45 | 43 | 43 | 45 |
| Argentina | 41 | 33 | 31 | 28 | 32 | 30 |
| Turkey | 42 | 37 | 33 | 38 | 35 | 35 |
| India | 43 | 39 | 41 | 41 | 38 | 37 |
| Colombia | 44 | 43 | 44 | 42 | 33 | 33 |
| Indonesia | 45 | 46 | 40 | 39 | 41 | 34 |
| Venezuela | 46 | 45 | 43 | 45 | 45 | 44 |
| Russia | 47 | 47 | 46 | 46 | 46 | 46 |

Example 3

a. Ballard Power Systems, Inc. stock has risen in price by $107 per share in five years.
b. Ballard Power Systems, Inc. stock has risen in price from $8 to $115 per share in five years.

Operational Definitions

An important concept, perhaps difficult to measure (e.g. the overall health of the U.S. equity market), is often operationalized with an easy-to-measure proxy (e.g. the Dow Jones Industrial Average).

Sampling

One of the fundamental principles of statistics is that we can learn a great deal about a complete population of data by looking at a smaller subset, or sample, from the population.

Types of Samples

Nonprobability: Judgement, Quota, Chunk
Probability: Simple Random, Systematic, Stratified, Cluster

Getting Started in Microsoft Excel

Frequency Distribution

| Focus | Count |
|---|---|
| National Liberal Arts | 2 |
| National University | 8 |
| Regional Liberal Arts | 18 |
| Regional University | 32 |

Percentage Distribution

| Focus | Count | Percent |
|---|---|---|
| National Liberal Arts | 2 | 3.33% |
| National University | 8 | 13.33% |
| Regional Liberal Arts | 18 | 30.00% |
| Regional University | 32 | 53.33% |

Graphs and Charts

History

Johann Heinrich Lambert (1728-1777) was a Swiss-German scientist and mathematician.
He is generally recognized as the inventor of the time series graph, in which the values of some variable of interest are plotted against the vertical axis and time is plotted on the horizontal axis.

William Playfair (1759-1823) was a Scottish political economist. He advocated the use of charts instead of tables of data, because "a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read". Playfair also invented the bar graph.

Florence Nightingale (1820-1910) was a British Army nurse in the Crimean War (1854). She used graphical tools to convince army officers to improve conditions in military hospitals. In 1860 she offered to fund a chair in applied statistics at Oxford, and was turned down.

Edward Tufte (1942- ) is a professor of political science, statistics, and computer science at Yale. He has written several excellent books about statistics and graphic design.

Personal computers and integrated software, such as the Microsoft Excel, PowerPoint, and Word programs used by most students in this class, have greatly simplified the creation of graphs and their use in documents and multimedia presentations. An unfortunate side effect has been to limit people's creativity in creating graphs.
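Before turning to the chart types, it may help to see how the percentage distribution tabulated earlier follows from the raw counts. The following is a minimal Python sketch, not part of the course's Excel workflow; the names `focus_counts` and `percentage_distribution` are illustrative assumptions, and the counts are copied from the frequency distribution of the 60 Texas universities.

```python
# Counts copied from the frequency distribution of 60 Texas universities.
focus_counts = {
    "National Liberal Arts": 2,
    "National University": 8,
    "Regional Liberal Arts": 18,
    "Regional University": 32,
}


def percentage_distribution(counts):
    """Return each category's share of the total count, in percent."""
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


percents = percentage_distribution(focus_counts)

# Print a small table: category, count, percent (two decimals).
for category, pct in percents.items():
    print(f"{category:<22} {focus_counts[category]:>3} {pct:7.2f}%")
```

Each percentage is simply the category count divided by the total (60 here), so the four percentages add up to 100%.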
Types of Charts

Bar Chart

[Figure: bar chart, "Focuses of 60 Texas Universities". Axes: School Focus (National Liberal Arts, National University, Regional Liberal Arts, Regional University) and Frequency.]

Pie Chart

[Figure: pie chart, "Focuses of 60 Texas Universities". National Liberal Arts 3%, National University 13%, Regional University 54%, Regional Liberal Arts 30%.]

Pareto Diagram

| Defect Type | Cost | Cumulative Cost | Cumulative % |
|---|---|---|---|
| DeBurr | $8,181.25 | $8,181.25 | 52.5% |
| Cut | $5,950.00 | $14,131.25 | 90.7% |
| Engrave | $848.75 | $14,980.00 | 96.2% |
| Grind | $446.25 | $15,426.25 | 99.0% |
| Weld | $148.75 | $15,575.00 | 100.0% |

[Figure: Pareto diagram, "5 Types of Manufacturing Defects". Horizontal axis: Defect Type; left vertical axis: Cost ($); right vertical axis: Cumulative %.]

Histogram

[Figure: histogram, "Texas Tuitions - 60 Universities". Horizontal axis: Tuition ($1000); vertical axis: Frequency.]

Scatter Plot

[Figure: scatter plot, "Education vs. Income". Horizontal axis: Education (Years); vertical axis: Income ($).]

Here is a time-series graph from March, 1999, showing the growth of the Dow Jones Industrial Average during the 1990s. Note how the minimum value on the vertical axis has been set to accentuate the Dow's growth: a mild example of lying with charts.

[Figure: time-series graph of the Dow Jones Industrial Average during the 1990s.]

Here is another example of lying with charts. The proportion of the number of titles in the Barnes and Noble database to the number in Amazon's is evidently 8,000,000 to 4,700,000, or about 170%. But this one-dimensional relationship is distorted in the two-dimensional graph. The area of Barnes and Noble's black bar is 2700 square centimeters, while the area of Amazon's gray bar is 800 square centimeters. This gives the visual impression that the proportion of titles is more like 340%. The distortion is augmented by the choice of color: Barnes and Noble looks bold, clear and strong, while Amazon looks washed-out, pale, and weak.

[Figure: bar graph comparing the number of titles in the Barnes and Noble and Amazon databases.]

Here's an example of a graphical technique that you can't do with Excel.
In this New York Times map of Kosovo, colors and shapes are used creatively to communicate complicated quantitative information simply and clearly (e.g. the volume and direction of refugee movements over time).

[Figure: New York Times map of Kosovo.]

Juran's Suggestions for Good Charts

General

- Label all axes with the variable name and units.
- Don't use a legend for univariate charts (charts with only one variable).
- Put the dependent variable on the vertical (Y) axis and the independent variable on the horizontal (X) axis. (We will discuss dependent and independent variables in greater detail later in the course.)
- Let horizontal and vertical axes start at zero unless you have a good reason not to.
- Keep your scales, colors, patterns, and symbols consistent.
- Eschew fancy effects that do not contribute to the reader's understanding (e.g. 3D effects, distracting colors or patterns, etc.).
- Watch your ink-to-information ratio (see Tufte).
- Keep it simple. Don't present data that aren't central to the point you are making.
- Don't rely on the reader to infer the point of your chart; state your point explicitly in the text.

Pareto Charts

- Let the left vertical axis show the values for the various categories, and be scaled so the maximum value corresponds to the total of all categories.
- Let the right vertical axis show the cumulative percent, and be scaled so that the maximum value is 100%.

Histograms

- Don't let Excel decide what values to use for the class boundaries (a.k.a. bin or bucket boundaries). Specify them yourself.
- The proper number of classes is subjective; try to use between six and ten.
- Don't use the upper class boundary as the category label on the X-axis. Use the class midpoint to avoid confusion.
- The default Excel column chart has gaps between the columns; these make a histogram harder to read. Double-click on one of the columns, select "Options", and reduce the gap width to zero.

Descriptive Statistics

Measures of Central Tendency

1) Average or Arithmetic Mean.
Example: The annual salaries (in $1000s) of the seven employees of a small government department are as follows: 48, 90, 46, 42, 40, 46, 49. The mean is:

\[ \mu = \frac{48 + 90 + 46 + 42 + 40 + 46 + 49}{7} = \frac{361}{7} = 51.571 \]

The mean salary is therefore $51,571. We use the Greek letter mu (\(\mu\)) to symbolize the mean.

Notation: We will sometimes use a mathematical shorthand notation called Summation Notation. It is easy to use and should not scare anyone; ask for help if you need it. If we have 7 data points, we can abstractly write these numbers \(X_1, X_2, \dots, X_7\) (where \(X_1 = 48\), \(X_2 = 90\), ..., \(X_7 = 49\)). Then we write the average of \(N = 7\) numbers as:

\[ \text{Average of } (X_1, X_2, \dots, X_N) = \mu = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} \]

We can also write the average:

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N}, \quad \text{where} \quad \sum_{i=1}^{N} X_i = X_1 + X_2 + X_3 + \dots + X_N = 48 + 90 + \dots + 49 \]

48 + 90 + ... + 49 = 361, so the average or mean is 361/7 = 51.571 or $51,571.

2) Median

The median of a data set is the "middle" value: the value such that half of the observations lie at or above it and half lie at or below it. To find the median salary, first arrange the salaries in ascending order: 40, 42, 46, 46, 48, 49, 90. The median salary is the middle value. In this case, it is $46,000, which (at least here) seems more representative of a typical salary than the mean value ($51,571).

This worked nicely because we had an odd number of observations. Suppose we want to find the median of the following: 48, 90, 46, 42, 40, 46, 49, 51. For an even number of observations, the median is the average of the two middle values. In ascending order the values are 40, 42, 46, 46, 48, 49, 51, 90, so the median is the average of 46 and 48, that is $47,000.

3) Mode

The mode of a data set is the "most popular" value or the value with highest frequency.

Example: The manager of a men's store observes that the 10 pairs of trousers sold yesterday have the following waist sizes (in inches): 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
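All three measures of central tendency can be checked by machine for the data sets listed so far. The sketch below is a minimal illustration using Python's standard `statistics` module rather than Excel; the helper name `central_tendency` and the variable names are assumptions introduced here, and the data are copied from the salary and waist-size examples.

```python
from statistics import mean, median, mode

salaries = [48, 90, 46, 42, 40, 46, 49]                  # seven salaries, in $1000s
salaries_even = [48, 90, 46, 42, 40, 46, 49, 51]         # eight-value median example
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]   # inches


def central_tendency(values):
    """Return the mean, median and mode of a list of numbers."""
    return {
        "mean": mean(values),      # sum of the values divided by their count
        "median": median(values),  # middle value, or average of the two middle values
        "mode": mode(values),      # most frequent value
    }


for label, data in [("salaries", salaries),
                    ("salaries (even n)", salaries_even),
                    ("waist sizes", waist_sizes)]:
    stats = central_tendency(data)
    print(f"{label:<18} mean={stats['mean']:.3f} "
          f"median={stats['median']} mode={stats['mode']}")
```

Note that `median` handles the odd and even cases exactly as described: for the eight-value list it averages the two middle values.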
The mode of these waist sizes is 34 inches, and this fact is undoubtedly of more interest to the manager than are the facts that the mean waist size is 33.2 inches and the median is 33.5 inches.

Measures of Dispersion

1) Range = maximum value - minimum value. In the salary example above, the range is 90 - 40 = 50, that is, $50,000.

2) Quartiles, Interquartile Range

Top 20 U.S. Banks (by Total Assets)

| Rank | Bank | City | Assets ($Billion) |
|---|---|---|---|
| 1 | Bank of America | Charlotte | 571.7 |
| 2 | Chase Manhattan Bank | New York | 332.2 |
| 3 | Citibank | New York | 327.9 |
| 4 | First Union National Bank | Charlotte | 229.3 |
| 5 | Morgan Guaranty Trust Company | New York | 167.7 |
| 6 | Wells Fargo Bank | San Francisco | 96.3 |
| 7 | Bank One | Chicago | 93.9 |
| 8 | Fleet National Bank | Providence | 87.7 |
| 9 | HSBC Bank | Buffalo | 79.6 |
| 10 | BankBoston | Boston | 78.3 |
| 11 | U.S. Bank | Minneapolis | 75.4 |
| 12 | Keybank | Cleveland | 75.0 |
| 13 | Bank of New York | New York | 71.8 |
| 14 | PNC Bank | Pittsburgh | 68.2 |
| 15 | Wachovia Bank | Winston-Salem | 63.6 |
| 16 | State Street Bank And Trust Co. | Boston | 56.2 |
| 17 | Bankers Trust Company | New York | 51.2 |
| 18 | Southtrust Bank | Birmingham | 43.2 |
| 19 | AmSouth Bank | Birmingham | 43.2 |
| 20 | Regions Bank | Birmingham | 42.2 |

As of 3/31/2000; source: http://www.ffiec.gov

Quartiles are used to divide a data set into four pieces; they can be thought of as statistical dividing lines between these pieces. You will discover that there are differences in the way statisticians calculate these dividing lines; here we will illustrate the method used in the Excel QUARTILE function.

For a list of n numbers, first sort the numbers in increasing order and count how many observations there are. In this case, n = 20.

In the Excel method, the first quartile is the number that is three quarters of the way from the fifth observation (from the bottom) to the sixth. The fifth is 56.2 (State Street Bank; 56,226,197 in thousands of dollars), the sixth is 63.6 (Wachovia; 63,557,835 in thousands of dollars), and the first quartile is:

\[ Q_1 = 56.2 + \tfrac{3}{4}\,(63.6 - 56.2) \approx 56.2 + 5.5 = 61.7 \]

that is, about $61.7 billion.

The second quartile is the number that is half way from the tenth observation (from the bottom) to the eleventh.
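The positions quoted in this method (three quarters of the way from the fifth observation to the sixth, half way from the tenth to the eleventh) are consistent with placing the quantile for a fraction p at position 1 + p(n - 1) in the sorted list and interpolating linearly between the neighbouring observations. The Python sketch below implements that rule under this assumption; the function name `excel_style_quantile` is hypothetical and is not an Excel or Python library call. It works from the rounded asset figures in the table, so its results can differ in the last digit from values computed with the unrounded figures.

```python
import math


def excel_style_quantile(values, p):
    """Interpolated quantile at position 1 + p*(n - 1) of the sorted data.

    Assumption: this mirrors the interpolation described in the text for
    the Excel QUARTILE function; p is a fraction between 0 and 1.
    """
    data = sorted(values)
    position = p * (len(data) - 1)          # zero-based fractional index
    lower = math.floor(position)            # observation just below the position
    upper = min(lower + 1, len(data) - 1)   # observation just above (clamped)
    fraction = position - lower             # how far to move from lower to upper
    return data[lower] + fraction * (data[upper] - data[lower])


# Total assets ($ billion) of the top 20 U.S. banks, copied from the table.
assets = [571.7, 332.2, 327.9, 229.3, 167.7, 96.3, 93.9, 87.7, 79.6, 78.3,
          75.4, 75.0, 71.8, 68.2, 63.6, 56.2, 51.2, 43.2, 43.2, 42.2]

q1, q2, q3 = (excel_style_quantile(assets, p) for p in (0.25, 0.50, 0.75))
print(f"Q1={q1:.2f}  Q2={q2:.2f}  Q3={q3:.2f}  IQR={q3 - q1:.2f}")
```

The same function gives any percentile by passing, for example, p = 0.90.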
The tenth is 75.4 (U.S. Bank), the eleventh is 78.3 (BankBoston), so the second quartile is:

\[ Q_2 = 75.4 + \tfrac{1}{2}\,(78.3 - 75.4) \approx 75.4 + 1.5 = 76.9 \]

that is, about $76.9 billion.

The third quartile is the number that is one quarter of the way from the fifteenth observation (from the bottom) to the sixteenth. The fifteenth is 96.3 (Wells Fargo), the sixteenth is 167.7 (Morgan), so the third quartile is:

\[ Q_3 = 96.3 + \tfrac{1}{4}\,(167.7 - 96.3) \approx 96.3 + 17.8 = 114.2 \]

that is, about $114.2 billion.

The interquartile range is the difference between the third and first quartile: 114.2 - 61.7 = $52.5 billion.

Percentiles are like quartiles, except they are dividing lines between hundredths of the data instead of fourths. The 25th percentile is the same as the 1st quartile, the 50th percentile is the same as the 2nd quartile, and the 75th percentile is the same as the 3rd quartile.

Quartiles can be used to create a type of chart called a Box Plot, or Box and Whisker Plot, as in this example from the Texas College data:

[Figure: box plots of tuition for four types of schools, Texas College data.]

Notice that the box plot allows us to compare central tendency and dispersion across several variables in one chart. Here we can see how tuition varies across four different types of schools. Unfortunately, Excel can't help you with box plots very well (these were created in SPSS, a popular statistics software package).

3) Variance: The average of the squared deviations of values from the arithmetic mean.

Example: To calculate the variance of the above 7 governmental salaries, first calculate the mean; it is 51.571. Then for each number, calculate its deviation from the mean, so we get 48 - 51.571 = -3.57, 90 - 51.571 = 38.43, and so forth, down to 49 - 51.571 = -2.57. Add the squares of these together, and we get (-3.57)^2 + (38.43)^2 + ... + (-2.57)^2 = 1,783.7. Then dividing by 7 we get 254.82. The variance of the above salaries is 254.82, in squared units (here, thousands of dollars squared). Using summation notation this is:

\[ \sigma^2 = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + (X_3 - \mu)^2 + \dots + (X_N - \mu)^2}{N} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2 \]

(Beware of the units of the variance: it is in the original units squared.)

4) Standard deviation: \(\sigma = \sqrt{\sigma^2}\).
This can be thought of as the "average" deviation from the mean. It is simply the square root of the variance:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{254.82} = 15.96 \]

Because the salaries are stated in $1000s, this is a standard deviation of about $15,960.

Example: A school system employs teachers at salaries between $28,000 and $50,000. The teachers' union and the school board are negotiating the form of next year's salary increases.

1. If every teacher is given a flat $1000 raise, what will this do to the mean salary?
2. To the median salary?
3. To the range?
4. To the quartiles of the salary distribution?
5. What would a flat $1000 raise do to the standard deviation of teachers' salaries?
6. If, instead, each teacher receives a 5% raise, what will this do to the mean salary?
7. To the median salary?
8. Will the 5% raise increase the standard deviation of the salaries?

Population versus Sample

A population is usually a group we want to know something about: e.g., all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc.

A population parameter is a number relevant to the population that is of interest to us: e.g., the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack.

A sample is a subset of the population that we actually do know about (by taking measurements of some kind): e.g., a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line.

A sample statistic is often the only practical estimate of a population parameter. In practice we will use sample statistics as proxies for population parameters, but it is important to remember the difference.

Sample Mean and Variance: To determine the average amount of money spent in the Central Mall, a Central City official randomly samples 12 people as they exit the mall. He asks them the amount of money spent and records the data.
The official is trying to estimate the mean and variance of the population from a sample of 12 data points. Here are the data for the 12 people:

| Person | $ spent |
|---|---|
| 1 | $132 |
| 2 | $334 |
| 3 | $33 |
| 4 | $10 |
| 5 | $123 |
| 6 | $5 |
| 7 | $6 |
| 8 | $14 |
| 9 | $449 |
| 10 | $133 |
| 11 | $44 |
| 12 | $1 |

Sample Means, Variances and Standard Deviations: A sample \((X_1, X_2, \dots, X_n)\) has sample mean, sample variance, and sample standard deviation as follows:

Sample Mean

\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

Sample Variance

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Note: The denominator of the sample variance formula is n - 1, not n. This is because of the aforementioned distinction between population parameters and sample statistics. The n - 1 formula for \(s^2\) tends to give a better estimate of \(\sigma^2\).

Sample Standard Deviation

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]

Example: The sample mean (in dollars) is

\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{132 + 334 + 33 + \dots + 1}{12} = \frac{1{,}284}{12} = 107 \]

The sample variance (in dollars squared) is

\[ s^2 = \frac{(132 - 107)^2 + (334 - 107)^2 + \dots + (1 - 107)^2}{11} = \frac{229{,}394}{11} = 20{,}854 \]

The sample standard deviation (in dollars) is

\[ s = \sqrt{s^2} = \sqrt{20{,}854} = 144.41 \]

So we estimate that on average $107 is spent per shopper, with a standard deviation of $144.41.

From now on we will be working almost exclusively with sample data. Population data are usually not easily obtained.

The Coefficient of Variation

Quality Application: Suppose we have two machines producing pipes, one of small diameter (4 inches), the other of larger diameter (30 inches). Due to imperfections in the production processes, the small pipes that come off the line do not all measure exactly 4 inches in diameter. Some differ by as much as 0.1 inches in either direction. We calculate that the mean diameter is 4.0 inches and the standard deviation of the pipe diameters is 0.05 inches. For the larger pipes, the mean diameter is 30.0 inches, and the standard deviation is also 0.05 inches. By comparing standard deviations we would say the quality of the output is identical; the same variability exists. However, in relative terms they differ.
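One way to see the difference in relative terms is to express each standard deviation as a fraction of the corresponding mean diameter. The Python sketch below does this for the two pipe sizes; the function name `relative_spread` and the dictionary layout are illustrative assumptions, and the figures are the means and standard deviations quoted above.

```python
def relative_spread(std_dev, mean):
    """Return the standard deviation as a fraction of the mean."""
    if mean == 0:
        raise ValueError("relative spread is undefined when the mean is zero")
    return std_dev / mean


# Mean diameter and standard deviation (inches) for each machine's output.
pipes = {
    "small": {"mean": 4.0, "std_dev": 0.05},
    "large": {"mean": 30.0, "std_dev": 0.05},
}

spread = {name: relative_spread(p["std_dev"], p["mean"])
          for name, p in pipes.items()}

for name, value in spread.items():
    print(f"{name} pipes: {100 * value:.2f}% of the mean diameter")
```

The same absolute variability of 0.05 inches is a much smaller share of a 30-inch diameter than of a 4-inch one.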
This is where the coefficient of variation (CV) is useful. The CV measures the variation relative to the value of the mean. For a sample with mean \(\bar{X}\) and standard deviation \(s\):

\[ CV = \frac{s}{\bar{X}} \]

Usually the CV is multiplied by 100 and stated as a percentage. For the smaller pipes, the CV is 0.05/4.0 = 0.0125 = 1.25%. For the larger pipes, the CV is 0.05/30.0 = 0.001667 = 0.17%. Thus the variability of the larger pipes is not as great as that of the smaller pipes, relative to their diameter. Their quality, one could say, is therefore better.

Example: Individual firms in the toy industry find that their annual growth rates in sales tend to fluctuate substantially from year to year, because of changing fads. In comparison, the growth rate in total industry sales remains relatively stable.

Percentage Growth in Sales

| Year | Mattel | Tonka | Industry |
|---|---|---|---|
| 1980 | +13.7% | -21.7% | +8.4% |
| 1981 | +23.9% | +4.1% | +14.0% |
| 1982 | +18.3% | -22.9% | +28.9% |
| 1983 | -52.8% | +8.3% | -19.9% |
| 1984 | +39.1% | +58.3% | +50.0% |

Here are the relevant statistics (calculated using sample formulae (why?)):

| Company | Sample Mean | Sample Std. Dev. |
|---|---|---|
| Mattel | 8.44% | 35.5% |
| Tonka | 5.22% | 33.0% |
| Industry | 16.28% | 25.9% |

We see from the standard deviations that the individual toy companies have slightly larger standard deviations than the industry as a whole. What do we mean by slightly? What are we measuring against?

For Mattel, the CV is 35.5/8.44 = 4.21 or 421%. For Tonka, the CV is 33.0/5.22 = 6.32 or 632%. For the industry as a whole, the CV is 25.9/16.28 = 1.59 or 159%. The industry's growth rate actually is a lot less variable than that of the individual firms, when measured against the average growth rate.

Selected Bibliography

Bernstein, Peter L. (1996). Against the Gods: The Remarkable Story of Risk. New York: John Wiley and Sons.
Paulos, John Allen (1995). A Mathematician Reads the Newspaper. New York: Basic Books.
Paulos, John Allen (1998). Once Upon a Number. New York: Basic Books.
Tufte, Edward (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.