Statistics for Business and Economics: Types of data and descriptives
STT 315: Section 201
Instructor: Abdhi Sarkar

What is statistics?
We see it every day and rely on it. It is information that can be derived from data. With sound intuition and mathematical tools we can describe what has occurred until now and predict and project into the future. Statistics has applications in almost every scientific field, and especially in business and economics.

Population and Sample
• Suppose we would like to estimate the fraction of East Lansing residents who are students.
• In this case, the population is all East Lansing residents.
• However, surveying the entire population may be costly, time-consuming and laborious. We can therefore do our job by selecting a sample which is “a good representative of the population”.

Parameter and Statistic
• Parameters are the values we calculate from the population data. Population mean, population variance, population median etc. are examples of parameters.
• Statistics - a word with 2 meanings
  – A subject, like mathematics or physics.
  – Values we compute from sample data. Sample mean, sample variance, sample proportion etc. are examples of statistics. The singular of statistics is “statistic”.

There are four basic processes in statistics:
1. Data Collection
2. Data Organization
3. Data Analyses
4. Interpretation of the Analyses

There are two broad categories of data:
a. Qualitative / Categorical
   Example: Hair Color, Hometown, Nationality, Types of Cars, Yes or No questions, Blood Type etc.
b. Quantitative / Numerical
   Example: Height of a person, Car Mileage, Annual Income, Age, Property values, etc.

Statistical Methods

Descriptive Statistics
• Involves collection of data
• Organization of data
Data are usually organized by careful tabulation and by graphing plots and figures.
It also involves summarizing characteristics in terms of the mean, median, mode etc.

Inferential Statistics
• Point estimation and interval estimation
• Testing of hypotheses
Here lies the true essence of statistics: understanding the characteristics of a population from the sample data observed from it.

Descriptive Statistics
To visualize the observed or collected data, we first need to organize it. Once the data have been tabulated in a spreadsheet or other computer software, we use several different plots to describe them.

Data Presentation
• Qualitative Data: Summary Table, Bar Graph, Pie Chart, Pareto Diagram
• Quantitative Data: Dot Plot, Stem-&-Leaf Display, Histogram

Describing Qualitative Data:
Some key terms:
• A class is one of the categories into which qualitative data can be classified.
• The class frequency is the number of observations in the data set falling into a particular class.
• The class relative frequency is the class frequency divided by the total number of observations in the data set.
• The class percentage is the class relative frequency multiplied by 100.

Summary Table
• This table lists the different categories/classes and the corresponding number of elements for each category.
• The number is obtained by tallying responses.
• Sometimes these numbers may be represented as percentages.

Major         Frequency/Count
-----------------------------
Accounting     50
Business       30
Economics      20
-----------------------------
Total         100

Bar Graph and Pareto Diagram
• A bar graph is a chart that uses vertical bars to show comparisons among categories.
• The X-axis of the chart shows the specific categories being compared, and the Y-axis represents a discrete value.
• Each bar is of equal width.
• The heights of the bars may show the frequency or the relative frequency (in %).
• A Pareto diagram is a bar graph in which the bars are arranged in descending order of height.

Pie Chart
• This chart gives a breakdown of all the categories by dividing the circle into slices whose angles are proportional to the frequencies in each category.
• Its utility mainly lies in showing relative differences among categories.

[Figures: bar graph, Pareto diagram (Accounting, Business, Economics in descending order) and pie chart (Accounting 50%, Business 30%, Economics 20%) of the majors data]

Describing Quantitative Data:
Consider the data set of pulse rates, in beats per minute, for a group of 30 students.

68 60 72 56 76 68 64 80 72 88
76 80 68 80 76 84 92 64 68 80
56 72 72 64 68 68 60 76 84 72

Dot Plot:
1. The horizontal axis is a scale for the quantitative variable.
2. The numerical value of each measurement is represented on the horizontal scale by a dot.

[Figure: dot plot of the pulse rate data, horizontal axis from 50 to 90]

Stem and Leaf plot:
1. Each observation of the quantitative variable is divided into a stem and a leaf.
2. The stem is usually the tens place, or a combination of the hundreds and tens places, of the value.
3. The leaf is the units place. The leaves are placed in ascending order alongside each other, which facilitates ordering the data in ascending order.
Below is the stem and leaf plot for the same data used in the dot plot.

Stem | Leaf
5    | 6 6
6    | 0 0 4 4 4 8 8 8 8 8 8
7    | 2 2 2 2 2 6 6 6 6
8    | 0 0 0 0 4 4 8
9    | 2

Histogram:
1. The possible numerical values of the quantitative variable are partitioned into class intervals, where each interval has the same width.
2. These intervals form the scale of the horizontal axis.
3. The frequency or relative frequency of observations in each class interval is determined.
4. A vertical bar is placed over each class interval, with height equal to either the class frequency or the class relative frequency.
5. Each bar is immediately adjacent to its neighbors.

[Figure: histogram of the pulse rate data]

Numerical Measures of Central Tendency
The central tendency of a set of measurements is the tendency of the data to cluster, or center, about certain numerical values. It is usually measured by the mean, median or mode.

Mean:
1. Most common measure of central tendency
2. Acts as ‘balance point’
3. Affected by extreme values (‘outliers’)
4. Denoted by $\bar{x}$
Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where n = number of observations and the $x_i$ are the observed values.
Example: Mean pulse rate = (56+56+60+……+88+92)/30 = 72.1333

Median:
1. The median is a measure of central tendency, but it is a positional value.
2. When the data are ordered in ascending order, the median is the midpoint of the data set.
3. The position of the median is $\frac{n+1}{2}$. Two situations arise:
   a. When n is odd: Ex: n=9, then the position of the median is 5 = (9+1)/2.
   b. When n is even: Ex: n=10, then the position of the median is 5.5 = (10+1)/2, i.e. the median is the average of the 5th and 6th values in the ordered data set.
4. The median is not affected by extreme values.

Example:
Raw Data:            24.1  22.6  21.5  23.7  22.6
Ordered:             21.5  22.6  22.6  23.7  24.1
Position of Median:  1     2     3     4     5
Median = 22.6 since n=5 is odd and (n+1)/2 = 3.

Raw Data:            10.3  4.9  8.9  11.7  6.3   7.7
Ordered:             4.9   6.3  7.7  8.9   10.3  11.7
Position of Median:  1     2    3    4     5     6
Median = (7.7+8.9)/2 = 8.3 since n=6 is even and (n+1)/2 = 3.5.

Quartiles:
Just as we find the median by splitting the ordered data into 2 parts, we find the quartiles by splitting the data into four parts.
• First Quartile (Q1): Median of the first half of the data
• Third Quartile (Q3): Median of the second half of the data
• Second Quartile (Q2) is the same as the median.

Mode:
1. The mode is the value with the highest observed frequency.
2. It is not affected by extreme values.
3. A mode need not be unique: in some cases the data may have multiple modes.

Example:
Raw Data: 68 60 76 68 64 80 72 76 92 68 56 72 68 60 84 72 56 88 76 80 68 80 84 64 80 72 64 68 76 72
The value with the highest frequency, i.e. the value that appears most often in the data, is 68 (it appears 6 times). This can be verified from the dot plot of the same data.

Uses of Mode:
We can use the mode in various entrepreneurial scenarios where sales are considered.
For example, the most worn shoe size is a mode: because most people have mid-sized feet, it would be wasteful to produce many shoes that are too large or too small.

Effect of Linear Transformation
• Suppose every observation is multiplied by a fixed constant. Then
  – the median of the transformed observations is the median of the original observations times that same constant;
  – the mean of the transformed observations is the mean of the original observations times that same constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40. Median = 18.
Suppose transformed data = (-3)*original data.
So transformed data: -30, -39, -54, -66, -87
Mean = (-3)*18.40 = -55.20. Median = (-3)*18 = -54.

Effect of Linear Transformation
• Suppose a fixed constant is added to (or subtracted from) each observation. Then
  – the median of the transformed observations is the median of the original observations plus (or minus) that same constant;
  – the mean of the transformed observations is the mean of the original observations plus (or minus) that same constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40. Median = 18.
Suppose transformed data = original data + 2.5.
Hence transformed data: 12.5, 15.5, 20.5, 24.5, 31.5
Mean = 18.40 + 2.5 = 20.90. Median = 18 + 2.5 = 20.50.

Spread of a Distribution
Are the values concentrated around the center of the distribution, or are they spread out?
Measures of spread: Range, Interquartile Range, Variance, Standard Deviation.
Note: Variance and standard deviation are more appropriate when the distribution is symmetric.

Range
• The range of the data is defined as the difference between the maximum and the minimum values.
• Data: 23, 21, 67, 44, 51, 12, 35. Range = maximum – minimum = 67 – 12 = 55.
• Disadvantage: A single extreme value can make it very large, giving a value that does not really represent the data overall. On the other hand, it is not affected at all if an observation in the middle changes.

Interquartile Range (IQR)
• What is IQR? IQR = Third Quartile (Q3) – First Quartile (Q1).
• What are quartiles?
  Recall: the median divides the data into 2 equal halves. The first quartile, the median and the third quartile divide the data into 4 roughly equal parts.

Quartiles
• The first quartile (Q1, lower quartile) is the value which is larger than 25% of observations, but smaller than 75% of observations.
• The second quartile (Q2) is the median, which is larger than 50% of observations, but smaller than 50% of observations.
• The third quartile (Q3, upper quartile) is the value which is larger than 75% of observations, but smaller than 25% of observations.
• It follows that Q1 ≤ Q2 (= median) ≤ Q3.
• How to compute the quartiles? We shall use the TI 83/84 Plus.

IQR vs. Range
• IQR is a better summary of the spread of a distribution than the range because it carries some information about the entire data set, whereas the range only has information on the extreme values of the data.
• IQR is less outlier-sensitive than range.

Outlier-sensitivity
• Without the outlier. Data: 10, 13, 17, 21, 28, 32
  IQR = 15, Range = 22
• With the outlier. Data: 10, 13, 17, 21, 28, 32, 59
  IQR = 19, Range = 49
Conclusion: IQR is less outlier-sensitive than range.

Variance and Standard Deviation
• The sample variance ($s^2$) is defined as:
  $s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]$.
• Subtract the mean from each value, square each difference, add up the squares, and divide by one fewer than the sample size.
• The sample standard deviation (s) is the positive square root of the sample variance, i.e. $s = +\sqrt{s^2}$.

Variance and Standard Deviation
• The larger the variance (and standard deviation), the more dispersed the observations are around the mean.
• The unit of variance is the square of the unit of the original data, whereas standard deviation has the same unit as the original data.
• Both variance and standard deviation are more appropriate for symmetric distributions.

Standard Deviation: An Example
Data: 3, 12, 8, 9, 3 (n=5 in this case)
Mean = (3+12+8+9+3)/5 = 35/5 = 7.
Data   Deviation from mean   Squared deviation
------------------------------------------------
 3      3 – 7 = -4           (-4) x (-4) = 16
12     12 – 7 =  5             5 x 5     = 25
 8      8 – 7 =  1             1 x 1     =  1
 9      9 – 7 =  2             2 x 2     =  4
 3      3 – 7 = -4           (-4) x (-4) = 16
------------------------------------------------
                             Total       = 62

Now divide by n-1 = 4: s² = 62/4 = 15.50. s = √15.5 = 3.94.
Answer: The standard deviation in this example is 3.94 and the variance is 15.50.

Effect of Linear Transformation
• Suppose every observation is multiplied by a fixed constant. Then
  – the range/IQR/standard deviation of the transformed observations is the range/IQR/standard deviation of the original observations times the absolute value of that same constant;
  – the variance of the transformed observations is the variance of the original observations times the square of that same constant.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s ≈ 7.50 °F, s² = 56.30 °F².
Suppose transformed data = (-3)*original data.
So transformed data (in °F): -30, -39, -54, -66, -87
Range = |-3|*19 = 57 °F, IQR = |-3|*14 = 42 °F, s = |-3|*7.50 ≈ 22.5 °F, s² = (-3)²*56.30 = 506.70 °F².

Effect of Linear Transformation
• Suppose a fixed constant is added to (or subtracted from) each observation. Then the range/IQR/standard deviation/variance of the transformed observations remains the same as that of the original observations.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s ≈ 7.50 °F, s² = 56.30 °F².
Suppose transformed data = original data + 2.5.
Hence transformed data (in °F): 12.5, 15.5, 20.5, 24.5, 31.5
Range = 19 °F, IQR = 14 °F, s ≈ 7.50 °F, s² = 56.30 °F².

Chebyshev’s rule
For any distribution, at least $1 - \frac{1}{k^2}$ of the observations will fall within k standard deviations of the mean, where $k \ge 1$.
• Chebyshev’s rule holds for any distribution, whereas the empirical rule is valid only for approximately symmetric unimodal (mound-shaped) distributions.
• If k=1, not much information is available from Chebyshev’s rule, since the bound is 1 – 1/1 = 0.
• According to Chebyshev, at least 75% of observations fall within 2 standard deviations of the mean.
• According to Chebyshev, at least 88.9% of observations fall within 3 standard deviations of the mean.

Empirical rule
For approximately symmetric unimodal (bell-shaped/mound-shaped) distributions:
• Approximately 68% of observations fall within 1 standard deviation of the mean.
• Approximately 95% of observations fall within 2 standard deviations of the mean.
• Approximately 99.7% of observations fall within 3 standard deviations of the mean.

Box Plot
A box plot is another graphical representation of quantitative data. It uses the following 5-number summary:
1. Minimum Value,
2. Lower Quartile,
3. Median (the middle value),
4. Upper Quartile,
5. Maximum Value.
NOTE: Data must be ordered from lowest value to highest value before finding the 5-number summary.

Box Plots
• Are a representation of the five-number summary (Minimum, Maximum, Median, Lower Quartile, Upper Quartile).
• Half the data are in the box.
• One-quarter of the data are in each whisker.
• If one part of the plot is long, the data are skewed.
• A box plot is very useful for comparing distributions.
• This box plot indicates data are skewed to the left.

Box Plot
• A box plot is a pictorial representation of the 5-number summary.

Outliers
• Any observation farther than 1.5 times IQR from the closest boundary of the box is an outlier.
• If it is farther than 3 times IQR, it is an extreme outlier, otherwise a mild outlier.
• One can also indicate the outliers in a box plot by drawing the whiskers only up to 1.5 times IQR on both sides, and marking outliers with stars or crosses (or other symbols).

An example
Suppose min = 2, Q1 = 18, median = 20, Q3 = 22, max = 35. Which of the following observations are outliers?
A. 10
B. 15
C. 25
D. 30
Lower Fence = Q1 - 1.5*IQR = 18 - 1.5(22-18) = 12
Upper Fence = Q3 + 1.5*IQR = 22 + 1.5(22-18) = 28
Note: All observations below the lower fence or above the upper fence are considered to be outliers.

Histogram vs. Box plot
• Both histogram and box plot capture the symmetry or skewness of distributions.
• A box plot cannot indicate the modality of the data.
• A box plot is much better for finding outliers.
• The shape of a histogram depends to some extent on the choice of bins.

Comparing Distributions
We can compare the distributions of various data sets using box plots (or the 5-number summary) or histograms. We shall first compare distributions using box plots.

Which type of car has the largest median Time to accelerate?
A. upscale
B. sports
C. small
D. large
E. family

Which type of car has the smallest median time value?
A. upscale
B. sports
C. small
D. Large
E. Luxury

Which type of car always takes less than 3.6 seconds to accelerate?
A. upscale
B. sports
C. small
D. Large
E. Luxury

Which type of car has the smallest IQR for Time to accelerate?
A. upscale
B. sports
C. small
D. Large
E. Luxury

What is the shape of the distribution of acceleration times for luxury cars?
A. Left skewed
B. Right skewed
C. Roughly symmetric
D. Cannot be determined from the information given.

What percent of luxury cars accelerate to 30 mph in less than 3.5 seconds?
A. Roughly 25%
B. Exactly 37.5%
C. Roughly 50%
D. Roughly 75%
E. Cannot be determined from the information given

What percent of family cars accelerate to 30 mph in less than 3.5 seconds?
A. Less than 25%
B. More than 50%
C. Less than 50%
D. Exactly 75%
E. None of the above

Z-Scores
How to compare apples with oranges?
• A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better?
• How do we compare things when they are measured on different scales?
• We need to standardize the values.

How to standardize?
• Subtract the mean from the value and then divide this difference by the standard deviation.
• The standardized value is the z-score: $z = \frac{\text{value} - \text{mean}}{\text{std. dev.}}$
• z-scores are free of units.

z-scores: An Example
Data: 4, 3, 10, 12, 8, 9, 3 (n=7 in this case)
Mean = (4+3+10+12+8+9+3)/7 = 49/7 = 7.
Standard Deviation = 3.65.

Original Value   z-score
--------------------------------------
 4               (4 – 7)/3.65  = -0.82
 3               (3 – 7)/3.65  = -1.10
10               (10 – 7)/3.65 =  0.82
12               (12 – 7)/3.65 =  1.37
 8               (8 – 7)/3.65  =  0.27
 9               (9 – 7)/3.65  =  0.55
 3               (3 – 7)/3.65  = -1.10
--------------------------------------

Interpretation of z-scores
• The z-scores measure the distance of the data values from the mean in units of standard deviation.
• A z-score of 1 means that the data value is 1 standard deviation above the mean.
• A z-score of -1.2 means that the data value is 1.2 standard deviations below the mean.
• Regardless of the direction, the further a data value is from the mean, the more unusual it is.
• A z-score of -1.3 is more unusual than a z-score of 1.2.

How to use z-scores?
• A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better?
• SAT score mean = 1600, std dev = 500.
• ACT score mean = 23, std dev = 6.
• SAT score 1500 has z-score = (1500-1600)/500 = -0.2.
• ACT score 22 has z-score = (22-23)/6 = -0.17.
• Since -0.17 > -0.2, ACT score 22 is better than SAT score 1500.

Which is more unusual?
Heights of adult women have a mean of 63.6 in. and a std. dev. of 2.5 in.
Heights of adult men have a mean of 69.0 in. and a std. dev. of 2.8 in.
A. A 58 in tall woman: z-score = (58-63.6)/2.5 = -2.24.
B. A 64 in tall man: z-score = (64-69)/2.8 = -1.79.
C. They are the same.

Using z-scores to solve problems
An example using height data and the U.S. Marine Corps and U.S. Army height requirements.
Question: Are the height restrictions set by the U.S. Army and the U.S. Marine Corps more restrictive for men or for women, or are they roughly the same?

Data from a National Health Survey
Heights of adult women have
– mean of 63.6 in.
– standard deviation of 2.5 in.
Heights of adult men have
– mean of 69.0 in.
– standard deviation of 2.8 in.

Height Restrictions
                    Men Minimum   Women Minimum
-----------------------------------------------
U.S. Army           60 in         58 in
U.S. Marine Corps   64 in         58 in

Height Restrictions as z-scores
                    Men Minimum                                  Women Minimum
-------------------------------------------------------------------------------------------------------
U.S. Army           60 in, z-score = -3.21 (less restrictive)    58 in, z-score = -2.24 (more restrictive)
U.S. Marine Corps   64 in, z-score = -1.79 (more restrictive)    58 in, z-score = -2.24 (less restrictive)

Effect of Standardization
• Standardization into z-scores does not change the shape of the histogram.
• Standardization into z-scores changes the center of the distribution by making the mean 0.
• Standardization into z-scores changes the spread of the distribution by making the standard deviation 1.

Z-score and Empirical Rule
When data are bell-shaped, the z-scores of the data values follow the empirical rule.

Outlier detection with z-score
• The empirical rule tells us that if the data have a mound-shaped distribution, then almost all the data points are within plus or minus 3 standard deviations of the mean. So a data value whose z-score has an absolute value larger than 3 can be considered an outlier.

2004 Olympics Women’s Heptathlon
Austra Skujyte (Lithuania): Shot Put = 16.40 m, Long Jump = 6.30 m.
Carolina Kluft (Sweden): Shot Put = 14.77 m, Long Jump = 6.78 m.

                         Shot Put   Long Jump
---------------------------------------------
Mean (all contestants)   13.29 m    6.16 m
Std. Dev.                1.24 m     0.23 m
n                        28         26

Which performance was better?
A. Skujyte’s shot put: z-score of Skujyte’s shot put = 2.51.
B. Kluft’s long jump: z-score of Kluft’s long jump = 2.70.
C. Both were the same.

Based on shot put and long jump, whose performance was better?
A. Skujyte’s: z-scores: shot put = 2.51, long jump = 0.61. Total z-score = (2.51+0.61) = 3.12.
B. Kluft’s: z-scores: shot put = 1.19, long jump = 2.70. Total z-score = (1.19+2.70) = 3.89.
C. Both were the same.
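
The heptathlon comparison above can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original slides: the names `z_score`, `EVENTS`, `RESULTS` and `total_z` are hypothetical and chosen only for illustration, and the means, standard deviations and results are the ones quoted in the slides.

```python
def z_score(value, mean, std_dev):
    """Standardize a value: (value - mean) / standard deviation."""
    return (value - mean) / std_dev


# Mean and standard deviation over all contestants, as quoted in the slides.
# (Dictionary layout is an assumption made for this illustration.)
EVENTS = {
    "shot put": {"mean": 13.29, "std_dev": 1.24},
    "long jump": {"mean": 6.16, "std_dev": 0.23},
}

# Results of the two athletes, in metres, as quoted in the slides.
RESULTS = {
    "Skujyte": {"shot put": 16.40, "long jump": 6.30},
    "Kluft": {"shot put": 14.77, "long jump": 6.78},
}


def total_z(athlete):
    """Sum of an athlete's z-scores over both events."""
    return sum(
        z_score(result, EVENTS[event]["mean"], EVENTS[event]["std_dev"])
        for event, result in RESULTS[athlete].items()
    )


for athlete, results in RESULTS.items():
    for event, result in results.items():
        z = z_score(result, EVENTS[event]["mean"], EVENTS[event]["std_dev"])
        print(f"{athlete:8s} {event:10s} z = {z:.2f}")
    print(f"{athlete:8s} total      z = {total_z(athlete):.2f}")
```

Because z-scores are free of units, adding them across events is meaningful even though the shot put and long jump distances are on different scales. The larger total for Kluft matches the totals computed by hand in the slides.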