McGraw-Hill/Irwin
Chapter 4: Descriptive Statistics
• Numerical Description
• Central Tendency
• Dispersion
• Standardized Data
• Percentiles, Quartiles, and Box Plots
• Correlation
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved.

Numerical Description
• Three key characteristics of numerical data:

Characteristic: Central Tendency
  Interpretation: Where are the data values concentrated? What seem to be typical or middle data values?
Characteristic: Dispersion
  Interpretation: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Characteristic: Shape
  Interpretation: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

Central Tendency
Six Measures of Central Tendency

Statistic: Mean
  Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  Excel Formula: =AVERAGE(Data)
  Pro: Familiar and uses all the sample information.
  Con: Influenced by extreme values.
Statistic: Median
  Formula: Middle value in sorted array
  Excel Formula: =MEDIAN(Data)
  Pro: Robust when extreme data values exist.
  Con: Ignores extremes and can be affected by gaps in data values.
Statistic: Mode
  Formula: Most frequently occurring data value
  Excel Formula: =MODE(Data)
  Pro: Useful for attribute data or discrete data with a small range.
  Con: May not be unique, and is not helpful for continuous data.
Statistic: Midrange
  Formula: $\frac{x_{\min} + x_{\max}}{2}$
  Excel Formula: =0.5*(MIN(Data)+MAX(Data))
  Pro: Easy to understand and calculate.
  Con: Influenced by extreme values and ignores most data values.
Statistic: Geometric mean (G)
  Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
  Excel Formula: =GEOMEAN(Data)
  Pro: Useful for growth rates and mitigates high extremes.
  Con: Less familiar and requires positive data.
Statistic: Trimmed mean
  Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
  Excel Formula: =TRIMMEAN(Data, Percent)
  Pro: Mitigates effects of extreme values.
  Con: Excludes some data values that could be relevant.

Skewness
• Compare mean and median or look at histogram to determine degree of skewness.
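The six measures of central tendency can also be computed outside Excel. The sketch below uses Python's standard `statistics` module; the `sample` values, the helper names `trimmed_mean` and `central_tendency`, and the rule of rounding down the number of trimmed values at each end are illustrative assumptions, not part of the slides.

```python
import statistics

# Hypothetical sample, chosen only to illustrate the calculations.
sample = [1, 2, 2, 3, 4, 4, 4, 5, 6, 19]


def trimmed_mean(values, percent):
    """Mean after dropping the lowest and highest `percent`% of the sorted values.

    Assumption: the count dropped from EACH end is rounded down. Excel's
    TRIMMEAN instead takes the total fraction to exclude, split across both ends.
    """
    ordered = sorted(values)
    drop = int(percent * len(ordered) // 100)
    kept = ordered[drop:len(ordered) - drop]
    if not kept:
        raise ValueError("percent is too large for this sample")
    return statistics.fmean(kept)


def central_tendency(values, percent=10):
    """Return the six measures of central tendency as a dict."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # returns the first mode if several tie
        "midrange": 0.5 * (min(values) + max(values)),
        "geometric_mean": statistics.geometric_mean(values),  # needs positive data
        "trimmed_mean": trimmed_mean(values, percent),
    }


for name, value in central_tendency(sample).items():
    print(f"{name:>15}: {value:.3f}")
```

For this sample the mean (5.0) exceeds the median (4.0) because the single large value 19 pulls the mean upward, which is the mean-versus-median comparison used to judge skewness. Trimming 10% from each end removes 1 and 19 and gives 3.75.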
Dispersion
• Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

Measures of Variation

Statistic: Range
  Formula: $x_{\max} - x_{\min}$
  Excel: =MAX(Data)-MIN(Data)
  Pro: Easy to calculate.
  Con: Sensitive to extreme data values.
Statistic: Variance ($s^2$)
  Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  Excel: =VAR(Data)
  Pro: Plays a key role in mathematical statistics.
  Con: Non-intuitive meaning.
Statistic: Standard deviation (s)
  Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
  Excel: =STDEV(Data)
  Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
  Con: Non-intuitive meaning.
Statistic: Coefficient of variation (CV)
  Formula: $CV = 100 \times \frac{s}{\bar{x}}$
  Excel: None
  Pro: Measures relative variation in percent, so data sets can be compared.
  Con: Requires nonnegative data.
Statistic: Mean absolute deviation (MAD)
  Formula: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
  Excel: =AVEDEV(Data)
  Pro: Easy to understand.
  Con: Lacks “nice” theoretical properties.

Standardized Data
Chebyshev’s Theorem

The Empirical Rule
• Are there any unusual values or outliers?
[Figure: sorted data values 7, 8, ..., 48, 55, 68, 91 plotted on a scale marked at -19.5, -5.4, 8.6, 22.72, 36.8, 50.9, and 65.0, with regions labeled "Unusual" and "Outliers" on each side.]

Defining a Standardized Variable
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
  Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
  Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$

Percentiles and Quartiles
Percentiles
• Percentiles are data that have been divided into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.

Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• Consider the following five-number summary:
  Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
• The box plot displays the five-number summary visually. [Figure: box plot of the summary above.]
• A box plot shows central tendency, dispersion, and shape.

Correlation
Correlation Coefficient
• The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y. Its range is -1 ≤ r ≤ +1.
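The dispersion measures, standardized values, five-number summary, and correlation coefficient covered in this chapter can be sketched in Python with only the standard library. Several things here are assumptions rather than content from the slides: the `sample` data, all function names, the cutoffs of 2 and 3 standard deviations for "unusual" and "outlier", and the product-moment formula used for r (the standard definition; the slides give only its range). Quartile conventions also differ between tools, so `statistics.quantiles` may not match Excel or the textbook exactly.

```python
import math
import statistics

# Hypothetical sample, chosen only to illustrate the calculations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]


def dispersion(values):
    """Range, sample variance, sample standard deviation, CV (%), and MAD."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # n - 1 denominator
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # n - 1 denominator
        "stdev": s,
        "cv_percent": 100 * s / mean,
        "mad": statistics.fmean([abs(x - mean) for x in values]),
    }


def z_scores(values):
    """Standardize each observation with the sample mean and standard deviation."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def flag(z):
    """Label a z-score. Assumed cutoffs: beyond 2 is unusual, beyond 3 is an outlier."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the library's default quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))


def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(dispersion(sample))
print([flag(z) for z in z_scores(sample)])
print(five_number_summary(sample))
print(correlation(sample, [2 * x + 1 for x in sample]))
```

Because the last call pairs each x with an exact linear function of x, r comes out at the upper end of its range (1, up to floating-point rounding); reversing the slope would give -1.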