Stat 31, Section 1, Last Time
• Distributions (how are data “spread out”?)
• Visual Display: Histograms
• Binwidth is critical
• Bivariate display: scatterplot
• Course Organization & Website
https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html

Exploratory Data Analysis 4
“Time Plots”, i.e. “Time Series”
Idea: when time structure is important, plot variable as a function of time
[Plot: variable on the vertical axis against time on the horizontal axis]
Often useful to “connect the dots”

Class Time Series Example
Monthly Airline Passenger Numbers
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls
• Increasing Trend (long term growth, over years)
• Increasing Variation (appears proportional to trend)
• “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)

Airline Passengers Example
Interesting variation: log transformation
• Stabilizes variation
• Since log of product is sum
• Shows changing variation prop’l to trend
• Log10 is “most interpretable” (log10(1000) = 3, …)
• Generally useful trick (there are others)

Airline Passengers Example
A look under the hood
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls
• Use Chart Wizard
• Chart Type: Line (or could do XY)
• Use subtype for points & lines
• Use menu for first log10
• Although could just type it in
• Drag down to repeat for whole column

Time Series HW
HW: 1.36, 1.37
• Use EXCEL

Exploratory Data Analysis 5
Numerical Summaries of Quant. Variables:
Idea: Summarize distributional information (“center”, “spread”, “skewed”)
In Text, Sec. 1.2
for data $x_1, x_2, \ldots, x_n$ (subscripts allow “indexing numbers” in list)

Numerical Summaries
A. “Centers” (note there are several)
1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Greek letter “Sigma”, for “sum”
In EXCEL, use “AVERAGE” function

Numerical Summaries of Center
2. “Median” = Value in middle (of sorted list)
Unsorted E.g:    Sorted E.g:
3                0
1                1
27               2
2                3
0                27
Unsorted: 27 is “in middle”? (no)
Sorted: 2 is a better “middle”!
EXCEL: use function “MEDIAN”

Difference Betw’n Mean & Median
Symmetric Distribution: Essentially no difference
Right Skewed:
[Figure: right skewed density; the median M splits it into 50% area on each side, and $\bar{x}$ lies to the right of M]
$\bar{x}$ is bigger than M since it “feels tails more strongly”

Difference Betw’n Mean & Median
Outliers (unusual values):
Nice Web Example:
http://www.stat.sc.edu/~west/applets/box.html
• Mean feels outliers much more strongly
• Leaves “range of most of data”
• Good notion of “center”? (perhaps not)
• Median affected very minimally
• Robustness Terminology: Median is “resistant to the effect of outliers”

Difference Betw’n Mean & Median
A more flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Get various dist’ns, by manipulating bar heights
• See Mean, Median and more
• Similar for symmetric distributions
• Very different when skewed
• “Big Gap”, can make median jump a lot
• But mean is less sensitive (more “continuous”)

Numerical Centerpoint HW
HW: 1.49 a (but make histograms), b
• Use EXCEL

Numerical Summaries (cont.)
B. “Spreads” (again there are several)
1. Range = biggest $x_i$ - smallest $x_i$
Problems:
• Feels only “outliers”
• Not “bulk of data”
• Very non-resistant to outliers

Numerical Summaries of Spread
2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
= “average squared distance to $\bar{x}$”
EXCEL: VAR
Drawback: units are wrong
e.g. For $x_i$ in feet, $s^2$ is in square feet

Numerical Summaries of Spread
3. Standard Deviation $s = \sqrt{s^2}$
EXCEL: STDEV
• Scale is right
• But not resistant to outliers
• Will use quite a lot later (for reasons described later)

Interactive View of S. D.
Revisit flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Note SD range centered at mean
• Can put SD “right near middle” (densely packed data)
• Can put SD at “edges of data” (U shaped data)
• Can put SD “outside of data” (big spike + outlier)
• But generally “sensible measure of spread”

Variance – S. D. HW
HW: for both data sets in 1.49, find the:
i. Variance (698.9, 1079)
ii. Standard Deviation (26.4, 32.9)
• Use EXCEL

Numerical Summaries of Spread
4. Interquartile Range = IQR
Based on “quartiles”, Q1 and Q3
(idea: shows where are 25% & 75% “through the data”)
[Figure: data split into four parts of 25% each by Q1, Q2 = median, and Q3]
IQR = Q3 - Q1

Quartiles Example
Revisit flexible web example:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• Right skewness gives:
  - Median < Mean (mean “feels farther points more strongly”)
  - Q1 near median
  - Q3 quite far (makes sense from histogram)

Quartiles Example
A look under the hood:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls
• Can compute as separate functions for each
• Or use: Tools > Data Analysis > Descriptive Stats
• Which gives many other measures as well
• Use “k-th largest & smallest” to get quartiles

5 Number Summary
1. Minimum
2. Q1 - 1st Quartile
3. Median
4. Q3 - 3rd Quartile
5. Maximum
Summarize Information About:
a) Center - from 3
b) Spread - from 2 & 4 (maybe 1 & 5)
c) Skewness - from 2, 3 & 4
d) Outliers - from 1 & 5

5 Number Summary
How to Compute?
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• EXCEL function QUARTILE
• “One stop shopping”
• IQR seems to need explicit calculation

Rule for Defining “Outliers”
Caution: There are many of these
Textbook version:
Above Q3 + 1.5 * IQR
Below Q1 - 1.5 * IQR
For stamps data:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
  - No outliers at “low end”
  - Some at “high end”

Box Plot
• Additional Visual Display Device
• Again legacy from pencil & paper days
• Not supported in EXCEL
• We will skip

5 Number Sum. & Outliers HW
1.49 c, d
1.46 and add:
(d) How much does the mean change if you omit Montana and Wyoming?
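
The log transformation point from the Airline Passengers Example (variation proportional to trend becomes constant variation after taking logs, since log of a product is a sum) can be checked on made-up numbers. The sketch below is not the class airline data: the starting level, yearly growth factor, and seasonal multipliers are all hypothetical values chosen only for illustration, and the function names are invented here.

```python
import math

# Hypothetical seasonal multipliers for 12 months (peak in summer, less in winter).
SEASONAL = [0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.3, 1.1, 1.0, 0.9, 0.9]


def make_series(years=4, start=100.0, growth=1.5):
    """Build a synthetic monthly series: (yearly level) * (seasonal multiplier)."""
    series = []
    for year in range(years):
        level = start * growth ** year  # long term growth, held fixed within a year
        series.extend(level * s for s in SEASONAL)
    return series


def yearly_swing(series):
    """Return max - min within each block of 12 months."""
    return [
        max(series[i:i + 12]) - min(series[i:i + 12])
        for i in range(0, len(series), 12)
    ]


raw = make_series()
logged = [math.log10(x) for x in raw]

raw_swings = yearly_swing(raw)     # grows along with the trend
log_swings = yearly_swing(logged)  # the same size every year

print("raw swings:", raw_swings)
print("log10 swings:", log_swings)
```

On the raw scale the seasonal swing grows with the level, while on the log10 scale it is the same every year, because log10(level * seasonal) = log10(level) + log10(seasonal).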
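
The numerical summaries in this lecture (mean, median, range, variance, standard deviation, quartiles, IQR, and the 1.5 * IQR outlier rule) can also be sketched outside EXCEL. This is a minimal illustration using Python's standard `statistics` module on the five numbers from the median slide. The helper names are invented for this sketch, and the choice of the "inclusive" quartile method is an assumption: quartile conventions differ between textbooks and software, so other rules can give different Q1 and Q3 on the same data.

```python
import statistics

# The five values from the median example (unsorted).
data = [3, 1, 27, 2, 0]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: quartiles use the 'inclusive' interpolation method;
    other conventions can give different Q1 and Q3.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def outliers_by_iqr_rule(values):
    """Return values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


mean = statistics.mean(data)          # "center" 1: feels the outlier 27
median = statistics.median(data)      # "center" 2: resistant to the outlier
data_range = max(data) - min(data)    # spread 1: biggest - smallest
variance = statistics.variance(data)  # spread 2: divides by n - 1
std_dev = statistics.stdev(data)      # spread 3: square root of the variance
summary = five_number_summary(data)
iqr = summary[3] - summary[1]         # spread 4: Q3 - Q1
outliers = outliers_by_iqr_rule(data)

print("mean:", mean, "median:", median)
print("range:", data_range, "variance:", variance, "sd:", std_dev)
print("five number summary:", summary, "IQR:", iqr)
print("outliers:", outliers)
```

With the single large value 27 in the list, the mean is pulled well above the median, and the 1.5 * IQR rule flags 27 as the only outlier.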