
Stat 31, Section 1, Last Time
• Distributions (how are data “spread out”?)
• Visual Display: Histograms
• Binwidth is critical
• Bivariate display: scatterplot
• Course Organization & Website
  https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html

Exploratory Data Analysis 4
“Time Plots”, i.e. “Time Series”:
Idea: when time structure is important, plot variable as a function of time
[Sketch: variable on the vertical axis, time on the horizontal axis]
Often useful to “connect the dots”

Class Time Series Example
Monthly Airline Passenger Numbers
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls
• Increasing Trend (long term growth, over years)
• Increasing Variation (appears proportional to trend)
• “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)

Airline Passengers Example
Interesting variation: log transformation
• Stabilizes variation
• Since log of product is sum
• Shows changing variation prop’l to trend
• Log10 is “most interpretable” (log10(1000) = 3, …)
• Generally useful trick (there are others)

Airline Passengers Example
A look under the hood
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls
• Use Chart Wizard
• Chart Type: Line (or could do XY)
• Use subtype for points & lines
• Use menu for first log10
• Although could just type it in
• Drag down to repeat for whole column

Time Series HW
HW: 1.36, 1.37
• Use EXCEL

Exploratory Data Analysis 5
Numerical Summaries of Quant. Variables:
Idea: Summarize distributional information (“center”, “spread”, “skewed”)
In Text, Sec. 1.2
for data $x_1, x_2, \ldots, x_n$ (subscripts allow “indexing numbers” in list)

Numerical Summaries
A. “Centers” (note there are several)
1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Greek letter “Sigma”, for “sum”
In EXCEL, use “AVERAGE” function

Numerical Summaries of Center
2. “Median” = Value in middle (of sorted list)
   Unsorted E.g:                Sorted E.g:
     3                            0
     1                            1
    27  “in middle”? (no)         2  better “middle”!
     2                            3
     0                           27
EXCEL: use function “MEDIAN”

Difference Betw’n Mean & Median
Symmetric Distribution: Essentially no difference
Right Skewed:
[Sketch: right-skewed density, 50% of the area on each side of the median M, with the mean $\bar{x}$ to the right of M]
$\bar{x}$ bigger than M since the mean “feels tails more strongly”

Difference Betw’n Mean & Median
Outliers (unusual values):
Nice Web Example: http://www.stat.sc.edu/~west/applets/box.html
• Mean feels outliers much more strongly
• Leaves “range of most of data”
• Good notion of “center”? (perhaps not)
• Median affected very minimally
• Robustness Terminology: Median is “resistant to the effect of outliers”

Difference Betw’n Mean & Median
A more flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Get various dist’ns, by manipulating bar heights
• See Mean, Median and more
• Similar for symmetric distributions
• Very different when skewed
• “Big Gap”, can make median jump a lot
• But mean is less sensitive (more “continuous”)

Numerical Centerpoint HW
HW: 1.49 a (but make histograms), b
• Use EXCEL

Numerical Summaries (cont.)
B. “Spreads” (again there are several)
1. Range = biggest $x_i$ - smallest $x_i$
[Sketch: the range marked as the full width of the data]
Problems:
• Feels only “outliers”
• Not “bulk of data”
• Very non-resistant to outliers

Numerical Summaries of Spread
2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
   = “average squared distance to $\bar{x}$”
EXCEL: VAR
Drawback: units are wrong
e. g. For $x_i$ in feet, $s^2$ is in square feet

Numerical Summaries of Spread
3. Standard Deviation = $s = \sqrt{s^2}$
EXCEL: STDEV
• Scale is right
• But not resistant to outliers
• Will use quite a lot later (for reasons described later)

Interactive View of S. D.
Revisit flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Note SD range centered at mean
• Can put SD “right near middle” (densely packed data)
• Can put SD at “edges of data” (U shaped data)
• Can put SD “outside of data” (big spike + outlier)
• But generally “sensible measure of spread”

Variance – S. D. HW
HW: for both data sets in 1.49, find the:
i. Variance (698.9, 1079)
ii. Standard Deviation (26.4, 32.9)
• Use EXCEL

Numerical Summaries of Spread
4. Interquartile Range = IQR
Based on “quartiles”, Q1 and Q3
(idea: shows where are 25% & 75% “through the data”)
[Sketch: data split into four pieces of 25% each by Q1, Q2 = median, and Q3]
IQR = Q3 – Q1

Quartiles Example
Class example (stamps data):
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• Right skewness gives:
  – Median < Mean (mean “feels farther points more strongly”)
  – Q1 near median
  – Q3 quite far (makes sense from histogram)

Quartiles Example
A look under the hood:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls
• Can compute as separate functions for each
• Or use: Tools → Data Analysis → Descriptive Stats
• Which gives many other measures as well
• Use “k-th largest & smallest” to get quartiles

5 Number Summary
1. Minimum
2. Q1 - 1st Quartile
3. Median
4. Q3 - 3rd Quartile
5. Maximum
Summarize Information About:
a) Center - from 3
b) Spread - from 2 & 4 (maybe 1 & 5)
c) Skewness - from 2, 3 & 4
d) Outliers - from 1 & 5

5 Number Summary
How to Compute?
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• EXCEL function QUARTILE
• “One stop shopping”
• IQR seems to need explicit calculation

Rule for Defining “Outliers”
Caution: There are many of these
Textbook version:
  Above Q3 + 1.5 * IQR
  Below Q1 – 1.5 * IQR
For stamps data:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
  – No outliers at “low end”
  – Some at “high end”

Box Plot
• Additional Visual Display Device
• Again legacy from pencil & paper days
• Not supported in EXCEL
• We will skip

5 Number Sum. & Outliers HW
1.49 c, d
1.46 and add:
(d) How much does the mean change if you omit Montana and Wyoming?
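Outside EXCEL, the 5 number summary and the textbook outlier rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data list is made up (it is not the stamps data), the function names are hypothetical, and the "inclusive" quartile method is one of several conventions, so Q1 and Q3 may differ slightly from what a textbook or another program reports.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates between data points; other quartile
    # conventions exist and can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def flag_outliers(values):
    """Textbook rule: flag points above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1  # IQR needs an explicit calculation, as in EXCEL
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical, made-up data (NOT the stamps data from the class spreadsheet).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

summary = five_number_summary(data)
outliers = flag_outliers(data)

print("5 number summary:", summary)
print("IQR:", summary[3] - summary[1])
print("outliers:", outliers)
```

For this made-up list the only flagged point is at the “high end”, which is the same pattern the slides describe for the stamps data.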

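The resistance ideas from the center and spread slides (mean vs. median, range, variance, standard deviation) can also be checked numerically. The sketch below reuses the small list from the median slide (3, 1, 27, 2, 0) and compares the summaries with and without the unusual value 27. The helper name `describe` is hypothetical, and dropping 27 is just an illustration of “omit the outlier”, not a recommended practice.

```python
import statistics

def describe(values):
    """Collect the numerical summaries discussed in the slides."""
    return {
        "mean": statistics.mean(values),          # EXCEL: AVERAGE
        "median": statistics.median(values),      # EXCEL: MEDIAN
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # divides by n - 1, like EXCEL VAR
        "sd": statistics.stdev(values),           # square root of the variance
    }

# Unsorted example list from the median slide.
data = [3, 1, 27, 2, 0]
# Same list with the unusual value 27 dropped (illustration only).
trimmed = [x for x in data if x != 27]

with_outlier = describe(data)
without_outlier = describe(trimmed)

for name in with_outlier:
    print(f"{name:>8}: with 27 = {with_outlier[name]:.2f}, "
          f"without 27 = {without_outlier[name]:.2f}")
```

In this example the median moves only from 2 to 1.5 when 27 is dropped, while the mean, range, variance and standard deviation all change a great deal, which is what “resistant” and “non-resistant” mean in the slides.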