Stat 31, Section 1, Last Time
• Time series plots
• Numerical Summaries of Data:
  – Center: Mean, Median
  – Spread: Range, Variance, S.D., IQR
• 5 Number Summary & Outlier Rule
• Course Organization & Website
  https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html

Comments From Grader
I encountered some problems in the grading. These problems are:
1. the homework pages are not stapled together.
2. the answers are not in the same order as the questions.
3. the results, especially in Excel tables, are not highlighted.
Could you please emphasize the above problems in your class? If the students follow the rules, the grading will be much easier. In the grading of homework #2, I also hope that you can allow me to enforce the rules by giving zero points.

Linear Transformations
Idea: What happens to data & summaries, when data are “shifted and scaled”, i.e. “panned and zoomed”
Math: $x_1, \ldots, x_n$, scaled by $a$ and shifted by $b$, become $ax_1 + b, \ldots, ax_n + b$

Linear Transformations
Effect on linear summaries:
• Centerpoints, $\bar{x}$ and $M$, “follow data”: $a\bar{x} + b$, $aM + b$.
• Spreads, $s$ and IQR, “feel scale, not shift”: $as$, $a\,\mathrm{IQR}$ (for $a > 0$).

Most Useful Linear Transfo.
“Standardization”
Goal: put data sets on “common scale”
Approach:
1. Subtract Mean $\bar{x}$, to “center at 0”
2. Divide by S.D. $s$, to “give common SD = 1”

Standardization
Result is called “z-score”: $z_i = \frac{x_i - \bar{x}}{s}$
Note that $s z_i = x_i - \bar{x}$, so $\bar{x} + s z_i = x_i$
Thus $z_i$ is interpreted as: “number of SDs from the mean”

Standardization Example
Buffalo Snowfall Data:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg7Done.xls
• Standardized data have same (EXCEL default) histogram shape as raw data. (Since axes and bin edges just follow the transformation)
• i.e. “shape” doesn’t depend on “scaling”

Standardization Example
A look under the hood:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg7Raw.xls
1. Compute AVERAGE and SD
2. Standardize by:
   a. Create Formula in cell B2
   b. Drag downwards
   c. Keep Mean and SD cells fixed using “$”s
3.
Check that the standardized data have mean 0 & SD 1 (note that “8.247E-16 = 0”, i.e. zero up to rounding error)

Standardization HW
C6: For the 18 female scores in 1.49, use EXCEL to:
a. Give the list of standardized scores
b. Give the Z-score for:
   (i) the mean (0)
   (ii) the median (-0.0967)
   (iii) the smallest (-1.52)
   (iv) the largest (2.23)
1.79

Modelling Distributions
Text: Section 1.3
Idea: Approximate histograms by an “idealized curve”, i.e. a “density curve” that represents the population

Idealized Curve Example
Recall Hidalgo Stamps Data, Shifting Bin Movie (made # modes change):
https://www.unc.edu/~marron/UNCstat31-2005/StampsHistLoc.mpg
Add idealized curve:
https://www.unc.edu/~marron/UNCstat31-2005/StampsHistLocKDE.mpg
Note: “population curve” shows why histogram modes appear and disappear

Interpretation of Density
Areas under density curve give “relative frequency”:
Proportion of data between $a$ & $b$ = Area under $f(x)$ between $a$ & $b$ = $\int_a^b f(x)\,dx$

Interpretation of Density
Note: Total Area under density = 1 (since relative freq. of everything is 1)
HW:
• 1.78 (b: 0.8), 1.79
Work with pencil and paper, not EXCEL

Most Useful Density
“Normal Curve” = “Gaussian Density”
• Shape: “like a mound”
• E.g. of “sand dumped from a truck”
• Older, worse, description: “bell shaped”

Normal Density Example
Winter Daily Maximum Temperatures in Melbourne, Australia
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg9Done.xls
Notes:
• Top Histogram is “mound shaped”
• Plus “small scale random variation”
• So model with “Normal Density”?

Normal Density Curves
Note: there is a family of normal curves, indexed by:
i. “Center”, i.e. Mean = $\mu$
ii. “Spread”, i.e. Standard
Deviation = $\sigma$
Terminology: $\mu$ & $\sigma$ are called “parameters” (Greek “mu”, and Greek “sigma” ~ $s$)

Family of Normal Curves
Think about:
• “Shifts” (pans) indexed by $\mu$
• “Scales” (zooms) indexed by $\sigma$
Nice interactive graphical example:
http://www.stat.sc.edu/~west/applets/normaldemo1.html
(note area under curve is always 1)

Normal Curve Mathematics
The “normal $\mu, \sigma$ density curve” is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
where $f$ is a usual “function” of $x$, $\pi$ is the circle constant = 3.14…, and $e$ is the natural number = 2.7…

Normal Curve Mathematics
Main Ideas:
• Basic shape is: $e^{-\frac{1}{2}x^2}$
• “Shifted to mu”: $e^{-\frac{1}{2}(x - \mu)^2}$
• “Scaled by sigma”: $e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
• Make Total Area = 1: divide by $\sigma\sqrt{2\pi}$
• $f(x) \to 0$ as $x \to \pm\infty$, but never 0

Normal Model Fitting
Idea: Choose $\mu, \sigma$ to give a “good” fit to data $x_1, \ldots, x_n$.
Approach: IF the distribution is “mound shaped” & outliers are negligible THEN a “good” choice of normal model is: $\mu = \bar{x}$, $\sigma = s$

Normal Fitting Example
Revisit Melbourne Daily Max Temps
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg9Done.xls
• Fit curve, using $\mu = \bar{x}$, $\sigma = s$
• “Visually good” approximation

Normal Fitting Example
A look under the hood
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg9Done.xls
• Use chosen (not default) histogram bins, for nice comparison
• Use longer range to avoid the “More” bin
• Can compute with density formula (Two steps, in cols F and G)
• Or use NORMDIST function (col J, check same as col G)

Normal Curve HW
C7: A study of distance runners found a mean weight of 63.1 kg, with a standard deviation of 4.8 kg. Assuming that the distribution of weights is normal, use EXCEL to draw the density curve of the weight distribution.

2 Views of Normal Fitting
1. “Fit Model to Data”: Choose $\mu = \bar{x}$ & $\sigma = s$.
2. “Fit Data to Model”: First Standardize Data, Then use Normal $\mu = 0$, $\sigma = 1$.
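The two views of normal fitting can be compared numerically. The following is a minimal Python sketch, not the course's EXCEL workflow: the function name `normal_density` and the data values are hypothetical, made up purely for illustration. It evaluates the fitted density directly (view 1), then again by standardizing first and using the normal density with mean 0 and SD 1 (view 2), dividing by $s$ to return to the original scale.

```python
import math
import statistics


def normal_density(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical data, made up for illustration (not taken from the course files).
data = [58.2, 61.0, 63.5, 64.1, 66.3, 70.4]
xbar = statistics.mean(data)
s = statistics.stdev(data)  # sample SD (divides by n - 1)

x = 65.0

# View 1: "Fit Model to Data", i.e. use mu = xbar and sigma = s directly.
view1 = normal_density(x, xbar, s)

# View 2: "Fit Data to Model", i.e. standardize, then use mean 0 and SD 1.
# Dividing by s converts the density back to the original scale.
z = (x - xbar) / s
view2 = normal_density(z, 0.0, 1.0) / s

print(view1, view2)
```

The two printed values agree up to rounding error, which is the sense in which the two views differ only by a rescaling.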
Note: same thing, just different rescalings (choose scale depending on need)

Normal Distribution Notation
The “normal distribution, with mean $\mu$ & standard deviation $\sigma$” is abbreviated as: $N(\mu, \sigma)$

Interpretation of Z-scores
Idea: Z-scores are on the $N(0, 1)$ scale, so use areas to interpret
Important Areas:
1. Within 1 sd of mean: ≈ 68%, “the majority”

Interpretation of Z-scores
2. Within 2 sd of mean: ≈ 95%, “really most”
3. Within 3 sd of mean: ≈ 99.7%, “almost all”

Interpretation of Z-scores
Interactive Version (used for above pics), From Webster West’s Website:
http://www.stat.sc.edu/~west/applets/empiricalrule.html

Interpretation of Z-scores
Summary: These relations are called the “68 - 95 - 99.7 % Rule”
HW: 1.82 (a: 234-298, b: 234, 298), 1.83
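The three areas in the rule can be checked numerically. The sketch below is in Python rather than EXCEL, and the helper name `area_within` is an assumption introduced for illustration. It relies on the standard identity that the area under the standard normal density between $-k$ and $k$ equals $\mathrm{erf}(k/\sqrt{2})$.

```python
import math


def area_within(k):
    """Area under the standard normal density (mean 0, SD 1) between -k and k."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd of mean: {100 * area_within(k):.1f}%")
```

This prints about 68.3%, 95.4% and 99.7%, so the rule's figures are rounded versions of the exact areas.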