Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Choice modelling wikipedia , lookup
Expectation–maximization algorithm wikipedia , lookup
Forecasting wikipedia , lookup
Regression analysis wikipedia , lookup
Data assimilation wikipedia , lookup
Regression toward the mean wikipedia , lookup
Robust statistics wikipedia , lookup
Review of Midterm math2200 Data • W’s • Subjects • Variables – Categorical versus quantitative One categorical variable • Graphs: – Bar chart – Pie chart • Numerical summary: – Frequency table – Relative frequency table Two categorical variables • Conditional and marginal distribution • Graphs – Segmented bar charts – Side-by-side bar charts – Side-by-side pie charts • Numerical summary – Contingency table – table percentage, row percentage, column percentage Problems 28 (page 148) • Birth order related to major? – – – – What percent of these students are oldest or only children? (113/223) What percent of Humanities majors are oldest children? (15/43) What percent of oldest children are Humanities students? (15/113) What percent of the students are oldest children majoring in the Humanities? (15/223) Major Birth Order 1 2 3 4+ Total Math/Science 34 14 6 3 57 Agriculture 52 27 5 9 93 Humanities 15 17 8 3 43 Other 12 11 1 6 30 total 113 69 20 21 223 Problems 30 (page 148) • • What is the marginal distribution of majors? Math/Science Agriculture Humanities 57 (25.6%) 93 (41.7%) 43 (19.3%) 30 (13.5%) Total Other 223 What is the conditional distribution of majors for the oldest children? Math/Science Agriculture Humanities 34 (30.1%) 52 (46.0%) 15 (13.3%) 12 (10.6%) Major Total Other 113 Birth Order 1 2 3 4+ Total Math/Science 34 14 6 3 57 Agriculture 52 27 5 9 93 Humanities 15 17 8 3 43 Other 12 11 1 6 30 total 113 69 20 21 223 Simpson’s Paradox • Problem 3.38: Two delivery services Delivery Service Pack Rats Boxes R Us Type of Service Number of deliveries Number of late packages Regular 400 12 (3%) Overnight 100 16(16%) Regular 100 2(2%) Overnight 400 28 (7%) Overall percentage of late deliveries 5.60% 6% One quantitative variable • Graphs – Histogram – Boxplot • Qualitative summary – # of modes – Symmetric? Transformation? – Outliers? • Numerical summary – Five-number summary – Center: mean versus median – Spread: sd versus IQR Problem 32: Pay • The 1999 National Occupational Employment and Wage Estimates for management Occupations – For chief executives • Mean = $48.67/hour • Median = $52.08/hour – For General and Operations Managers • Mean = $31.69/hour • Median = $27.23/hour – Are these wage distributions likely to be symmetric, skewed to the left or skewed to the right? Shifting and rescaling Location shift rescale min x x Q1 x x median x x Q3 x x max x x mean x x spread variance x Standard deviation x IQR x range x Problem 4.42: Job Growth • 20 cities’ job growth rates predicted by Standard & Poor’s DRI in 1996 • Are the mean and median very • Frequency 3 4 5 Histogram of Predicted Job Growth Rates 0 1 2 • 1 2 3 Growth rate (%) 4 5 • • different? Which one is more appropriate? – Mean (2.37%) or median (2.235%)? – SD (0.425%) or IQR (0.515%)? If we subtract from these growth rates the predicted U.S. average growth rate of 1.20%, how would this change the above summary statistics? If we omit Las Vegas (growth rate=3.72%) from the data, how would you expect the above summary statistics to change? How to summarize the distribution of the data? One quantitative variable and one categorical variable • Comparing groups – with histogram, boxplot, stem-and-leaf plot – Transformation when spread is too different across groups Normal model • Z-score and standard normal • Nearly normal condition – Normal probability plot • Four types of problems – Given parameters and data values (or z-score), ask for probabilities – Given parameters and probabilities, ask for data values (or z-score) – Given probabilities and data values (or z-score), ask for parameters – Given probabilities, data values (or z-score) and one parameter, ask for the other parameter Problem 22: Winter Olympic 2002 speed skating • Top 25 men’s and 25 women’s 500-m speed skating times – Mean = 73.46 – Sd = 3.33 • If the Normal model is appropriate, what percent of the times should be within 1.67 seconds of 73.46? – Solution 1: 1.67=0.5*sd, Normcdf(-0.5,0.5,0,1) – Solution 2: Normcdf(72.19, 75.13, 73.46, 3.33) • In the data, only 6% are within that range. Why are the percentages so different? Problem 39: assembly time • Only 25% of the company’s customers succeeded in building the desk under an hour • 5% said it took them over 2 hours • Assume that consumer assembly time follows a Normal model • Mean = ? , SD= ? – Z-score corresponding to 25%: • (1- mean)/ SD = invNorm(0.25,0,1) = 0.6744897495 – Z-score corresponding to 95%: • (2- mean)/ SD = invNorm(0.95,0,1) = 1.644853626 • Solve the two equations, we have • mean = 1.29 • SD = 0.43 Problem 39: assembly time (cont.) • Mean =1.29, sd=0.43 • What assembly time should the company quote in order that 60% of customers succeed in finishing the desk by then? – invNorm(0.6,1.29,0.43) Problem 39: assembly time (cont.) • Mean =1.29, sd=0.43 • The company wishes to improve the onehour success rate to 60%. If the sd stays the same, what new lower mean time does the company need to achieve? – Z-score = invnorm (0.6,0,1) – Z-score = (1-mean)/sd – Mean = 0.89 Correlation • • • • • • • • Sign of r means? The range of r? X and Y are called uncorrelated if and only if r=0 r(x,y)=r(y,x) No units Effected by shifting or rescaling X, Y or both? Uncorrelated does NOT imply no association Sensitive to outliers (remove a point close to the line fitted through the scatterplot increase or decrease r?) Correlation: Review II 13, Page 264 • What factor most explains differences in Fuel Efficiency among cars? Here’s a correlation matrix exploring that relationship for the car’s Weight, Horsepower, engine size (Displacement), and number of Cylinders. MPG Weight HorsePower Displace ment MPG 1.000 Weight -0.903 1.000 Horse-Power -0.871 0.917 1.000 Displacement -0.786 0.951 0.872 1.000 Cylinders 0.917 0.864 0.940 -0.806 Cylinders 1.000 a) Which factor seems most strongly associated with Fuel Efficiency ? b) What does the negative correlation indicate? c) Explain the meaning of R^2 for that relationship. Matching r and scatterplots • Here are several scatterplots. The calculated correlations are 0.85, 0.87, 0.04 and 0.53. which is which? Linear regression (least squares) • How to calculate the slope? • Given the slope, and standard deviations, how to calculate the correlation? • The line always goes through • Residual = – Overestimation – Underestimation • Causal relationship ? • How to interpret ? Diagnostics of a Linear Model 1. Visual inspection: scatter plot satisfies the “Straight Enough Condition”? Looks okay, 2. Regression: calculate the regression equation, r and R^2. (R^2=r*r gives the percentage of variation of the data explained by the model). R^2 is tiny, say<0.2, a linear model may not be a good choice. 3. Residuals: check the residual plot even when R^2 is large. Bad sign if we see some pattern. The spread of the residuals are supposed to about the same across the X-axis if the linear model is appropriate. (you can either put predicted value or x-variable on x-axis). 4. Re-expression: consider re-expressing the data. If a linear model is not appropriate for the data, And remember to repeat the diagnostics every time after fitting a new linear model on the transformed data. Randomness Simulation Simulation Component ? Response variable? Trial? Example: 11.20 Suppose the chance of passing the driver’s test is 34% the first time and 72% for the subsequent retests. Estimate the percentage of those tested who still do not have a driver’s license after two attemps. Check list • Graphs and plots: bar chart, pie chart, histogram, boxplot (mod boxplot on ti-83), normal probability plot, scatterplot, residual plot How to make ? How to interpret ? • Statistics : mean, medium, min, max, range, quartiles, standard deviation, IQR, correlation coefficient How to calculate ? How to interpret? • Model: normal distribution, linear regression. How to get the parameters ?