Class6 - NYU Stern School of Business
Statistics & Data Analysis
Course Number: B01.1305
Course Section: 31
Meeting Time: Wednesday 6-8:50 pm
CLASS #6
Professor S. D. Balkin -- March 5, 2003

Class #6 Outline
• Point estimation
• Confidence interval estimation
• Determining sample sizes
• Introduction to Regression and Correlation Analysis

Review of Last Class
• Sampling distributions for sample statistics

Review of Notation
                      Population    Sample
Mean                  $\mu$         $\bar{X}$
Standard Deviation    $\sigma$      $s$

Point and Interval Estimation
Chapter 8

Review
• The basic problem of statistical theory is how to infer a population or process value given only sample data
• Any sample statistic will vary from sample to sample
• Any sample statistic will differ from the true population value
• We must account for random error when estimating from a sample statistic

Chapter Goals
• Summarize sample data
  • Choosing an estimator
  • Unbiased estimator
• Constructing confidence intervals for means with known standard deviation
• Constructing confidence intervals for proportions
• Determining how large a sample is needed
• Constructing confidence intervals when the standard deviation is not known
• Understanding the key assumptions underlying confidence interval methods

Reminder: Statistical Inference
Problem of Inferential Statistics:
• Make inferences about one or more population parameters based on observable sample data
Forms of Inference:
• Point estimation: a single best guess regarding a population parameter
• Interval estimation: specifies a reasonable range for the value of the parameter
• Hypothesis testing: isolating a particular possible value for the parameter and testing whether this value is plausible given the available data

Point Estimators
Computing a single statistic from the sample data to estimate a population parameter
Choosing a point estimator:
• What is the shape of the distribution?
• Do you suspect outliers exist?
• Plausible choices:
  • Mean
  • Median
  • Mode
  • Trimmed Mean

Technical Definitions
ESTIMATOR: An estimator $\hat{\theta}$ of a parameter $\theta$ is a function of a random sample that yields a point estimate for $\theta$. An estimator is itself a random variable and therefore has a theoretical sampling distribution.
UNBIASED ESTIMATOR: An estimator $\hat{\theta}$ that is a function of the sample data is called unbiased for the population parameter $\theta$ if its expected value equals $\theta$, that is, $E(\hat{\theta}) = \theta$.
EFFICIENT ESTIMATOR: An estimator is called most efficient for a particular problem if it has the smallest standard error of all possible unbiased estimators.

Example
I used R to draw 1,000 samples, each of size 30, from a normally distributed population having mean 50 and standard deviation 10.

    # Storage for the 1,000 sample means and medians
    data.mean = data.median = numeric(1000)
    for (i in 1:1000) {
      data = rnorm(n = 30, mean = 50, sd = 10)   # one sample of size 30
      data.mean[i] = mean(data)
      data.median[i] = median(data)
    }

For each sample the mean and median are computed.
• Do these statistics appear unbiased?
• Which is more efficient?

Example
I used R to draw 1,000 samples, each of size 30, from an extremely skewed population with mean and standard deviation both equal to 2.

    # Storage for the 1,000 sample means and medians
    data.mean = data.median = numeric(1000)
    for (i in 1:1000) {
      # Exponential with rate 1/2: right-skewed, mean 2, standard deviation 2
      data = rexp(n = 30, rate = 1/2)
      data.mean[i] = mean(data)
      data.median[i] = median(data)
    }

For each sample the mean and median are computed.
• Do these statistics appear unbiased?
• Which is more efficient?

Expressing Uncertainty
Suppose we are trying to make inferences about a population mean $\mu$ based on a sample of size $n$. The sample mean $\bar{X}$ is a point estimator of the parameter $\mu$.
Used by itself, $\bar{X}$ is of limited usefulness because it contains no information about its own reliability. Furthermore, reporting $\bar{X}$ alone may leave the false impression that $\bar{X}$ estimates $\mu$ with complete accuracy.

Confidence Interval
An interval with random endpoints which contains the parameter of interest (in this case, $\mu$) with a pre-specified probability, denoted by $1 - \alpha$.
The confidence interval automatically provides a margin of error to account for the sampling variability of the sample statistic.
Example: A machine is supposed to fill “12 ounce” bottles of Guinness. To see if the machine is working properly, we randomly select 100 bottles recently filled by the machine, and find that the average amount of Guinness is 11.95 ounces. Can we conclude that the machine is not working properly?

No!
• By simply reporting the sample mean, we are neglecting the fact that the amount of beer varies from bottle to bottle and that the value of the sample mean depends on the luck of the draw
• It is possible that a value as low as 11.95 is within the range of natural variability for the sample mean, even if the average amount for all bottles is in fact $\mu = 12$ ounces
• Suppose we know from past experience that the amounts of beer in bottles filled by the machine have a standard deviation of $\sigma = 0.05$ ounces
• Since $n = 100$, we can assume (using the Central Limit Theorem) that the sample mean is normally distributed with mean $\mu$ (unknown) and standard error $\sigma/\sqrt{n} = 0.05/\sqrt{100} = 0.005$
• What does the Empirical Rule tell us about the sample mean?

Why does it work?
[Figure: intervals annotated "$\bar{X}$ is in here 95% of the time" and "is in here about 95% of the time"]

Using the Empirical Rule Assuming Normality

Confidence Intervals
“Statistics is never having to say you're certain”.
• (Tee shirt, American Statistical Association).
• Any sample statistic will vary from sample to sample
• Point estimates are almost inevitably in error to some degree
• Thus, we need to specify a probable range or interval estimate for the parameter

Confidence Interval
$100(1-\alpha)\%$ CONFIDENCE INTERVAL FOR $\mu$, $\sigma$ KNOWN
Using the sample mean as an estimate of the population mean, allow for sampling error with a plus-or-minus term equal to a z-table value times the standard error of the sample mean:
$\bar{y} - z_{\alpha/2}\,\sigma_{\bar{Y}} \le \mu \le \bar{y} + z_{\alpha/2}\,\sigma_{\bar{Y}}$

Example
• An airline needs an estimate of the average number of passengers on a newly scheduled flight
• Its experience is that data for the first month of flights are unreliable, but thereafter the passenger load settles down
• The mean passenger load is calculated for the first 20 weekdays of the second month after initiation of this particular flight
• If the sample mean is 112 and the population standard deviation is assumed to be 25, find a 95% confidence interval for the true, long-run average number of passengers on this flight
• Find the 90% confidence interval for the mean

Interpretation
• The confidence level of a confidence interval refers to the process of constructing confidence intervals
• Each particular confidence interval either does or does not include the true value of the parameter being estimated
• We can’t say that this particular estimate is correct to within the error
• So, we say that we have XX% confidence that the population parameter is contained in the interval
• Or…the interval is the result of a process that in the long run has a XX% probability of being correct

Imagine Many Samples
[Figure: confidence intervals from many samples on an axis running from 22 to 24; two intervals are marked "Missed!"; labels: "The interval you computed" and "The population mean = 23.29"]

Example
• A signal with value $\mu$ is sent from California; the value received in NY is normally distributed with mean $\mu$ and variance 4
• To reduce error, the same value is sent 9 times
• The successive values received are:
  • 5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5
• Construct a 99% confidence interval for $\mu$

Getting Realistic
• The population standard deviation is rarely known
• Usually both the mean and standard deviation must be estimated from the sample
• Estimate $\sigma$ with $s$
• However…with this added source of random error, we need to handle this problem using the t-distribution (later on)

Confidence Intervals for Proportions
• We can also construct confidence intervals for proportions of successes
• Recall that the expected value and standard error for the proportion of successes in a sample are:
  $E(\hat{\pi}) = \pi; \quad \sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
• How can we construct a confidence interval for a proportion?

Example
Suppose that in a sample of 2,200 households with one or more television sets, 471 watch a particular network’s show at a given time. Find a 95% confidence interval for the population proportion of households watching this show.

Example
The 1992 presidential election looked like a very close three-way race at the time when news polls reported that of 1,105 registered voters surveyed:
• Perot: 33%
• Bush: 31%
• Clinton: 28%
Construct a 95% confidence interval for Perot's proportion. What is the margin of error? What happened here?

Example
A survey found that out of 800 people, 46% thought that Clinton’s first approved budget represented a major change in the direction of the country. Another 45% thought it did not represent a major change. Compute a 95% confidence interval for the percent of people who had a positive response.
What is the margin of error? Interpret…

Choosing a Sample Size
• Gathering information for a statistical study can be expensive, time consuming, etc.
• So…the question of how much information to gather is very important
• When considering a confidence interval for a population mean $\mu$, there are three quantities to consider:
  $z_{\alpha/2}\,\sigma_{\bar{Y}} = z_{\alpha/2}\,\sigma/\sqrt{n}$

Choosing a Sample Size (cont)
Tolerability Width: the margin of acceptable error
• ±3%
• ±$10,000
Derive the required sample size using:
• Margin of error (tolerability width)
• Confidence level (z-value)
• Standard deviation (given, assumed, or calculated)

Example
• Union officials are concerned about reports of inferior wages being paid to employees of a company under its jurisdiction
• How large a sample is needed to obtain a 90% confidence interval for the population mean hourly wage with width equal to $1.00? Assume that $\sigma = 4$.

Example
A direct-mail company must determine its credit policies very carefully. The firm suspects that advertisements in a certain magazine have led to an excessively high rate of write-offs. The firm wants to establish a 90% confidence interval for this magazine’s write-off proportion that is accurate to ±2.0%
• How many accounts must be sampled to guarantee this goal?
• If this many accounts are sampled and 10% of the sampled accounts are determined to be write-offs, what is the resulting 90% confidence interval?
• What kind of difference do we see by using an observed proportion over a conservative guess?

The t Distribution
Up until now, we have assumed that the population standard deviation $\sigma$ is known, or that the sample is large enough for the sample standard deviation $s$ to replace $\sigma$.
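Before leaving the known-$\sigma$ case, the sample-size reasoning from the "Choosing a Sample Size" slides can be sketched in R. This is only an illustration: the helper names `n_for_mean` and `n_for_proportion` are hypothetical (they are not from the slides), the "width equal to $1.00" in the wage example is assumed to mean a total width, so the margin of error is $0.50, and the proportion case uses the conservative guess of 0.5 for the unknown proportion.

```r
# Sketch: sample sizes for a target margin of error.
# Helper names are hypothetical; they do not come from the course slides.

# Mean with known sigma: solve z * sigma / sqrt(n) <= margin for n, round up.
n_for_mean <- function(margin, sigma, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling((z * sigma / margin)^2)
}

# Proportion: solve z * sqrt(p * (1 - p) / n) <= margin for n, round up.
# p = 0.5 maximizes p * (1 - p), so it is the conservative (worst-case) guess.
n_for_proportion <- function(margin, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

# Union wage example: 90% interval, assumed margin $0.50, sigma = 4
n_wage <- n_for_mean(margin = 0.50, sigma = 4, conf = 0.90)

# Direct-mail example: 90% interval accurate to +/- 2%
n_mail <- n_for_proportion(margin = 0.02, conf = 0.90)

print(c(wage = n_wage, mail = n_mail))
```

Because `qnorm` returns an unrounded z-value, a hand calculation with the rounded table value 1.645 can differ by one observation in borderline cases.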
Sometimes a large sample is not possible.
So far, we’ve based the confidence interval on the z statistic:
$Z = \dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$

The t Distribution (cont)
• When the population standard deviation is unknown, it must be replaced by the sample statistic
• This yields the summary statistic
  $t = \dfrac{\bar{Y} - \mu}{s/\sqrt{n}}$
• This statistic follows a t-distribution

The t Distribution (cont)
• This statistic was derived by W. S. Gosset
• Gosset obtained a post as a chemist in the Guinness brewery in Dublin in 1899 and did important work on statistics
• He invented the t distribution to handle small samples for quality control in brewing
• He wrote under the name "Student"

Properties of the t Distribution
• Symmetric about the mean 0
• More variable than the z-distribution
[Figure: density curves of the Normal and t distributions, plotted from -4 to 4]

Properties of the t Distribution (cont)
There are many different t distributions.
• We specify a particular one by its degrees of freedom
• If a random sample is taken from a normal population, then the statistic
  $t = \dfrac{\bar{Y} - \mu}{s/\sqrt{n}}$
  has a t distribution with d.f. = n-1
• As sample size increases, the t-distribution approaches the z-distribution
R functions
• pt(t, df): $P\left(\dfrac{\bar{Y} - \mu}{s/\sqrt{n}} \le t\right)$; cumulative distribution function
• qt(p, df): the value $t$ for which $P\left(\dfrac{\bar{Y} - \mu}{s/\sqrt{n}} \le t\right) = p$; inverse CDF

Degrees of Freedom
• The technical definition is fairly complex
• Intuitively: d.f. refers to the estimated standard deviation and indicates the number of pieces of information available for that estimate
• The standard deviation is based on n deviations from the mean, but the deviations must sum to 0, so only n-1 deviations are free to vary

Example
How long should you wait before ordering new inventory?
• If you order too soon, you pay stocking costs
• If you order too late, you risk stock-outs
Your supplier says goods will arrive in two weeks (10 business days), but you made note of how many business days it actually took:
10, 9, 7, 10, 3, 9, 12, 5
• Calculate the sample mean, standard deviation, and standard error for this sample
• What is the probability a shipment takes more than two weeks?

Confidence Intervals for the t Distribution
$100(1-\alpha)\%$ CONFIDENCE INTERVAL FOR $\mu$, $\sigma$ UNKNOWN
Using the sample mean as an estimate of the population mean, allow for sampling error with a plus-or-minus term equal to a t-table value times the standard error of the sample mean:
$\bar{y} - t_{\alpha/2}\,s/\sqrt{n} \le \mu \le \bar{y} + t_{\alpha/2}\,s/\sqrt{n}$
where $t_{\alpha/2}$ is the tabulated t value cutting off a right-tail area of $\alpha/2$ with n-1 d.f.

Example
How long should you wait before ordering new inventory?
• If you order too soon, you pay stocking costs
• If you order too late, you risk stock-outs
Your supplier says goods will arrive in two weeks (10 business days), but you made note of how many business days it actually took:
10, 9, 7, 10, 3, 9, 12, 5
• Calculate a 95% confidence interval for the mean delivery time

Assumptions
Assumptions needed for validity of the Confidence Interval
1. Data are a RANDOM SAMPLE from the population of interest
   • (So that the sample can tell you about the population)
2. The sample average is approximately NORMAL
   • Either the data are normal (check the histogram)
   • Or the central limit theorem applies:
     – Large enough sample size n, distribution not too skewed
   • (So that the t table is technically appropriate)

Linear Regression and Correlation Methods
Chapter 11

Chapter Goals
• Introduction to Bivariate Data Analysis
  • Introduction to Simple Linear Regression Analysis
  • Introduction to Linear Correlation Analysis
• Interpret scatter plots

Motivating Example
• Before a pharmaceutical sales rep can speak about a product to physicians, he must pass a written exam
• An HR rep designed such a test with the hope of hiring the best possible reps to promote a drug in a high-potential area
• In order to check the validity of the test as a predictor of weekly sales, he chose 5 experienced sales reps and piloted the test with each one
• The test scores and weekly sales are given in the following table:

Motivating Example (cont)
SALESPERSON    TEST SCORE    WEEKLY SALES
JOHN           4             $5,000
BRENDA         7             $12,000
GEORGE         3             $4,000
HARRY          6             $8,000
AMY            10            $11,000

Introduction to Bivariate Data
• Up until now, we’ve focused on univariate data
• Analyzing how two (or more) variables “relate” is very important to managers
  • Prediction equations
  • Estimate uncertainty around a prediction
  • Identify unusual points
  • Describe relationship between variables
• Visualization
  • Scatterplot

Scatterplot
Do Test Score and Weekly Sales appear related?
[Figure: scatterplot of sales (4,000 to 12,000) against score (3 to 10)]

Correlation
Boomers' Little Secret Still Smokes Up the Closet
July 14, 2002
…Parental cigarette smoking, past or current, appeared to have a stronger correlation to children's drug use than parental marijuana smoking, Dr. Kandel said.
The researchers concluded that parents influence their children not according to a simple dichotomy (smoking or not smoking) but by a range of attitudes and behaviors, perhaps including their style of discipline and level of parental involvement. Their own drug use was just one component among many…

A Bit of a Hedge to Balance the Market Seesaw
July 7, 2002
…Some so-called market-neutral funds have had as many years of negative returns as positive ones. And some have a high correlation with the market's returns…

Correlation Analysis
Statistical techniques used to measure the strength of the relationship between two variables
Correlation Coefficient: describes the strength of the relationship between two sets of variables
• Denoted r
• r assumes a value between –1 and +1
• r = -1 or r = +1 indicates a perfect correlation
• r = 0 indicates no linear relationship between the two sets of variables
• Direction of the relationship is given by the coefficient’s sign
• Strength of relationship does not depend on the direction
• r measures LINEAR relationship ONLY

Example Correlations
[Figure: six scatterplot panels titled r=-0.9, r=-0.73, r=-0.25, r=0.34, r=0.7, r=0.88]
Correlation Demo

Scatterplot
r = 0.88
[Figure: scatterplot of sales (4,000 to 12,000) against score (3 to 10)]

Correlation and Causation
• Must be very careful in interpreting correlation coefficients
• Just because two variables are highly correlated does not mean that one causes the other
  • Ice cream sales and the number of shark attacks on swimmers are correlated
  • The miracle of the "Swallows" of Capistrano takes place each year at the Mission San Juan Capistrano, on March 19th, and is accompanied by a large number of human births around the same time
  • The number of cavities in elementary school children and vocabulary size have a strong positive correlation
• To establish causation, a designed experiment must be run
CORRELATION DOES NOT IMPLY CAUSATION

Regression Analysis
• Simple Regression Analysis is predicting one variable from another
  • Past data on relevant variables are used to create and evaluate a prediction equation
• The variable being predicted is called the dependent variable
• The variable used to make the prediction is an independent variable

Introduction to Regression
• Predicting future values of a variable is a crucial management activity
  • Future cash flows
  • Needs for raw materials in a supply chain
  • Future personnel or real estate needs
• Explaining past variation is also an important activity
  • Explain past variation in demand for services
  • Impact of an advertising campaign or promotion

Introduction to Regression (cont.)
• Prediction: reference to future values
• Explanation: reference to current or past values
• Simple Linear Regression: a single independent variable predicting a dependent variable
  • Independent variable is typically something we can control
  • Dependent variable is typically something that is linearly related to the value of the independent variable
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Introduction to Regression (cont.)
• Basic Idea: Fit a straight line that relates the dependent variable (y) and the independent variable (x)
• Linearity Assumption: the slope of the equation does not change as x changes
• Assuming linearity we can write
  $y = \beta_0 + \beta_1 x + \varepsilon$
  which says that Y is made up of a predictable part (due to X) and an unpredictable part
• Coefficients are interpreted as the true, underlying intercept and slope

Regression Assumptions
We start by assuming that for each value of X, the corresponding value of Y is random, and has a normal distribution.

Which Line?
There are many good fitting lines through these points
[Figure: scatterplot of sales (4,000 to 12,000) against score (3 to 10)]
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html

Least Squares Principle
This method gives a best-fitting straight line by minimizing the sum of the squares of the vertical deviations about the line
Regression Coefficient Interpretations:
• $\hat{\beta}_0$: Y-intercept; estimated value of Y when X = 0
• $\hat{\beta}_1$: slope of the line; average change in the predicted value of Y for each change of one unit in the independent variable X

Least Square Estimates
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}; \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$S_{xx} = \sum_i (x_i - \bar{x})^2$
$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$

Back to the Example
simple.lm(score, sales)
y = 1133.33 x + 1199.99
[Figure: scatterplot of y (4,000 to 12,000) against x (3 to 10) with the fitted line]

Back to the Example
Regression Plot
Sales = 1200 + 1133.33 Score
S = 1,955
[Figure: Sales (4,000 to 12,000) against Score (3 to 10) with the fitted line]

Example
It is well known that the more beer you drink, the more your blood alcohol level rises. However, how much it rises per additional beer is not clear.
Student    Beers    BAL
1          5        0.100
2          2        0.030
3          9        0.190
4          8        0.120
5          3        0.040
6          7        0.095
7          3        0.070
8          5        0.060
9          3        0.020
10         5        0.050
• Calculate the correlation coefficient
• Perform a regression analysis

Homework #6
Hildebrand/Ott
• HO: 7.1, page 204
• HO: 7.2, page 204-205
• HO: 7.14, page 211
• HO: 7.17, page 211
• HO: 7.18, page 211
• HO: 7.20, page 214
• HO: 7.21, page 214
• HO: 7.30, page 218
• HO: 7.39, page 229
• HO: 7.74, page 244
Verzani
• 13.4 – first part (do not test the hypothesis). Provide an interpretation.
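
As a check on the least-squares formulas, the fit reported for the sales-rep example (slope 1133.33, intercept 1200, r = 0.88, S = 1,955) can be reproduced with a few lines of R. This is a minimal sketch: the variable names are illustrative, S is assumed to be the residual standard deviation with n-2 degrees of freedom, and base R's `lm` is used as a cross-check in place of the `simple.lm` call shown on the slide.

```r
# Sales-rep data from the motivating example
score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)

# Sums of squares and cross-products about the means
Sxx <- sum((score - mean(score))^2)
Sxy <- sum((score - mean(score)) * (sales - mean(sales)))
Syy <- sum((sales - mean(sales))^2)

# Least-squares slope and intercept
b1 <- Sxy / Sxx
b0 <- mean(sales) - b1 * mean(score)

# Correlation coefficient
r <- Sxy / sqrt(Sxx * Syy)

# Residual standard deviation (assumed meaning of "S" on the regression plot)
resid_sd <- sqrt(sum((sales - (b0 + b1 * score))^2) / (length(score) - 2))

# Cross-check against base R's built-in linear model fit
fit <- lm(sales ~ score)

print(round(c(intercept = b0, slope = b1, r = r, S = resid_sd), 2))
print(coef(fit))
```

The same steps apply to the beer and blood-alcohol exercise by swapping in the Beers and BAL columns for `score` and `sales`.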