Business Statistics for Managerial Decision
Farideh Dehkordi-Vakil

Comparing Two Proportions

We often want to compare the proportions of two groups (such as men and women) that share some characteristic. We call the two groups being compared Population 1 and Population 2, and we denote the two population proportions of "successes" by \(P_1\) and \(P_2\). The data consist of two independent simple random samples (SRSs): the sample sizes are \(n_1\) from Population 1 and \(n_2\) from Population 2.

The proportion of successes in each sample estimates the corresponding population proportion. Here is the notation we will use:

| Population | Population proportion | Sample size | Count of successes | Sample proportion |
|---|---|---|---|---|
| 1 | \(P_1\) | \(n_1\) | \(X_1\) | \(\hat{p}_1 = X_1 / n_1\) |
| 2 | \(P_2\) | \(n_2\) | \(X_2\) | \(\hat{p}_2 = X_2 / n_2\) |

Sampling Distribution of \(\hat{p}_1 - \hat{p}_2\)

Choose independent SRSs of sizes \(n_1\) and \(n_2\) from two populations with proportions \(P_1\) and \(P_2\) of successes. Let \(D = \hat{p}_1 - \hat{p}_2\) be the difference between the two sample proportions of successes. Then, as both sample sizes increase, the sampling distribution of \(D\) becomes approximately Normal. The mean of the sampling distribution is \(P_1 - P_2\). The standard deviation of the sampling distribution is

\[ \sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}} \]

[Figure: the sampling distribution of the difference of two sample proportions, \(\hat{p}_1 - \hat{p}_2\), is approximately Normal. The mean and standard deviation are found from the two population proportions of successes, \(P_1\) and \(P_2\).]

Confidence Interval

Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals. The Wilson estimates of the two population proportions are

\[ \tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2}, \qquad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2} \]

The standard deviation of \(\tilde{D} = \tilde{p}_1 - \tilde{p}_2\) is approximately

\[ \sigma_{\tilde{D}} = \sqrt{\frac{P_1(1 - P_1)}{n_1 + 2} + \frac{P_2(1 - P_2)}{n_2 + 2}} \]

To obtain a confidence interval for \(P_1 - P_2\), we replace the unknown parameters \(P_1\) and \(P_2\) in the standard deviation by the estimates \(\tilde{p}_1\) and \(\tilde{p}_2\) to obtain an estimated standard deviation, or standard error.
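The interval construction described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `plus_four_ci` and its parameters are hypothetical, and the counts in the usage line are made up purely to exercise the function.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for P1 - P2 based on the Wilson estimates."""
    # Wilson estimates: add one success and one failure to each sample.
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    # Standard error: the standard deviation with the estimates plugged in.
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    # Standard Normal critical value z* for the requested confidence level.
    z_star = NormalDist().inv_cdf(0.5 + conf_level / 2)
    diff = p1 - p2
    return diff, se, (diff - z_star * se, diff + z_star * se)


# Hypothetical counts, chosen only to show how the function is called.
diff, se, (low, high) = plus_four_ci(x1=40, n1=100, x2=25, n2=100)
print(f"estimate {diff:.3f}, SE {se:.4f}, 95% CI ({low:.3f}, {high:.3f})")
```

The critical value is computed from the confidence level rather than hard-coded as 1.96, so the same sketch covers other levels such as 90% or 99%.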
Confidence Interval for Comparing Two Proportions

An approximate confidence interval for \(P_1 - P_2\) is \((\tilde{p}_1 - \tilde{p}_2) \pm z^* \, SE_{\tilde{D}}\), where \(z^*\) is the standard Normal critical value for the chosen confidence level and

\[ SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}} \]

Example: "No Sweat" Garment Labels

Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a "No Sweat" label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.

For some conditions, it was stated that the garment had a "No Sweat" label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a "label user" or a "label nonuser." About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.

The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let \(X\) denote the number of label users.

| Population | \(n\) | \(X\) | \(\hat{p} = X / n\) | \(\tilde{p} = (X + 1) / (n + 2)\) |
|---|---|---|---|---|
| 1 (women) | 296 | 63 | 0.213 | 0.215 |
| 2 (men) | 251 | 27 | 0.108 | 0.111 |

First calculate the standard error of the observed difference:

\[ SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}} = \sqrt{\frac{(0.215)(0.785)}{296 + 2} + \frac{(0.111)(0.889)}{251 + 2}} \approx 0.0309 \]

The 95% confidence interval is

\[ (\tilde{p}_1 - \tilde{p}_2) \pm z^* \, SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0309) = 0.104 \pm 0.060 = (0.04, 0.16) \]

With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16. Alternatively, we can report that the proportion of label users is about 10 percentage points higher among women than among men, with a 95% margin of error of 6 percentage points. In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (-0.104).
Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion. The choice does not affect the substance of the analysis.

Significance Tests

It is sometimes useful to test the null hypothesis that the two population proportions are the same. We standardize \(D = \hat{p}_1 - \hat{p}_2\) by subtracting its mean \(P_1 - P_2\) and then dividing by its standard deviation

\[ \sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}} \]

If \(n_1\) and \(n_2\) are large, the standardized difference is approximately \(N(0, 1)\). To estimate \(\sigma_D\) we take into account the null hypothesis that \(P_1 = P_2\).

If these two proportions are equal, we can view all of the data as coming from a single population. Let \(P\) denote the common value of \(P_1\) and \(P_2\). The standard deviation of \(D = \hat{p}_1 - \hat{p}_2\) is then

\[ \sigma_{D_p} = \sqrt{\frac{P(1 - P)}{n_1} + \frac{P(1 - P)}{n_2}} = \sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

We estimate the common value of \(P\) by the overall proportion of successes in the two samples:

\[ \hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2} \]

This estimate of \(P\) is called the pooled estimate. To estimate the standard deviation of \(D\), substitute \(\hat{p}\) for \(P\) in the expression for \(\sigma_{D_p}\). The result is a standard error for \(D\) under the condition that the null hypothesis \(H_0: P_1 = P_2\) is true. The test statistic uses this standard error to standardize the difference between the two sample proportions.

Significance Tests for Comparing Two Proportions

To test \(H_0: P_1 = P_2\), compute the test statistic

\[ z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{D_p}}, \qquad SE_{D_p} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

and find the P-value from the standard Normal distribution.

Example: Men, Women, and Garment Labels

The previous example presented the survey data on whether consumers are "label users" who pay attention to label details when buying a shirt. Are men and women equally likely to be label users?
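The pooled test described above can be sketched as follows. This is a minimal illustration under the slides' large-sample Normal approximation; the function name `two_proportion_z_test` is hypothetical, and the counts are the garment-label survey counts reported earlier (63 label users among 296 women, 27 among 251 men).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: P1 = P2 using the pooled estimate."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled estimate: overall proportion of successes in both samples.
    p_pooled = (x1 + x2) / (n1 + n2)
    # Standard error of D under the null hypothesis P1 = P2.
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-sided P-value from the standard Normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_pooled, se, z, p_value


# Survey counts: women are population 1, men are population 2.
p_pooled, se, z, p_value = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled = {p_pooled:.4f}, SE = {se:.5f}, z = {z:.2f}, P = {p_value:.4f}")
```

Because the sketch works from the raw counts rather than from rounded proportions, its output may differ in the last digit from a hand calculation.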
Here is the data summary:

| Population | \(n\) | \(X\) | \(\hat{p} = X / n\) |
|---|---|---|---|
| 1 (women) | 296 | 63 | 0.213 |
| 2 (men) | 251 | 27 | 0.108 |

We compare the proportions of label users in the two populations (women and men) by testing the hypotheses

\[ H_0: P_1 = P_2 \qquad H_a: P_1 \ne P_2 \]

The pooled estimate of the common value of \(P\) is

\[ \hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645 \]

This is the proportion of label users in the entire sample.

The test statistic is calculated as follows:

\[ SE_{D_p} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181 \]

\[ z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{D_p}} = \frac{0.213 - 0.108}{0.03181} = 3.30 \]

The observed difference is more than 3 standard deviations away from zero.

The P-value is

\[ 2P(Z \ge 3.30) = 2(1 - 0.9995) = 2(0.0005) = 0.001 \]

Conclusion: 21% of women are label users versus only 11% of men; the difference is statistically significant.

Simple Regression

Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x). The dependent variable is the variable for which we want to make a prediction. While various non-linear forms may be used, simple linear regression models are the most common.

Introduction

• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

| Lot size | Man-hours |
|---|---|
| 30 | 73 |
| 20 | 50 |
| 60 | 128 |
| 80 | 170 |
| 40 | 87 |
| 50 | 108 |
| 60 | 135 |
| 30 | 69 |
| 70 | 148 |
| 60 | 132 |

The goal of the analyst who studies the data is to find a functional relation \(y = f(x)\) between the response variable y and the predictor variable x.
[Figure: scatter plot, "Statistical relation between Lot size and Man-Hour"; horizontal axis: Lot size, vertical axis: Man-Hour.]

Regression Function

The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take \(f(x)\) to be the expected value (i.e., mean value) of Y.
3. Given that \(E(Y)\) denotes the expected value of Y, call the equation \(E(Y) = f(x)\) the regression function.

Historical Origin of Regression

Regression analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers. Heights of sons of both tall and short fathers appeared to "revert" or "regress" to the mean of the group.

Basic Assumptions of a Regression Model

A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that \(\mu_y\) is the mean value of Y, the standard form of the model is \(Y = f(x) + \varepsilon\), where \(\varepsilon\) is a random variable with a normal distribution.

[Figure: scatter plot, "Statistical relation between Lot size and number of Man-Hours", Westwood Company example.]

[Figure: pictorial presentation of the linear regression model.]

Construction of Regression Models

• Selection of independent variables
• Functional form of regression relation
• Scope of model

Uses of Regression Analysis

Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction

The several purposes of regression analysis frequently overlap in practice.

Formal Statement of the Model

General regression model:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

1. \(\beta_0\) and \(\beta_1\) are parameters.
2. X is a known constant.
3. Deviations \(\varepsilon\) are independent \(N(0, \sigma^2)\).

Meaning of Regression Coefficients

The values of the regression parameters \(\beta_0\) and \(\beta_1\) are not known. We estimate them from data. \(\beta_1\) indicates the change in the mean response per unit increase in X.
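What the model statement implies can be illustrated with a small simulation. This is only a sketch: the parameter values (beta0 = 10, beta1 = 2, sigma = 5) are invented for illustration and do not come from the slides, and the helper names `mean_response` and `simulate_y` are hypothetical.

```python
import random


def mean_response(x, beta0, beta1):
    """Regression function E(Y) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


def simulate_y(x, beta0, beta1, sigma, rng):
    """One draw of Y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma^2)."""
    return mean_response(x, beta0, beta1) + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
beta0, beta1, sigma = 10.0, 2.0, 5.0  # hypothetical parameter values

for x in (30, 31):
    draws = [simulate_y(x, beta0, beta1, sigma, rng) for _ in range(5)]
    print(x, mean_response(x, beta0, beta1), [round(d, 1) for d in draws])
```

Individual draws scatter around the line, while the mean response moves by exactly beta1 when x increases by one unit, which is the interpretation of the slope given above.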
Regression Line

If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.

\[ y = \beta_0 + \beta_1 x \]

we can summarize the relationship by drawing a straight line on the plot. The least squares method gives us the "best" estimated line for our set of sample data.

We will write an estimated regression line based on sample data as

\[ \hat{y} = b_0 + b_1 x \]

The method of least squares chooses the values for \(b_0\) and \(b_1\) to minimize the sum of squared errors

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \]

Using calculus, we obtain the estimating formulas:

\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

Estimation of Mean Response

The fitted regression line can be used to estimate the mean value of y for a given value of x.

Example: The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

| y | x |
|---|---|
| 1250 | 41 |
| 1380 | 54 |
| 1425 | 63 |
| 1425 | 54 |
| 1450 | 48 |
| 1300 | 46 |
| 1400 | 62 |
| 1510 | 61 |
| 1575 | 64 |
| 1650 | 71 |

Point Estimation of Mean Response

From the previous table we have:

\[ n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755 \]

The least squares estimates of the regression coefficients are:

\[ b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = \frac{85690}{7944} \approx 10.8 \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - (10.787)(56.4) \approx 828 \]

The intercept is computed with the unrounded slope 10.787; using the rounded value 10.8 would give about 827.4.

The estimated regression function is:

\[ \hat{y} = 828 + 10.8x \]

Sales = 828 + 10.8 Expenditure

This means that if the weekly advertising expenditure is increased by $1, we would expect mean weekly sales to increase by about $10.80.

Fitted values for the sample data are obtained by substituting the x value into the estimated regression function. For example, if the advertising expenditure is $50, then the estimated sales is:

Sales = 828 + 10.8(50) = 1368

This is called the point estimate of the mean response (sales).

Residual

The residual is the difference between the observed value \(y_i\) and the corresponding fitted value \(\hat{y}_i\):

\[ e_i = y_i - \hat{y}_i \]

Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
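The least squares formulas above can be checked against the advertising data with a short Python sketch. The function name `least_squares_fit` is hypothetical; the data are the ten (x, y) pairs from the slides.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Weekly advertising expenditure (x) and weekly sales (y) from the slides.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

b0, b1 = least_squares_fit(x, y)
print(f"b0 = {b0:.1f}, b1 = {b1:.2f}")
print(f"estimated mean sales at x = 50: {b0 + b1 * 50:.1f}")
```

The sketch keeps the unrounded coefficients, so its prediction at x = 50 can differ by a fraction of a unit from the value obtained with the rounded line 828 + 10.8x.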
Example: Weekly Advertising Expenditure

| y | x | y-hat | Residual (e) |
|---|---|---|---|
| 1250 | 41 | 1270.8 | -20.8 |
| 1380 | 54 | 1411.2 | -31.2 |
| 1425 | 63 | 1508.4 | -83.4 |
| 1425 | 54 | 1411.2 | 13.8 |
| 1450 | 48 | 1346.4 | 103.6 |
| 1300 | 46 | 1324.8 | -24.8 |
| 1400 | 62 | 1497.6 | -97.6 |
| 1510 | 61 | 1486.8 | 23.2 |
| 1575 | 64 | 1519.2 | 55.8 |
| 1650 | 71 | 1594.8 | 55.2 |
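The fitted values and residuals in this table can be reproduced with a few lines of Python. This sketch assumes the rounded fitted line from the slides, y-hat = 828 + 10.8x; the variable names are illustrative.

```python
# Weekly advertising expenditure (x) and weekly sales (y) from the slides.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

# Rounded least squares estimates as reported on the slides.
b0, b1 = 828.0, 10.8

# Fitted value for each x, then residual e = y - y-hat.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for yi, xi, fi, ei in zip(y, x, fitted, residuals):
    print(f"{yi:6d} {xi:4d} {fi:8.1f} {ei:8.1f}")

# Sum of squared errors for this (rounded) line.
sse = sum(e ** 2 for e in residuals)
print(f"SSE = {sse:.2f}")
```

Because the coefficients are rounded, these residuals do not sum exactly to zero, as they would for the exact least squares line.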