BA 555 Practical Business Analysis

Agenda
  Housekeeping
  Review of Statistics
  Exploring Data
  Sampling Distribution of a Statistic
  Confidence Interval Estimation
  Hypothesis Testing

Definition
  "Statistics" is the science of data. It involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.
  We will learn how to make ______ based on data.

Fundamental Elements of Statistics
  A population is a set of units (usually people, objects, transactions, or events) that we are interested in studying. It is the totality of items or things under consideration.
  A sample is a subset of the units of a population. It is the portion of the population that is selected for analysis.
  A parameter is a numerical descriptive measure of a population. It is a summary measure that is computed to describe a characteristic of an entire population.
  A statistic is a numerical descriptive measure of a sample. It is a summary measure calculated from the observations in the sample.

Example
  A manufacturer of computer chips claims that less than 10% of his products are defective. When 1,000 chips were drawn from a large production, 7.5% were found to be defective.
  What is the population of interest?
  What is the sample?
  What is the parameter?
  What is the statistic?
  Does the value 10% refer to the parameter or to the statistic?
  Is the value 7.5% a parameter or a statistic?

Statistical Analysis (p.3)
  POPULATION: parameters $\mu$, $\sigma^2$, $\sigma$, $p$, etc.
  Selecting a random sample: $X_1, X_2, \ldots, X_n$
  Describing uncertainty: Random variables, Probability, Distributions
    Discrete: binomial distribution
    Continuous: normal distribution
    Sampling distribution of the sample mean
  Sample of size n: $x_1, x_2, \ldots, x_n$; statistics: $\bar{x}$, $s^2$, $s$, $\hat{p}$, etc.
  Organizing data: Qualitative, Quantitative
  Drawing conclusions from data: Estimation, Hypothesis Testing, Regression Analysis, Contingency Tables

Types of Data (p.2)
  Numerical (Quantitative) Data
    Regular numerical observations.
    Arithmetic calculations are meaningful.
    Examples: Age, Household income, Starting salary
  Categorical (Qualitative) Data
    Values are the (arbitrary) names of possible categories.
    Examples: Gender (Female = 1 vs Male = 0), College major

Employee Database (class website, EmployeeDB.sf3)
  Quantitative
  Qualitative

Describing Qualitative Data (p.4)
  Qualitative Data
    Graphical Methods: Pie chart, Bar chart, Line graph
    Numerical Methods: Frequency tables (Categorical Data), e.g. gender, college major, etc.
  Display one variable: Histogram, Stem-and-Leaf Display, Dot plot
  Display two variables: Scatter plot
  Display one variable over time: Time series plot
  Measures of Location:
    Mean:
      sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
      population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
    Median: Arrange the observations in ascending order.
      If n is odd, median = the middle number.
      If n is even, median = the simple average of the middle two observations.
    Mode: The measurement that occurs most frequently in ...
  Measures of Relative Standing: Percentiles

Summarizing Qualitative Data
  [Figures: Barchart for Gender (horizontal axis: frequency, 0 to 50); Piechart for Gender (F 39.44%, M 60.56%)]

  Frequency Table for Gender
  ----------------------------------------------------------------
                             Relative    Cumulative   Cum. Rel.
  Class  Value  Frequency    Frequency   Frequency    Frequency
  ----------------------------------------------------------------
  1      F      28           0.3944      28           0.3944
  2      M      43           0.6056      71           1.0000
  ----------------------------------------------------------------

Describing Quantitative Data (p.4)
  Graphical Methods

Quantitative Data: Histogram
  [Figures: Histogram for SALARY (horizontal axis: Salary (in $000), vertical axis: frequency); Histogram for AGE (horizontal axis: Age, vertical axis: frequency)]
  Histogram applet

Describing Quantitative Data (p.4)
  Descriptive/Summary Statistics

Guessing Correlations
  [Figure: four scatter plots with correlations -0.99, -0.29, 0.54, 0.95]

Correlation: Be Careful
  [Figure: scatter plot and correlation value]

Example
  Given the data below, complete the following summary statistics table.
  (Data are in ascending order): 10.0, 10.5, 12.2, 13.9, 13.9, 14.1, 14.7, 14.7, 15.1, 15.3, 15.9, 17.7, 18.5

                         Variable X
  Count                  13
  Average                14.3462
  Median                 14.7
  Variance               5.94936
  Standard deviation     2.43913
  Minimum                10.0
  Maximum                18.5
  Range                  8.5
  Lower quartile         13.9
  Upper quartile         15.3
  Interquartile range    1.4
  Sum                    186.5

  [Figure: Box-and-Whisker Plot of Variable X, axis from 10 to 20]
  Lower invisible line: 11.8
  Upper invisible line: 17.4

Statgraphics Plus (SG+) Demo (p.1)
  Questions to ask when describing and summarizing data:
  Where is the approximate center of the distribution?
  Are the observations close to one another, or are they widely dispersed?
  Is the distribution unimodal, bimodal, or multimodal? If there is more than one mode, where are the peaks, and where are the valleys?
  Is the distribution symmetric? If not, is it skewed? If symmetric, is it bell-shaped?

The Empirical Rule (p.5)
  1. Approximately 68% of the observations will fall within 1 standard deviation of the mean.
  2. Approximately 95% of the observations will fall within 2 standard deviations of the mean.
  3. Approximately 99.7% of the observations will fall within 3 standard deviations of the mean.
  [Figure: mound-shaped curve marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$ (population version: $\mu - 3\sigma$ through $\mu + 3\sigma$), with areas 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%]

Example
  The average salary for employees with similar background/skills/etc. is about $120,000. Your salary is $122,000. Is it a big deal? Why or why not?
  What additional information is required to answer this question?

What to do next?
  Generalize the results from the empirical rule.
  Justify the use of the mound-shaped distribution.
  [Figure: the empirical-rule curve again, with areas 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15% and a standardized axis from -3 to 3]

Example: Warranty Level
  Mean = 30,000 miles
  STD = 5,000 miles
  Q1: If the level of warranty is set at 15,000 miles, about what % of tires will be returned under warranty?
  Q2: If we can accept that up to 2.5% of tires can be returned under warranty, what should be the new warranty level?
  [Figure: normal density curve, horizontal axis from 0 to 60]

Example: Warranty Level
  Mean = 30,000 miles
  STD = 5,000 miles
  Q1: If the level of warranty is set at 12,000 miles, about what % of tires will be returned under the warranty?
  Q2: If we can accept that up to 3.0% of tires can be returned under warranty, what should be the warranty level?
  [Figure: normal density curve, horizontal axis from 0 to 60]

Normal Probabilities

  z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
  0.0  .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
  0.1  .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
  0.2  .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
  0.3  .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
  0.4  .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879
  0.5  .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
  0.6  .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
  0.7  .2580  .2611  .2642  .2673  .2704  .2734  .2764  .2794  .2823  .2852
  0.8  .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
  0.9  .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
  1.0  .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
  1.1  .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
  1.2  .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
  1.3  .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
  1.4  .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319
  1.5  .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
  1.6  .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
  1.7  .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
  1.8  .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
  1.9  .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767
  2.0  .4772  .4778  .4783  .4788  .4793  .4798  .4803  .4808  .4812  .4817
  2.1  .4821  .4826  .4830  .4834  .4838  .4842  .4846  .4850  .4854  .4857
  2.2  .4861  .4864  .4868  .4871  .4875  .4878  .4881  .4884  .4887  .4890
  2.3  .4893  .4896  .4898  .4901  .4904  .4906  .4909  .4911  .4913  .4916
  2.4  .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936
  2.5  .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952
  2.6  .4953  .4955  .4956  .4957  .4959  .4960  .4961  .4962  .4963  .4964
  2.7  .4965  .4966  .4967  .4968  .4969  .4970  .4971  .4972  .4973  .4974
  2.8  .4974  .4975  .4976  .4977  .4977  .4978  .4979  .4979  .4980  .4981
  2.9  .4981  .4982  .4982  .4983  .4984  .4984  .4985  .4985  .4986  .4986
  3.0  .4987  .4987  .4987  .4988  .4988  .4989  .4989  .4989  .4990  .4990

Sampling Distribution (p.6)
  The sampling distribution of a statistic is the probability distribution for all possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.
  When the sample size is large, what is the sampling distribution of the sample mean / sample proportion / the difference of two sample means / the difference of two sample proportions?
  NORMAL !!!

Central Limit Theorem (CLT) (p.6)
  If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu_{\bar{X}} = \mu,\ \sigma^2_{\bar{X}} = \sigma^2 / n)$.
  Sample: $X_1, X_2, \ldots, X_n$ gives $\bar{X}$
  $P(a < \bar{X} < b) = ?$

Central Limit Theorem (CLT) (p.6)
  If $X \sim$ any distribution with mean $\mu$ and variance $\sigma^2$, then $\bar{X} \sim N(\mu_{\bar{X}} = \mu,\ \sigma^2_{\bar{X}} = \sigma^2 / n)$ for large n.
  Sample: $X_1, X_2, \ldots, X_n$ gives $\bar{X}$
  $P(a < \bar{X} < b) = ?$

Standard Deviations
  Population standard deviation: $\sigma_X$ or simply $\sigma$.
  Sample standard deviation: $s_X$ or simply $s$.
  Standard deviation of sample means (aka standard error): $\sigma_{\bar{X}}$
  Standard deviation of sample proportions (aka standard error): $\sigma_{\hat{p}}$
  Relationships:
    $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$, estimated by $\hat{\sigma}_{\bar{X}}$ or $s_{\bar{X}} = s_X / \sqrt{n}$
    $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$, estimated by $\hat{\sigma}_{\hat{p}}$ or $s_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$

Statistical Inference: Estimation
  Population
  Research Question: What is the parameter value?
  Sample of size n
  Tools (i.e., formulas): Point Estimator, Interval Estimator

Confidence Interval Estimation (p.7)

Example
  A random sampling of a company's monthly operating expenses for a sample of 12 months produced a sample mean of $5474 and a standard deviation of $764.
  Construct a 95% confidence interval for the company's mean monthly expenses.

Statistical Inference: Hypothesis Testing
  Population
  Research Question: Is the claim supported?
  Sample of size n
  Tools (i.e., formulas): z or t statistic

Hypothesis Testing (p.9)

Example
  A bank has set up a customer service goal that the mean waiting time for its customers will be less than 2 minutes. The bank randomly samples 30 customers and finds that the sample mean is 100 seconds. Assuming that the sample is from a normal distribution and the standard deviation is 28 seconds, can the bank safely conclude that the population mean waiting time is less than 2 minutes?

Margin of Error (B)
  (point estimator) $\pm$ (multiplier) $\times$ (std of point estimator)
  B: margin of error
  What does B tell us about the point estimator?
  How do we reduce the value of B?
    $\bar{X} \pm z_{\alpha/2} \, \sigma / \sqrt{n}$
    $\bar{X} \pm t_{\alpha/2} \, s / \sqrt{n}$
    $\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$

Relations among B, n, and the Confidence Level
  B (margin of error)
  n (sample size)
  Confidence Level (e.g., 90%, 95%)
  How to reduce B?

Estimation in Practice
  Determine a confidence level (say, 95%).
  How good do you want the estimate to be? (define margin of error)
  Use formulas (p.8) to find a sample size that satisfies the pre-determined confidence level and margin of error.

  Parameter $\mu$
    Sample Size Needed: $n = (z_{\alpha/2} \, \sigma / B)^2$ or $n = (z_{\alpha/2} \, s / B)^2$
    1. Replace $\sigma$ or $s$ with the one from a previous study, or
    2. Estimate it by range/4 or range/6.
  Parameter $p$
    Sample Size Needed: $n = (z_{\alpha/2} / B)^2 \, p(1-p)$
    1. Use the p from a similar study or previous experiment.
    2. Be conservative. Use p = 0.5.

Accuracy Gained by Increasing the Sample Size (p.8)
  A 95% Confidence Interval for p: $n = \left(\frac{1.96}{B}\right)^2 \cdot \frac{1}{2} \cdot \frac{1}{2}$

  Margin of Error (B)    Sample Size (n)
  7%                     196
  6%                     266
  5%                     384
  4%                     600
  3%                     1067
  2%                     2401
  1%                     9604
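
The sample sizes listed for a 95% confidence interval for p can be recomputed from the formula $n = (z_{\alpha/2}/B)^2 \, p(1-p)$ with z = 1.96 and the conservative choice p = 0.5. The following is a minimal Python sketch under those assumptions. The function names `sample_size_for_proportion` and `margin_of_error` are illustrative and do not come from the course materials.

```python
import math


def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Raw (unrounded) n for which z * sqrt(p(1-p)/n) equals the margin B.

    Assumptions: z = 1.96 (95% confidence) and p = 0.5 (conservative),
    as on the slide; `margin` is a proportion, e.g. 0.05 for 5%.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be a proportion between 0 and 1")
    return (z / margin) ** 2 * p * (1 - p)


def margin_of_error(n, z=1.96, p=0.5):
    """Margin of error B = z * sqrt(p(1-p)/n) for a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


# Recompute the table: raw value and the value rounded up to a whole unit.
# The small tolerance keeps floating-point noise (e.g. 196.00000000000003)
# from pushing an exact answer up by one.
for pct in (7, 6, 5, 4, 3, 2, 1):
    raw = sample_size_for_proportion(pct / 100)
    print(f"B = {pct}%  raw n = {raw:.2f}  rounded up = {math.ceil(raw - 1e-9)}")
```

The listed values appear to drop the fractional part of the raw result (for B = 6% the formula gives about 266.78). Rounding up instead is the conservative convention, since it guarantees the margin of error does not exceed B.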