STATISTICS FOR MANAGERS
LECTURE 3: LOOKING AT DATA AND MAKING INFERENCES

1. LOOKING AT DATA
Central part of statistics: describing/summarizing data
Take into account that data come in different types, for example:
    Sales
    Security rating
    Sector

1.1 TYPES OF DATA
Qualitative/categorical
    Attribute (nominal) data
    Ranked (ordinal) data
Quantitative/numerical
Different types of data require different treatment
One can use:
    Graphical summaries
    Numerical summaries

1.2 QUALITATIVE DATA
Graphical summaries
    Pie chart
    Bar chart
    Ordered bar chart
Numerical summaries
    Frequency tables
    Percentage tables

1.3 QUANTITATIVE DATA
Graphical summaries
    Run chart. Example: stock prices
    Histogram. Example: tick data
    Box plot
Numerical summaries
    Arithmetic mean
    Median
    Standard deviation
    Quartiles

1.3.1 RUN CHART
For data collected over time (time series)
X-axis: date or number of the data point
Y-axis: numerical value of the data point
Things to look for
    Trends
    Seasonality
    Cycles
    Outliers

1.3.1 RUN CHART (cont.)
[Figure 2. Ratio of survey income to NAS consumption per capita. Provincial averages over time, 1988 to 2001. Series: weighted by population; unweighted.]

1.3.1 RUN CHART (cont.)
[Figure 6. Day 27/02/97: transactions-clock time relationship. Axes: transactions; seconds after 9 a.m.]

1.3.2 HISTOGRAM
Determine the range of the data
Decompose it into bins of equal width
Count how many data points fall within each bin
Construct a bar chart based on these counts
Only problem: one has to choose the width of the bins
Allows one to judge
    Center/location
    Spread/variation
    Symmetry
    Outliers

1.3.2 HISTOGRAM (cont.)
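The binning steps of the histogram slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_counts`, the choice of 4 bins, and the tick data are all hypothetical and do not come from the lecture.

```python
def histogram_counts(data, num_bins):
    """Count how many points fall into each of num_bins equal-width bins."""
    lo, hi = min(data), max(data)          # step 1: range of the data
    if hi == lo:
        raise ValueError("all values are equal, so no bin width can be formed")
    width = (hi - lo) / num_bins           # step 2: equal-width bins
    counts = [0] * num_bins
    for x in data:                         # step 3: count points per bin
        # The largest value would land one past the last bin, so clamp it.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical price changes in ticks, invented for illustration only.
ticks = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
edges, counts = histogram_counts(ticks, 4)

# Step 4: a crude text bar chart built from the counts.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f})  {'#' * c}")
```

Changing `num_bins` changes the bin width, which is exactly the choice the slide flags as the one open problem: too few bins hide the shape, too many make it noisy.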
[Figure 12. Day 22/02/97: frequency of price changes (ticks). Histogram of price changes from -4 to 4 ticks.]

1.3.3 BOX PLOT
Packs a lot of information into a single plot
The box extends from Q1 to Q3
A line inside the box indicates the median
Whiskers extend to the lowest and highest observations that are not outliers
Outliers are denoted by asterisks
One can compare data sets by lining up their box plots

1.3.3 BOX PLOT (cont.)
[Figure 3. Box plot of the ratio over time, 1988 to 2001.]

1.3.4 LOCATION
Mean: sum up all the data and divide by the number of points
Median
    Sort all the data from smallest to largest
    Take the middle one (for an odd number of data points)
    Take the average of the middle two (for an even number of data points)

1.3.4 LOCATION (cont.)
Mean versus median
The median is more robust than the mean: it is less affected by extreme observations
As a function of the symmetry of the data
    • Skewed to the left: mean < median
    • Symmetric: mean approximately equal to median
    • Skewed to the right: mean > median
For skewed data the median is a more typical observation

1.3.5 SPREAD
Standard deviation
    Measures a typical deviation from the mean
    Do not bother to compute it yourself. Let Excel or any other program do it for you.
Inter-quartile range
    Q1 is the median of the bottom half of the data
    Q3 is the median of the top half of the data
    IQR = Q3 - Q1

1.3.6 OUTLIER DETECTION
Graphically
    Use a histogram
    Look for points away from the rest
Numerically
    Points more than 3 standard deviations away from the mean
    Points more than 1.5*IQR below Q1 or above Q3

2. SAMPLING
All statistical information is based on data
The process of collecting data is called sampling
It is important to do it right
Not everybody seems to understand this importance

2. GENERAL SITUATION
We study a population
    It can be a population in the strict sense, but it could also be an experiment
We are interested in a certain characteristic of the population (a parameter)
We want to learn as much as possible about the parameter

2. EXAMPLES
Population of Beijing
    What is the average income?
    What percentage speak Cantonese?
    What percentage have Internet access?
    What is the average price per square meter? (300,000 euros buy only 174 square meters.)
    What percentage of people have a DVD player?

2. BASIC PROBLEM
Most populations are very large, or even infinite
Hence it is typically impossible to determine a parameter exactly (and sometimes infeasible from a cost perspective)
But it is possible to learn something about a parameter
By collecting a sample from the population we can obtain information
But the quality of the information can only be as good as the quality of the sample

2. GOOD SAMPLE, IS THIS HARD?
The sample has to be representative of the population
In collecting data, we must not favor (or disfavor) any particular segment of the population
If we do, we get biased samples
Biased samples yield biased estimates
Example of biased samples: Internet polls

2. NO VOLUNTEERS PLEASE
A sample into which people have entered by their own choice is called a voluntary response (or self-selected) sample
This typically happens when polls are posted on the Internet, on TV, etc.
The scheme favors people with strong opinions
The resulting sample is rarely representative of the population
As so often, you get what you pay for! (although sometimes they do pay for it!)

2. HOW TO DO IT RIGHT
Analogy
    Have one ball per member of the population
    Put all the balls in a big urn
    Mix well
    Take out n balls
The result is called a simple random sample

2. DO IT RIGHT...
There are other ways to get representative samples:
    Stratified sampling
    Systematic sampling
    Cluster sampling (multistage)

2. ... BUT AN ESTIMATE IS JUST THAT
We can estimate a parameter from the sample (a mean or a proportion) ...
but an estimate is not equal to the parameter! ...
because a sample is not equal to the population
We must be aware of sampling error
Many people are not! They sell us estimates as if they were parameters. Shame on them.
We will do it right.

3. BASIC ESTIMATION
General estimation
    We are interested in a population parameter
    We collect a random sample
    In a first step, we estimate the parameter. This is usually straightforward.
    In a second step, we deal with the sampling error. This requires more work but it is worthwhile.

3.1. ESTIMATING A PROPORTION
We are interested in a population proportion p
We collect a random sample of size n
We compute the sample proportion $\hat{p}$
This is a natural estimator for p, but due to sampling error it is not equal to the true parameter p
Goal: quantify the sampling uncertainty contained in the estimator of p
Intuition: the larger n, the smaller the uncertainty

3.1. ESTIMATING A PROPORTION
From probability theory we know that the central limit theorem applies, under some assumptions
For n large, with probability 95% the population proportion p will lie in the interval
    $\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}$
For the interval to be trusted we require
    $n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$

3.1. ESTIMATING A PROPORTION
A confidence interval has the following general form
    CI = estimator ± constant × standard error (SE)
       = estimator ± margin of error (ME)
For a proportion
    $SE = \sqrt{\hat{p}(1-\hat{p})/n}$
The SE does not depend on the confidence level, but the ME does because of the constant, which is often abbreviated as z

3.1. ESTIMATING A PROPORTION
How is the ME affected by its various inputs?
    $ME = z \sqrt{\hat{p}(1-\hat{p})/n}$
1. As the confidence level increases, the ME goes up
2. As the estimator moves towards 0.5, the ME goes up
3. As n increases, the ME goes down
We control the confidence level and n, but not the estimator of p

3.1. ESTIMATING A PROPORTION
Want a CI with a specified level and a specified ME?
How large a sample size n is needed?
Use ME and solve for n
    $ME = z \sqrt{\hat{p}(1-\hat{p})/n}$
Solution:
    $n = \hat{p}(1-\hat{p}) \left( z / ME \right)^2$
Catch-22: we have not collected the sample yet, and therefore the estimate of p is not available yet.
Solutions:
1. Worst-case scenario: estimator = 0.5 (this maximizes $\hat{p}(1-\hat{p})$)
2. Use a guess based on previous information

WHAT CONFIDENCE LEVEL?
You may want a confidence level other than 95%
Most common: 90%, 95% and 99%
The formula for the CI is the same; you only change the constant 1.96
    Conf. level    90%      95%     99%
    Constant z     1.645    1.96    2.576
A higher confidence level gives a wider interval

3.2. ESTIMATING A MEAN
We are interested in a population mean and use the sample mean $\bar{X}$ as estimator
    CI = estimator ± constant × standard error (SE)
       = estimator ± margin of error (ME)
For a mean, with s the sample standard deviation
    $SE = s/\sqrt{n}$
    $CI = \bar{X} \pm z \, s/\sqrt{n}$
Rule of thumb: need more than 50 observations to trust this interval

3.2. ESTIMATING A MEAN
How is the margin of error (ME) affected by its various inputs?
    $ME = z \, s/\sqrt{n}$
1. As the confidence level increases, the ME goes up
2. As s increases, the ME goes up
3. As n increases, the ME goes down
We control n and the confidence level, but not s

3.2. ESTIMATING A MEAN
Want a CI with a specified level and a specified ME?
How large a sample size n is needed?
Use ME and solve for n
    $ME = z \, s/\sqrt{n}$
Solution:
    $n = \left( z s / ME \right)^2$
Catch-22: we have not collected the sample yet, and therefore the estimate s is not available yet.
In this case there is no worst-case scenario. Use a guess based on previous information.

3.3. HYPOTHESIS TESTING
If you care about whether a parameter is equal to a certain prespecified value, there is an alternative to hypothesis testing
Just check whether the prespecified value is contained in the confidence interval

3.3. HYPOTHESIS TESTING
If the prespecified value is contained in the CI
    It is one of the (many) plausible values
    So we can only make a weak positive statement
If the prespecified value is not contained in the CI
    It is not one of the plausible values
    We can make a strong negative statement

3.3. CI PERFECT SUBSTITUTE
We wonder if a parameter is equal to a prespecified value?
The technique of hypothesis testing gives a "yes-or-no" answer (at a certain level of significance)
We can get the same answer from a confidence interval (at the corresponding level of confidence) ...
But in addition we get the range of all plausible values! This is valuable information.
Moral: a confidence interval tends to be safer and more informative than hypothesis testing

3.4. CAVEAT
Our confidence intervals are simple, yet powerful. But you can't use them blindly!
Two conditions to trust them
    We need a large sample
    We need a random sample
Data collected over time are usually NOT a random sample: today's data point is usually related to yesterday's data point
Stock returns are an exception to this rule
Small samples and time series are for the pros!

3.5. WHAT ABOUT OTHER PARAMETERS
We have covered confidence intervals for
    Population proportions
    Population means
Both are based on the CLT
There are other interesting parameters
    Population median
    Population standard deviation
    etc.
Unfortunately they cannot be handled by the CLT
CIs can be constructed, but the corresponding techniques are more difficult
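As a recap, the interval formulas of sections 3.1 to 3.3 can be sketched in Python. This is a minimal illustration, not part of the lecture: every function name and every number in the examples is hypothetical, and the z constants are the usual rounded standard normal values.

```python
import math

# Rounded standard normal constants z for the most common confidence levels.
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}


def proportion_ci(p_hat, n, level=0.95):
    """CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need n*p_hat >= 10 and n*(1 - p_hat) >= 10")
    me = Z[level] * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me, p_hat + me


def mean_ci(x_bar, s, n, level=0.95):
    """CI for a mean: x_bar +/- z * s / sqrt(n). Trust it only for large n."""
    me = Z[level] * s / math.sqrt(n)
    return x_bar - me, x_bar + me


def sample_size_proportion(me, level=0.95, p_guess=0.5):
    """Smallest n reaching margin of error me; p_guess=0.5 is the worst case."""
    return math.ceil((Z[level] / me) ** 2 * p_guess * (1 - p_guess))


def sample_size_mean(me, s_guess, level=0.95):
    """Smallest n reaching margin of error me, given a prior guess for s."""
    return math.ceil((Z[level] * s_guess / me) ** 2)


def is_plausible(value, ci):
    """CI-based check: is the prespecified value inside the interval?"""
    low, high = ci
    return low <= value <= high


# Hypothetical survey: 30% of 400 respondents answer "yes".
ci = proportion_ci(0.30, 400)
print(ci)                         # roughly (0.255, 0.345)
print(is_plausible(0.25, ci))     # 0.25 lies outside the interval
print(sample_size_proportion(0.03))
print(sample_size_mean(2, 15))
```

With these made-up inputs, a margin of error of 0.03 at the 95% level needs 1068 respondents in the worst case, and a margin of error of 2 with a guessed s of 15 needs 217 observations. Rounding n up rather than to the nearest integer keeps the margin of error at or below the target.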