Sample Size Determination

Population A: 10,000, sample 10%, sample size 1,000
Population B: 5,000, sample 15%, sample size 750

Sampling
• The process of obtaining information from a subset (sample) of a larger group (population)
• The results for the sample are then used to make estimates about the larger group
• Faster and cheaper than asking the entire population
• Two keys
  1. Selecting the right people
     - They have to be selected scientifically so that they are representative of the population
  2. Selecting the right number of the right people
     - To minimize sampling error, i.e. choosing the wrong people by chance

Sample size: selecting the right number of the right people
Three issues
1. Financial
2. Managerial
3. Statistical

Cost of research
Generally, the larger the sample size, the smaller the statistical error, but the greater the cost, both financial and in terms of managerial resources.

Subgroups

          Male   Female   Totals
<35       100    100      200
35+       100    100      200
Totals    200    200      400

The number of subgroups to be analyzed affects the size of the sample needed. For a fixed total sample, each additional subgroup leaves fewer respondents per cell, so the sampling error within each subgroup increases and it becomes harder to tell whether a difference between two groups is real or due to error.

Determining sample size
Balance between financial and statistical issues
1. What can I afford?
2. Rule of thumb
   - past experience
   - historical precedent
   - gut feeling
   - some consideration of sampling error
3. Make-up of subgroups (cells): what statistical inferences do you hope to make between subgroups? (It is rare to fall below 20 for a subgroup.)
4. Statistical methods

A critical factor is the size of the expected difference or change to be measured: the smaller it is, the larger the sample needs to be.

Statistical determination
Three pieces of information required
1. An estimate of the population standard deviation
2. The acceptable level of sampling error
3.
The desired level of confidence that the sample result will fall within a certain range (result ± sampling error) of the true population value

Normal Distribution

[Figure: normal curve centered on the mean µ with standard deviation σ, running from -∞ to ∞, with the area between two values a and b marked; the example variable is IQ]

The height of a normal distribution can be uniquely specified mathematically in terms of two parameters: the mean (µ) and the standard deviation (σ).
The total area under the curve is equal to 1, i.e. it takes in all observations.
The area of a region under the normal distribution between any two values equals the probability of observing a value in that range when an observation is randomly selected from the distribution.
For example, if IQ has a mean of 100 and a standard deviation of 15, on a single draw there is a 34% chance of selecting a person with an IQ between 100 and 115.

Normal Distributions
• The curve is basically bell shaped, running from -∞ to ∞
• Symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails
• Mean, median and mode coincide
• Normal distributions differ in how spread out they are

Standard Normal Distribution (z)
Any normal distribution can be converted into a standard normal distribution by a simple transformation formula:

Z = (value of the variable - mean of the variable) / SD of the variable

The mean is always equal to zero; the standard deviation is always equal to one.
The probabilities in the tables are always based on the standard normal distribution.

Area Under Standard Normal Curve for Z values (standard deviations) of 1, 2 and 3

Z values (standard deviations)    Area under standard normal curve (%)
+/- 1                             68.26
+/- 2                             95.44
+/- 3                             99.74

Population vs. Sample

                      Population of interest    Sample
Measure               Parameter                 Statistic
Mean                  µ                         X̄
Standard deviation    σ                         S

We measure the sample using statistics in order to draw inferences about the population and its parameters.

Sampling Distribution of the Mean
• Necessary for understanding the basis for computing sampling error for simple random samples.
• A conceptual and theoretical probability distribution of the means of all possible samples of a given size drawn from a given population, i.e. a distribution of sample means.
• If you take a sample of 100 from a population of 1,000, there is a vast number of different subsets of the population that can be drawn, and each sample will have a slightly different mean. Those means will also have a distribution.
• The Central Limit Theorem says that this distribution will more closely approximate a normal distribution the larger the sample size.

Suppose you conducted a research study
• Took a random sample of n = 100 subjects
• They tasted the new "Guacamole Doritos"
• They rated the flavor of the chip on a 7-point scale running from 1 (Too Mild) to 7 (Too Hot), with "Perfect Flavor" in the middle of the scale

Results show: X̄1 = 2.3 and S1 = 1.5
• Can you conclude that on average the target population thought the flavor was mild?
• Suppose you take a series of random samples of n = 100 subjects:
  X̄2 = 3.7 and S2 = 2
  X̄3 = 4.3 and S3 = 0.5
  X̄4 = 2.8 and S4 = 0.97
  ...
  X̄50 = 3.7 and S50 = 2

The Sampling Distribution
The means of all the samples will have their own distribution, called the sampling distribution of the mean.
It is approximately a normal distribution.
The mean of the sampling distribution of the mean equals the population mean µ, where each sample mean is X̄ = (ΣXi) / n.
The standard deviation of the sampling distribution is called the standard error of the mean: σ_X̄ = σ / √n. For a proportion π, the corresponding standard error is σp = √(π(1 - π) / n).
Often the population standard deviation σ is unknown and has to be estimated from the sample: S = √( Σ(Xi - X̄)² / (n - 1) ).

[Figure: population distribution of the Doritos flavor rating (X), with mean µ and standard deviation σ, shown above the sampling distribution of the sample mean X̄ on the 1 to 7 scale]

• What relationship does the population distribution have to the sampling distribution?

The Central Limit Theorem
Let X1, X2, ..., Xn denote a random sample selected from a population having mean µ and variance σ². Let X̄ denote the sample mean. If n is large, then X̄ has approximately a normal distribution with mean µ and variance σ²/n.
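The theorem can be checked with a small simulation. The sketch below is only an illustration: the population of 10,000 uniformly spread ratings, the 2,000 repeated samples and all variable names are assumptions made for the example, not values from the study described above. It compares the observed spread of the sample means with σ / √n.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

# Hypothetical population (an assumption): 10,000 ratings on the 1-7 scale.
population = [random.randint(1, 7) for _ in range(10_000)]
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

n = 100              # size of each sample
num_samples = 2_000  # number of repeated samples (an assumption)

# Draw many simple random samples and record each sample mean.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = sigma / n ** 0.5             # sigma / sqrt(n)

# Share of sample means within one standard error of the population mean.
share_within_one_se = sum(
    abs(m - mu) <= theoretical_se for m in sample_means
) / num_samples

print(f"population mean {mu:.3f}, mean of sample means {mean_of_means:.3f}")
print(f"sigma/sqrt(n) {theoretical_se:.3f}, observed SE {observed_se:.3f}")
print(f"share within one standard error: {share_within_one_se:.3f}")
```

Because each sample of 100 is small relative to the 10,000 in the population, sampling without replacement changes the standard error only slightly, so the observed and theoretical values should be close. The share within one standard error should land near the 68% figure from the standard normal table.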
• The Central Limit Theorem does not mean that the sample mean equals the population mean.
• It means that you can attach a probability to the sample value and decide accordingly.

The sampling distribution of the mean for simple random samples of more than 30 observations has the following characteristics:
1. The distribution is approximately a normal distribution
2. The distribution has a mean equal to the population mean
3. The distribution has a standard deviation (the standard error of the mean) equal to the population standard deviation divided by the square root of the sample size:

σ_X̄ = σ / √n

Note: The statistic is referred to as the standard error of the mean instead of the standard deviation to indicate that it applies to a distribution of sample means rather than to the SD of a sample or of the population.

Sampling Distribution of Proportions
• We are often interested in estimating proportions or percentages rather than means
• Is the sample proportion representative of the population proportion?
  - The percentage of the population that has used the product
  - The percentage of the population that has purchased over the Internet in the last month
  - The proportion of men who read a particular magazine
• The sampling distribution of the proportion approximates a normal distribution
• The mean proportion of all possible samples is equal to the population proportion
• The standard error of a sampling distribution can be calculated

• In practice we want to make inferences from our sample about the population it was drawn from
• What is the probability that our sample of any given size will produce an estimate that is within one standard error (plus or minus) of the true population value?
• The answer is 68.26%: there is a 68.26% probability that any one sample from a particular population will produce an estimate of the population mean that is within +/- one standard error of the true value.
• This is because 68.26% of all sample means from a given population fall in this range.
• There is a 95.44% probability that the mean from any one sample will be within +/- two standard errors.

Sampling Distribution of Means

Point Estimates
• The sample mean is the best point estimate of a population mean.
• The sample mean is most likely to be close to the population mean, but it could be any of the means in the sampling distribution, including one that is far from the population mean.
• The distance between the sample mean and the population mean is the sampling error.
• Only a small percentage of samples will have the same mean as the population (i.e. a sampling error of zero).

Interval Estimates
• Interval estimates are preferred.
• An interval estimate is a range of values within which the true population mean is estimated to fall.
• Normally we state the size of the interval, plus the probability that the interval will include the true population mean.
• The probability is called the confidence level (e.g. 95%).
• The interval is called the confidence interval (e.g. between 72 and 98).

Sample Confidence
The "probability" that we can take results as an "accurate representation" of the universe (i.e. that "sample statistics" are generalisable to the real "population parameters").
Typically a 95% probability (i.e. 19 times out of 20 we would expect results in this range).

Example: We can be 95% sure that, say, 65% of a target market will name Martini's "V2" vodka in an unprompted recall test, plus or minus 4%.
The same statement with its parts labeled: We can be 95% sure (level of confidence) that, say, 65% (predicted result) of a target market (of a given total population) will name Martini's "V2" vodka in an unprompted recall test, plus or minus 4% (to a known margin of error).

95% confidence
If we do the same test 20 times, we would expect the results to fall between 61% and 69% (i.e. 65% +/- 4%) in about 19 of them.
If we lower the confidence level, we also lower the sampling error, e.g.
at a 90% confidence level, the result might be between about 62% and 68%, since +/- 4% × 1.645 / 1.96 ≈ +/- 3.4% (a tighter range, but we are less sure that it captures the real population value).

Implications for sample size (given that reliability and validity hold)
Above a certain size, little extra information is gathered by increasing the sample size.
Generally, for large populations there is no relationship between the size of the population and the size of the sample needed to estimate a particular population parameter with a particular error range and level of confidence.

To determine sample size we need three pieces of information:
1. The acceptable level of sampling error
2. The acceptable level of confidence
3. The estimate of the population standard deviation

Sample Size Determination
Three statistical determinants of sample size
• DEGREE OF CONFIDENCE
  - Statistical confidence
  - 95% confidence, or the .05 level of significance
• DEGREE OF PRECISION
  - Accuracy in estimating the population proportion
  - +/- $5.00 versus +/- $1.00
  - +/- 10% versus +/- 5%
• VARIABILITY IN THE POPULATION
  - To what degree do the sampling units differ?

We can choose an error range (e.g. +/- 5%).
We can set a confidence level (e.g. 95%).
But without knowing the spread of results (i.e. the standard deviation for the population) we cannot work out the sample size required.

So how can we estimate the population standard deviation before selecting the sample?
• Pilot tests
• Guess
• Previous experience
• Secondary data

n = Z²σ² / E²

where
Z = the number of standard errors corresponding to the level of confidence
σ = population SD
E = acceptable amount of sampling error

Example
• Number of fast food restaurant visits in the past month
• We need our estimate to be within 1/10 (0.1) of a visit of the population average (E)
• We need to be 95.44% confident that the true population mean falls in the interval defined by the sample mean plus or minus E (i.e.
within 2 standard errors), so Z = 2
• Standard deviation: guess at 1.39 visits

With E = 0.1 (one tenth of a visit):

n = Z²σ² / E² = 2² × 1.39² / 0.1² = 4 × 1.93 / 0.01 = 7.72 / 0.01 = 772

Sample Size Determination
Sample size must increase
• To be more confident
• To be more precise
• If the population is more variable
Too big: it's a waste of money
Too small: you cannot make a big decision

Significance level
In hypothesis testing, the significance level is the criterion used for rejecting the null hypothesis. The significance level is used as follows: First, the difference between the results of the experiment and the null hypothesis is determined. Then, assuming the null hypothesis is true, the probability of a difference that large or larger is computed. Finally, this probability is compared to the significance level. If the probability is less than or equal to the significance level, then the null hypothesis is rejected and the outcome is said to be statistically significant.
Traditionally, experimenters have used either the .05 level (sometimes called the 5% level) or the .01 level (1% level), although the choice of levels is largely subjective. The lower the significance level, the more the data must diverge from the null hypothesis to be significant. Therefore, the .01 level is more conservative than the .05 level. The Greek letter alpha (α) is sometimes used to indicate the significance level.

Critical value
• A critical value is the value that a test statistic must exceed in order for the null hypothesis to be rejected.
• For example, the critical value of t (with 12 degrees of freedom using the .05 significance level) is 2.18.
• This means that for the probability value to be less than or equal to .05, the absolute value of the t statistic must be 2.18 or greater.

[Figure: two-tailed test at the .05 significance level, with α/2 in each tail, critical values at -2.023 and 2.023, and a test statistic of 2.816 lying beyond the upper critical value]

The t distribution
• The t distribution is used instead of the normal distribution whenever the standard deviation is estimated.
• The t distribution has relatively more scores in its tails than does the normal distribution.
• The shape of the t distribution depends on the degrees of freedom (df) that went into the estimate of the standard deviation.
• As the degrees of freedom increase, the t distribution approaches the normal distribution.
• With 100 or more degrees of freedom, the t distribution is almost indistinguishable from the normal distribution.
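The way the t distribution approaches the normal distribution can be seen by comparing two-tailed critical values at the .05 level for several degrees of freedom. The sketch below is one possible illustration, not a method from the slides: it integrates the t density numerically with Simpson's rule and finds the critical value by bisection, using only the Python standard library. The function names, the step count and the search range are assumptions chosen for the example. In practice a statistics package or a printed t table would normally be used.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_central_area(c, df, steps=1000):
    """Area under the t density between -c and +c (Simpson's rule)."""
    h = c / steps  # integrate from 0 to c, then double by symmetry
    total = t_pdf(0.0, df) + t_pdf(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(df, alpha=0.05):
    """Two-tailed critical value: the c whose central area is 1 - alpha."""
    lo, hi = 0.0, 50.0  # assumed search range, wide enough for these cases
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Corresponding critical value for the standard normal distribution.
z_critical = NormalDist().inv_cdf(0.975)

for df in (5, 12, 30, 100):
    print(f"df = {df:3d}: t critical = {t_critical(df):.3f}")
print(f"normal:   z critical = {z_critical:.3f}")
```

For 12 degrees of freedom the result should agree with the 2.18 quoted in the critical value example, and the values should shrink toward the normal critical value as the degrees of freedom grow.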