City University
MSc Regulation and Competition
Quantitative Techniques
QT week 6: Sampling distributions

In this lecture
  Inference
  Population versus sample
  Sampling distributions
  Distribution of the sample mean
  Likelihoods and probabilities

Inference

So far we have been looking at how different kinds of casino-like data generating processes can produce data. Statistical inference is the art of looking at the data and making guesses about the underlying population or data generating process.

Figure 6.1: The process of inference involves picking a box…
[Diagram: observations or data (sample), linked by question marks to boxes labelled "Population or Data generation process 1", 2, 3, 4, …]

Two key aspects of statistical inference are estimation and hypothesis testing. Before we get on to those we have to step back a little.

Populations

A rough definition of a population: “Set of all things under consideration”. However, the term is used in two slightly different senses: a) the set of individual items in which we are interested and b) one or more measured attributes of that set of items.

Table 6.1: Examples of populations

Group                                              Attribute, e.g.
People entitled to vote in the next election       Height; voting intention
All past and future students of City University    Ownership of laptops
Potential applicants to course X                   Current country of residence
Grains of sand in a bucket                         Mass; hardness
Fish in the North Sea                              Mercury levels
Owners of Porsche 911s                             IQ score
Present and future members of the human species    Length of life; ability to jump
Outcomes of random number generating process       Value

The attribute can be any kind of data discussed in Lecture 1. The population can be finite or infinite, as can the set of possible outcomes. A “population” in the normal sense can be regarded as a sample in the statistical sense.
The people currently living on Orkney might be taken as a sample of all people living north of Edinburgh.

Samples

The main purpose of discussing populations is to distinguish them from samples. Normally samples are all we observe, and an important job of statistics is to make inferences about attributes of the population as a whole from measured characteristics of the sample. This is one definition of statistical inference.

Sampling can be systematic or random. With simple random sampling every member of the population has the same chance of being chosen. We shall focus on the characteristics of simple random sampling.

There are several other approaches to sampling. For example, stratified sampling can, if done properly, give more accurate results than simple random sampling.

The opposite of random sampling is selective sampling. For example, one’s sample might be based only on people passing Piccadilly Circus between 3 and 4 pm on Friday 13th. Or people may select themselves for interview, etc. Selective sampling potentially gives rise to selection bias.

Sometimes quota sampling is used as a means of both stratifying a sample and reducing the selection bias that arises because random sampling may be difficult to achieve in practice.

Survey specialists need to understand these biases and design their surveys accordingly. We shall ignore (“abstract from”) these problems for the time being.

Sampling distributions

Statistics such as the mean and standard deviation of a sample vary from one sample to the next, so they have probability distributions (or density functions) of their own. These are called sampling distributions.

Suppose we take a sample of size 1 from a population. How will it be distributed? Exactly like the original population.

Now suppose we take a sample of size 2. Each member will be distributed like the population. But how will the sample mean be distributed?

How will the mean be distributed as the sample size increases?
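One way to explore these questions empirically is by simulation. The sketch below is a minimal illustration and not part of the original notes: it assumes the population is the outcome of a fair six-sided die (an arbitrary choice of data generating process), and the helper names `roll_die` and `sample_means` are hypothetical. It draws many samples of each size and summarises how the resulting sample means are spread.

```python
import random
import statistics


def roll_die(rng):
    """Assumed data generating process: one roll of a fair six-sided die."""
    return rng.randint(1, 6)


def sample_means(draw, n, replications, rng):
    """Return `replications` sample means, each computed from a sample of size n."""
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(replications)
    ]


rng = random.Random(6)  # fixed seed so the run is repeatable
summary = {}
for n in (1, 2, 10, 50):
    means = sample_means(roll_die, n, 5000, rng)
    # Mean and variance of the simulated sampling distribution for this n
    summary[n] = (statistics.fmean(means), statistics.pvariance(means))
    print(f"n={n:>2}  mean of sample means={summary[n][0]:.3f}  "
          f"variance of sample means={summary[n][1]:.3f}")
```

For n = 1 the simulated means are just individual rolls, so their spread is that of the population itself. For larger n the mean of the sample means should stay close to 3.5 while their variance shrinks; plotting a histogram of `means` for each n shows how the shape changes.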
The Central Limit Theorem says that, as the sample size n increases, these means will be distributed:
a) approximately normally
b) with a mean equal to the population mean
c) with a variance equal to the population variance/n.

Lab 5 and Coursework 1 ask you to investigate the Central Limit Theorem empirically using the skills you have built up so far in the labs.

We can use this knowledge to work out confidence intervals. For example, suppose our sample estimate of the mean IQ level of a group of 16 students in a single year of the course is 120, and we know that the population standard deviation is 10. We can then get an estimate of the 95% confidence interval for the mean of the population from which the students are drawn.

This is the process: the explanation will follow.
1. Take the sample mean Xbar.
2. Get the population variance and divide by n. Take the square root to get the standard deviation of the sample mean, sxbar.
3. Find the critical value z of the Normal distribution from the Tables, e.g. Appendix B1 in Ashenfelter et al. In this case 1.96.
4. Then the 95% confidence interval goes from Xbar - z × sxbar to Xbar + z × sxbar.

In the present case
Xbar = 120
z = 1.96
sxbar = 10/√16 = 2.5
z × sxbar = 1.96 × 2.5 = 4.9

So the 95% confidence interval runs from 115.1 to 124.9.

Supplementary question: what is “the population from which the students are drawn”?

Likelihoods and probabilities

Although this looks straightforward enough, strictly speaking we are talking about likelihoods rather than probabilities here. Econometricians make more of a distinction here than normal people.

Roughly speaking, a probability refers to an event which has not happened and which therefore can be influenced by random factors. A likelihood refers to something that has happened or is not subject to future random factors, but about which we have incomplete knowledge.
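The four-step confidence interval procedure for the IQ example can be checked numerically. This is a minimal sketch rather than the method used in the labs: the function name `normal_confidence_interval` is hypothetical, and it uses Python's standard-library `statistics.NormalDist` to look up the critical value instead of printed tables. It assumes, as in the example, that the population standard deviation is known.

```python
import math
from statistics import NormalDist


def normal_confidence_interval(sample_mean, population_sd, n, level=0.95):
    """Confidence interval for a population mean when the population
    standard deviation is known (normal critical values)."""
    # Step 2: standard deviation of the sample mean (sxbar)
    se = population_sd / math.sqrt(n)
    # Step 3: two-sided critical value z, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Step 4: Xbar - z * sxbar to Xbar + z * sxbar
    margin = z * se
    return sample_mean - margin, sample_mean + margin


lower, upper = normal_confidence_interval(sample_mean=120, population_sd=10, n=16)
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")  # 115.1 to 124.9
```

Changing `level` to 0.99 or `n` to a larger value shows how the width of the interval responds to the confidence level and to the sample size.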
Example: Consider the table you may have produced in lab 4 showing the probabilities of different numbers of heads in five tosses depending on the bias in the coin.

Table 6.2: The binomial table as a function g(h,p)

                        Number of heads h               Probabilities →
p        0          1          2          3          4          5
0        1          0          0          0          0          0
0.1      0.59049    0.32805    0.0729     0.0081     0.00045    0.00001
0.2      0.32768    0.4096     0.2048     0.0512     0.0064     0.00032
0.3      0.16807    0.36015    0.3087     0.1323     0.02835    0.00243
0.4      0.07776    0.2592     0.3456     0.2304     0.0768     0.01024
0.5      0.03125    0.15625    0.3125     0.3125     0.15625    0.03125
0.6      0.01024    0.0768     0.2304     0.3456     0.2592     0.07776
0.7      0.00243    0.02835    0.1323     0.3087     0.36015    0.16807
0.8      0.00032    0.0064     0.0512     0.2048     0.4096     0.32768
0.9      0.00001    0.00045    0.0081     0.0729     0.32805    0.59049
1        0          0          0          0          0          1
Likelihood ↓

We can view Table 6.2 in two ways:
a) going along the rows, for a given value of p, what is the probability associated with a given number of heads out of five? i.e. g(h|p);
b) going down the columns, for a given number of heads, what is the likelihood that it was generated by a coin with a particular characteristic p? i.e. g(p|h). The latter is sometimes written as L(p|h) (L for likelihood).

Question: if I observe two heads out of five, what value of p maximises the likelihood function?

Question: For the above example of the confidence interval, identify and draw a) a probability distribution when the population mean = 115.1 b) a relevant likelihood function when Xbar = 120.

Reading
Ashenfelter et al., Chapter 6.
Salvatore and Reagle, Chapter 4.

Exercises
Ashenfelter 6.11, numbers 1, 2, 3, 4, 5, 6, 8.
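Table 6.2 and its two readings can be reproduced with a short script. This is a minimal sketch, not the lab 4 spreadsheet: the names `binomial_probability`, `table` and `likelihood_given_heads` are hypothetical, and the grid of p values (0 to 1 in steps of 0.1) is taken from the table. Reading a row gives the probabilities g(h|p) for a fixed p; reading a column gives the likelihoods L(p|h) for a fixed number of heads.

```python
from math import comb


def binomial_probability(h, p, n=5):
    """Probability of h heads in n tosses of a coin with P(head) = p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


p_values = [i / 10 for i in range(11)]  # 0, 0.1, ..., 1 as in Table 6.2

# table[p][h] holds g(h, p): one row per p, one column per number of heads
table = {p: [binomial_probability(h, p) for h in range(6)] for p in p_values}

# Along a row: probabilities for a given p; each row sums to 1
for p in p_values:
    print(f"p={p:.1f}  " + "  ".join(f"{g:.5f}" for g in table[p]))


def likelihood_given_heads(h):
    """Down a column: likelihood of each p on the grid, given h heads observed."""
    return {p: table[p][h] for p in p_values}


# Locate the grid value of p with the highest likelihood for two heads
column = likelihood_given_heads(2)
best_p = max(column, key=column.get)
print(f"Grid value of p maximising L(p | h=2): {best_p}")
```

Note that the columns, unlike the rows, do not sum to 1, which is one practical sign that a likelihood function is not a probability distribution over p.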