* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture 2 handout - The University of Reading
Survey
Document related concepts
Transcript
PYPR1 lecture 2 : Populations and Samples Dr David Field General Information This lecture contains material that is crucial for understanding the rest of the course – read the text book • important sections are indicated by, e.g. – go to the workshop – download the lecture – make use of the university maths support service • Specialist statistics tutor available every Wednesday afternoon in term time from 2.00pm-4.00pm • Alternatively, in a form with your question on the website and get a reply by email • http://www.reading.ac.uk/mathssupport/ Populations and samples • At the end of this lecture we will be able to make statements of the form – “We measured the number of hours slept per night of a sample of 50 students in the UK. The mean number of hours slept was 7.2. We can be 95% confident that the population mean lies between 6.8 and 7.6 hours of sleep per night.” • We will aim to understand the logic underlying the confidence interval around the mean in the above statement • This is based on the properties of normal distributions and sampling distributions Populations and samples • If we weighed every adult domestic cat in the UK we would be able to calculate the population mean weight and the population SD – other populations of interest might be engineering students or amateur cricketers or cars • Measuring the whole population is expensive and impractical, and so scientists invariably measure only a fraction of the population of interest, using sampling • The aim of the sampling procedure is to obtain as good an estimate of the unknown true population statistics as possible • This requires the relationship between the sample you have and the unknown population to be quantified – If I weigh 100 cats, how confident can I be that my observed mean is close to the unobservable population mean? Representative and unrepresentative samples • We can only assess the relationship between a sample and an unobservable population if the sample is representative of the target population • This is an issue of study design, but it determines how broadly we can interpret our numeric statistics • If a sample of engineering students was selected exclusively from Oxford University then measures obtained from it might not be an accurate reflection of engineering students in general • There are a number of ways to obtain a representative sample, the simplest case being random selection from the entire population What does random mean? • Each time you sample a single case, every member of the underlying population had an equal chance of being selected • The classic case is rolling an unbiased dice • Each time the die is rolled you have an equal chance of the result being 1,2,3,4,5, or 6 • There are no history effects! • If you rolled 600,000 dice and recorded the results you would end up with very close to 100,000 occurrences of each outcome – What would the frequency histogram of this data look like? • In Psychology we often use opportunity samples and treat them as if they were random samples from a target population Key concept: the normal distribution • Values close to the mean of a variable are often more frequent than values far from the mean – This is true of the height of adults – It is not true of rolling a dice repeatedly • When true, and if you sample randomly from the population, this produces a bell shaped frequency histogram • Many psychological variable are normally distributed, e.g. IQ – in the case of the IQ test, it is designed to be like that • Normal distributions can be visualised using frequency histograms just like the ones from Lecture 1 UK cats Mean 5 Kg SD 0.8 Kg Carl Frederick Gaus (1777-1855) Curve shape is independent of sample size UK cats Mean 5 Kg SD 0.8 Kg The standard deviation - revision scores (pints) deviations squared deviations 1 4 5 6 9 11 -5 -2 -1 0 3 5 25 4 1 0 9 25 • The sum of the squared deviations is 64 • The mean deviation (variance) is therefore 64 /(6 – 1) = 12.8 • Square rooting the variance returns it to the original measurement units of the variable • Therefore the SD is 3.57 pints Greek cats Mean 3.55 Kg SD 0.4 Kg UK cats Mean 5.0 Kg SD 0.8 Kg • A Greek cat weighing 3.95 Kg and a British cat weighing 5.8 Kg are clearly different from each other in important ways, including their weight • But, to a statistician, they are identical in one important respect: – they occupy the same position in their respective sample distributions – relative to their sample means they are both equally “unusual” occurrences – Both can be described by “Mean + 1SD” – If you randomly select 1 cat from the 10,000 Greek and the 10,000 British cats then the probability of sampling a 5.8 Kg British cat is equal to the probability of sampling a 3.95 Kg Greek cat 6800 Greek cats UK cats Mean 5.0 Kg SD 0.8 Kg • Recall the student sleep example from the start of the lecture, and the 95% confidence interval around the mean number of hours slept? • For any population that is normally distributed 95% of all scores will fall within 1.96 SD either side of the mean • This means that if you randomly select one case, it’s score is 95% likely to fall within 1.96 SD of the mean • The confidence interval is based on the properties of normal distributions, but there are some intermediate steps to understand The standard normal distribution and z scores • The family of normal distributions has an infinite number of members, each defined by a unique combination of mean and SD • There is one particular normal distribution, called the standard normal distribution, which has a special status – It has a mean of 0 – It has a SD of 1 The standard normal distribution and z scores • One useful thing about the standard normal distribution is that scores from any other normal distribution can be converted into scores on the standard normal distribution • The converted scores are called z scores • The new scores loose their original units (e.g. Kg), and are now expressed in units of SD • This is useful for comparing between samples – If you know the z score for a Greek cat and for a British cat you see directly which cat is a relatively heavier example of it’s own population Calculating z scores • • • • • • • z = (score – sample mean) / SD of sample z score for a British cat weighing 4.7 Kg (4.7 – 5.0) / 0.8 = -0.62 z score for a Greek cat weighing 4.7 Kg (4.7 – 3.5) / 0.4 = 2.88 The Greek cat is a very large cat The British cat is fairly typical, perhaps slightly on the small side What happens when the sample is small? • In Psychology we usually work with small samples, and we often have little idea of the underlying population parameters • With a small sample, you can still calculate a mean and an SD, although the sample might be too small to assess whether the underlying population is normally distributed • Lying at the heart of statistics is the question of the relationship between populations and samples • To explore this, we can use examples where population parameters are known Population: Mean 5 Kg SD 0.8 Kg Sample of 10: Mean 4.6 Kg SD 0.7 Kg Confidence in the sample mean • Given a sample, I can produce a sample mean • Statistics is about describing the relationship between measured samples and underlying populations – A statistician will ask how good the sample mean is as an indicator of the population mean • If you have a large and representative sample, then the sample mean is such a good estimate of the population mean that it can be used interchangeably • But often we have a small sample, and no population data (or large sample) to compare it with • We can use the cats example, treating the full sample of 10,000 as the population, to explore the relationship between small samples and the population • A key point is that the mean of a large sample is more likely to lie close to the population mean than the mean of a small sample Quantifying confidence in the sample mean • The sample mean is a point estimate of the population mean – If we collected a second random sample we would have two different point estimates of the population mean – By definition, they can’t both be correct – This situation implies an underlying continuum on which point estimates can lie – A continuum can be thought of as a curve plotted on a graph, like the normal distributions you saw earlier • In statistics, we aim to convert the sample point estimate into an interval estimate – This is a range on the underlying continuum – We want to be able to say that given the sample, the population mean lies somewhere between X and Y – We will have to be content to say that we are 95% sure Where each black line crosses the x axis represents a separate point estimate of the population mean The horizontal line is a visually judged interval estimate of the population mean given the 10 samples Mean 5 Kg SD 0.8 Kg Sampling distributions • Each sub-sample of 10 cats has a mean weight and SD that is slightly different from the full population of 10,000 • Individual samples of 10 are often not normally distributed • Imagine collecting 100 separate sub-samples of 10 cats, and producing 100 sample means • The mean of the 100 sample means would be an excellent estimate of the population mean • It is possible to plot a frequency histogram of the 100 obtained sample means Sampling distributions • The frequency histogram of the 100 samples of 10 will itself be normally distributed – This means the distribution can be defined by its mean and SD • More generally, if you collect a sample, then theoretically speaking there is an underlying population of samples, of which yours is just one case – The population of sample means is normally distributed Sampling distributions • In the frequency histogram of the original sample, each case was a single cat • In the corresponding sampling distribution, each case is the mean weight of (10) cats selected randomly from the cat population • A key property of sampling distributions is that provided the individual samples have N >= 30 they are ALWAYS normally distributed – even if the actual population is skewed (e.g. reaction time) – or bimodal – if population is normal sampling distribution will also be normal for small samples < 30 • The black curves are frequency histograms of the means of samples randomly selected from the pink population distribution Populations and samples • If we had enough samples to plot a sampling distribution, then for one sample we could assess exactly how close it is to the population mean (mean of the sampling distribution) • But what if we only have one sample? – With only one sample, because sampling distributions are normally distributed we still know that the mean of the single sample is 95% likely to fall within 1.96 SD either side of the mean of a theoretical sampling distribution • Avoid confusion: – the mean of a sampling distribution IS equal to the population mean – the SD of a sampling distribution is NOT equal to the SD of the population Standard error (SE) • In statistics, the standard deviation of a sampling distribution is given a different name to distinguish it from the standard deviation of a single sample or the standard deviation of a population • It is called the standard error (SE) • Its name contains the word “error” because statisticians use it to estimate measurement error Standard error (SE) • We can be 95% sure that a sample mean will lie within + / - 1.96 SE of the mean of the distribution of sample means – this provides the 95% confidence interval from the example about how long students sleep given at the start of the lecture – 7.2 hours – 0.4 hours (1.96 * SE) = 6.8 hours is the lower bound – 7.2 hours + 0.4 hours (1.96 * SE) = 7.6 hours is the upper bound • But at the moment we have only seen how to calculate the SE if you have collected a huge number of samples of a specific size • Therefore, the problem to solve is how to find the SE from a single sample • The starting point for solving this problem is the fact that the SE of the sampling distribution shrinks as the individual samples making up the distribution increase in size – This in turn is because the mean of a large sample is more likely to be close to the population mean than the mean of a small sample • The SE of the sampling distribution is smaller when the individual samples making up the distribution are larger Standard error and confidence intervals • The previous slide illustrates that the confidence interval around a sample mean will be smaller if the sample size is bigger – smaller confidence intervals imply more accurate and therefore more useful measurements • There is a lawful relationship between the sample size and the SE of the resulting sampling distribution – the SE is halved when the sample size is quadrupled • The SE is also dependent upon the SD of the population the samples were drawn from – the SE is smaller when the population SD is smaller – As you don’t know the SD of the population, the sample SD is used as an estimate of the population SD Standard error and confidence intervals • This relationship between the SE, sample size, and the SD is captured in the following formula SE = SD of the sample Sample size Standard error and confidence intervals • Standard error (SE) = sample SD / square root of sample size • For the small sample of 10 UK cats we observed a SD of 0.7 kg • 0.7 / square root of 10 (3.16) = 0.22 Kg • One more step is required to arrive at a 95% confidence interval • The standard error is 1 SD of the sampling distribution • What proportion of samples have means that lie within 1SE of the mean of the sampling distribution? It is 68% likely that the population mean falls within 1 standard error above or below the sample mean Standard error and confidence intervals • By convention, we usually want to make the statement that we are 95% certain that the population mean lies between X and Y • Therefore, we use the properties of normal distributions, which tells us that 95% of all samples in the sampling distribution of the mean fall within 1.96 SE of the mean • The confidence interval we give around the sample mean is 1.96 * SE either side of the mean Standard error and confidence intervals • The mean of the 10 cat small sample was 4.6Kg (SD 0.7). This gives a standard error of 0.22 Kg • Confidence interval = 1.96 * 0.22 = 0.43 Kg either side of the mean – “We measured the weight of 10 adult cats in the UK. The mean weight of the sample was 4.6 Kg. We can be 95% confident that the population mean weight of adult cats in the UK lies between 4.16 and 5.03 Kg.” • The mean of the population of 10,000 cats was 5.0 Kg, and we can see that the above statement is true (but only just in this case!) Standard error and confidence intervals • It is important to remember that we accept a 5% risk that the interval estimate is wrong, and the population mean does not lie within the range of values it defines • This is the risk that the sample we have is located in one of the two tails of the underlying sampling distribution • Imagine we draw a second sample, this time of 40 UK adult cats, from the population of 10,000 • This time, we obtain a mean weight of 4.87 Kg, with a SD of 0.84 Kg • This SD is bigger than the SD of the small sample, which was 0.7, so in a sense this sample is showing greater variation Standard error and confidence intervals • You might expect that the SE and confidence interval will also be larger for this sample • But, because the SD formula involves dividing by the square root of the sample size it turns out that this is not the case • Standard error is 0.835 / square root 40 (6.32) = 0.13 • Confidence interval is 0.13 * 1.96 = 0.258 Kg either side of the mean • Confidence interval for the small sample was 0.43 Kg either side of the mean List of statistical terms for revision • Note that the technical meaning of terms in statistics is not always the same as the everyday meaning of the words. If you understand each of these concepts then you are well on the way to understanding statistics! • population • sample • random • normal distribution (also known as a bell curve or Gaussian distribution) • frequency • standard normal distribution • z score • sampling distribution • standard error • confidence interval