2. Random variables in psychology

One way to think of a random variable, X, is as the unknown outcome of a repeatable experiment that you are about to perform, where the outcome can be recorded as some numerical measurement. Once the experiment is performed, the unknown outcome becomes known and is observed to take some particular value X = x. A random variable X is entirely described by its probability distribution, from which we can derive the frequency with which any particular value x will occur. Because many experiments in psychology involve sampling individuals from some large population, we will often refer to the distribution of X as the population distribution.

2.1 Discrete random variables

We begin with the case where the random variable X is discrete. This means that the outcomes of the experiment are constrained to be one of a set of discrete points $S = \{x_1, x_2, x_3, \ldots\}$ (known as the range of X). The set S could be finite or infinite, although in most practical situations only finite sets will arise.

2.1.1 The probability mass function

A discrete random variable is completely defined by two things: the range (the set S above) and the probability mass function (p.m.f.). The probability mass function assigns a probability to each outcome and can be written as

$$f_X(x_i) = \Pr(X = x_i), \quad i = 1, 2, 3, \ldots$$

In practical terms, the p.m.f. tells us:

* the frequency with which a particular outcome would occur over a long sequence of identical experiments
* the appearance of the histogram that would be formed from the outcomes of a long sequence of identical experiments.

An event E in an experiment is any subset of the outcome set S. The p.m.f. gives us the probability of the event, Pr(E), simply by summing $f_X(x_i)$ over those outcomes that lie in the subset E.

We can also characterise random variables using the cumulative distribution function (the c.d.f.), $F_X(x)$. This gives the probability $\Pr(X \le x)$ for any number x. It is therefore equal to the sum of the $f_X(x_i)$ over those $x_i$ that are less than or equal to x. (We will tend not to use the c.d.f. too much in this course.)

2.1.2 Multivariate random variables and independence

Suppose now that you do an experiment in which two measurements (X, Y) are to be taken. Then we can define the bivariate probability mass function as

$$f_{(X,Y)}(x_i, y_j) = \Pr(X = x_i, Y = y_j)$$

(= "the frequency with which X = $x_i$ and Y = $y_j$"). This specifies the frequency with which the pair of outcomes $(x_i, y_j)$ will occur together over a long sequence of experiments. We say that X and Y are independent if

$$f_{(X,Y)}(x_i, y_j) = f_X(x_i)\, f_Y(y_j) \quad \text{for all } i \text{ and } j.$$

In practice, X and Y being independent means that the frequency with which $y_j$ occurs over the subsequence of experiments in which $x_i$ occurs is the same as in the entire sequence of experiments. These ideas generalise in a natural way to the situation where there are n random variables.

2.1.3 Expectation - mean and variance

For any random variable, X, let g(X) be a function of X. Then the expectation of g(X) is defined to be

$$E(g(X)) = f_X(x_1)\,g(x_1) + f_X(x_2)\,g(x_2) + \ldots + f_X(x_n)\,g(x_n).$$

In practice E(g(X)) tells you the average value of g(X) over a long sequence of experiments. The main expectations that we shall consider are the mean and variance of a random variable.

Let X denote a discrete random variable. Then the mean of X is the expectation of g(X) = X:

$$\mu_X = E(X) = f_X(x_1)\,x_1 + f_X(x_2)\,x_2 + \ldots + f_X(x_n)\,x_n.$$

The value of $\mu_X$:

* tells you the average value of X you should expect to get in a long sequence of random draws from X
* indicates a 'middling' value for the distribution.

The variance is a measure of how variable the values of X will be over repeated sampling. Formally it is the expectation of $g(X) = (X - \mu_X)^2$:

$$\sigma_X^2 = \mathrm{Var}(X) = E\big((X - \mu_X)^2\big) = f_X(x_1)(x_1 - \mu_X)^2 + f_X(x_2)(x_2 - \mu_X)^2 + \ldots + f_X(x_n)(x_n - \mu_X)^2.$$

The variance gives a natural measure of how 'spread out' samples from the distribution will be around a central value. The standard deviation is the square root of the variance and is denoted by $\sigma_X$.
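The definitions in 2.1.1 and 2.1.3 can be sketched in a few lines of Python. The p.m.f. used here (a fair six-sided die) and the function names are assumptions chosen purely for illustration; they are not part of the course material.

```python
import math

# Hypothetical p.m.f. for illustration only: a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E(g(X)): the sum of f_X(x_i) * g(x_i) over the range of X."""
    return sum(p * g(x) for x, p in pmf.items())


def event_probability(pmf, event):
    """Pr(E): sum the p.m.f. over the outcomes that lie in the event E."""
    return sum(p for x, p in pmf.items() if x in event)


def cdf(pmf, x):
    """F_X(x) = Pr(X <= x)."""
    return sum(p for xi, p in pmf.items() if xi <= x)


mean = expectation(pmf)                                   # g(X) = X
variance = expectation(pmf, lambda x: (x - mean) ** 2)    # g(X) = (X - mu)^2
sd = math.sqrt(variance)

print(f"mean = {mean:.3f}, variance = {variance:.3f}, sd = {sd:.3f}")
print(f"Pr(X is even) = {event_probability(pmf, {2, 4, 6}):.3f}")
print(f"F_X(2) = {cdf(pmf, 2):.3f}")
```

Passing a different function g to `expectation` gives any other expectation of interest, which mirrors the way the mean and variance are both defined as special cases of E(g(X)).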
People often prefer to report the standard deviation rather than the variance because it is measured in the same units as the measurements x.

2.1.4 Some examples of discrete random variables

The Binomial(n, p) random variable describes the number of successes, X, out of a set of n independent trials where the probability of success on any trial is p. It is therefore a natural one to use in experiments that involve measuring the performance of individuals on a repeated test with a dichotomous response. If X ~ Bin(n, p) then E(X) = np and Var(X) = np(1 - p).

The Poisson random variable, X ~ Poisson($\lambda$), is often used to count the number of events of a certain type that occur in a fixed period (e.g. industrial accidents in a factory each month, hypoglycaemic episodes experienced by someone with diabetes in a year). If X ~ Poisson($\lambda$) then E(X) = $\lambda$ and Var(X) = $\lambda$.

Note that the outcomes of very few complex, real-world systems and experiments conform exactly to these standard distributions. However, they may provide useful approximations.

In the above examples, n, p and $\lambda$ are examples of parameters. They are quantities which must be specified before the distribution is fully defined. Much of parametric statistics is concerned with how we can estimate these quantities from finite samples from the distribution.

2.2 Continuous random variables

In some experiments we may measure quantities that can naturally be considered to vary on a continuous scale (e.g. weight, height, reaction time). It will be useful to model these quantities using continuous random variables. A continuous random variable, X, is characterised by a probability density function (p.d.f.), $f_X(x)$. Roughly speaking, the p.d.f. gives the probability of an outcome falling in a small interval of width $\delta$ centred on a particular value x, as

$$\Pr(x - \delta/2 < X < x + \delta/2) \approx f_X(x)\,\delta.$$

More generally, Pr(a < X < b) is given by the area underneath the p.d.f. between a and b (see diagrams in lectures).
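The two statements about the p.d.f. can be checked numerically. The sketch below assumes, purely for illustration, a standard normal density (the normal distribution is described in 2.2.1) and uses `statistics.NormalDist` from the Python standard library; the particular values of x, the interval width and the limits a and b are arbitrary choices.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal, mean 0 and standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)

# Small-interval approximation: Pr(x - d/2 < X < x + d/2) is close to f_X(x) * d.
x, delta = 0.5, 0.01
exact_small = Z.cdf(x + delta / 2) - Z.cdf(x - delta / 2)
approx_small = Z.pdf(x) * delta

# Pr(a < X < b) as the area under the p.d.f., using the midpoint rule.
a, b, n_strips = -1.0, 2.0, 10_000
width = (b - a) / n_strips
area = sum(Z.pdf(a + (i + 0.5) * width) for i in range(n_strips)) * width
exact_area = Z.cdf(b) - Z.cdf(a)

print(f"small interval: exact {exact_small:.6f}, f(x)*delta {approx_small:.6f}")
print(f"Pr(a < X < b):  area {area:.6f}, from the c.d.f. {exact_area:.6f}")
```

Shrinking `delta` makes the first approximation tighter, which is the sense in which the density describes probability "per unit width" near x.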
As in the case of a discrete random variable, the p.d.f. indicates the shape of the histogram that would be formed from a large sample of random draws from the random variable. Note that, in reality, the outcome of any practical experiment could be modelled as a discrete random variable with a finite range, so that conceptually continuous random variables are not absolutely necessary in practice. (Why?) Nevertheless, it is useful to model the outcome of some experiments using continuous random variables.

2.2.1 Some common continuous random variables

The normal random variable (denoted $N(\mu, \sigma^2)$) is the most common continuous random variable that we shall work with. If X follows a normal distribution then its range is $(-\infty, +\infty)$ and its probability density function is given by

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

(See lectures for a sketch.) The distribution is specified by two parameters: $\mu$ (which turns out to be the mean) and $\sigma^2$, which is the variance. Many data sets you will encounter will look reasonably consistent with the normal distribution, in that the values of measured quantities will look reasonably symmetrically distributed around a central value, with the majority of observations close to the mean.

The standard normal distribution has mean 0 and variance 1. If $X \sim N(\mu, \sigma^2)$ then $(X - \mu)/\sigma \sim N(0, 1)$, so any normal random variable can easily be transformed to a standard normal.

Even when observations (e.g. scores from a cognitive test) are constrained to take integer values, it may be acceptable to assume that they come from a normal distribution when analysing them statistically. (See Howell, Chapter 3 for a summary of the normal random variable and its important properties.)

The Gamma($\alpha$, $\beta$) distribution. This is a flexible distribution which takes values in the range $(0, \infty)$. Its probability density function is given on the formula sheet and it has mean $\alpha\beta^{-1}$ and variance $\alpha\beta^{-2}$.
The parameter $\alpha$ is known as the shape parameter, since it determines the shape of the density function, and $\beta$ is known as the scale parameter. (Because the mean is $\alpha/\beta$, larger values of $\beta$ give smaller values of X; many texts therefore call $\beta$ a rate parameter.) An important special case is the $\chi^2_n$ distribution, which is also Gamma(n/2, 1/2). This distribution is very important and arises naturally whenever we add up the squares of independent standard normal random variables. In fact, if $Z_1, \ldots, Z_n$ are independent N(0, 1), then

$$Z_1^2 + Z_2^2 + \ldots + Z_n^2 \sim \chi^2_n.$$

2.3 Correlation and covariance

Let X and Y be two random variables. (For example, X might denote the age of a randomly chosen individual from the population and Y could be their score on a psychological test.) Now let a and b be two scalars. Then we have that

$$E(aX + bY) = aE(X) + bE(Y) = a\mu_X + b\mu_Y.$$

Also we have that E(X + a) = E(X) + a. It is not generally the case that E(XY) = E(X)E(Y). It is, however, true if X and Y are independent. This leads us to the definition of covariance:

$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - \mu_X\mu_Y.$$

If X and Y are independent, then Cov(X, Y) = 0. If Cov(X, Y) > 0 then it suggests that higher values of X tend to occur in combination with higher values of Y (e.g. higher scores associated with being older). Cov(X, Y) < 0 suggests that scores tend to decrease with age. An important result is the following:

$$\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y).$$

One problem with using covariance as a measure of the association between X and Y is that it depends on the scale on which these are measured (e.g. X = 'Age in months' will give a different covariance with score, Y, than X = 'Age in years'). Using the correlation avoids this problem. This is defined to be

$$\rho_{XY} = \frac{E\big((X - \mu_X)(Y - \mu_Y)\big)}{\sigma_X \sigma_Y}.$$

Correlations are constrained to take values between -1 and +1, with these extreme values corresponding to a perfect linear relationship between X and Y. Note that it is possible for X and Y to have zero correlation but, nevertheless, to be very strongly associated with each other. (See discussion in lecture.)

3. Estimating parameters and constructing confidence intervals

In Section 2 we introduced the idea of random variables as theoretical models for the outcomes of experiments that we may do. The probability function essentially gives us the shape of the histogram that we would observe given a sufficiently large number of samples. If we had a sufficiently large number of samples we could therefore identify the model distribution from which they were drawn (known as the population distribution) and identify the values of its parameters with a high degree of accuracy. In such situations you don't need statistics! Statistics is all about how we draw conclusions about a population distribution when the size of our sample is limited, and how we quantify the degree of certainty that we can have in these conclusions. We begin with a simple estimation problem.

3.1 Estimation of the mean and variance in the normal distribution

Suppose that we believe that the score on an IQ test of males aged 18 in Scotland follows a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. You take a random sample of size n from this population and measure their IQs. We denote these values by $x_1, x_2, \ldots, x_n$. How can you estimate $\mu$ and $\sigma^2$ from these data?

Answer: Natural estimators to use for $\mu$ and $\sigma^2$ are the sample mean and sample variance respectively. The sample mean is

$$\bar{x} = \frac{1}{n}\sum x_i,$$

and the sample variance is

$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\left\{\sum x_i^2 - \frac{1}{n}\Big(\sum x_i\Big)^2\right\}.$$

These are both quantities that can be calculated from the data. Suppose that our sample is of size 10 and that $\sum x_i = 973$ and $\sum x_i^2 = 97822$. Then we calculate the sample mean and sample variance to be $\bar{x} = 97.3$ and $s^2 = 349.9$ respectively, and our estimates for the population mean and population variance are $\hat{\mu} = 97.3$ and $\hat{\sigma}^2 = 349.9$.

Note that these values are estimates: we can't claim that they are precisely equal to the population mean and variance, because they are calculated from the data in one particular sample. If we carried out the experiment again we would obtain different estimates.
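The arithmetic of the IQ example in 3.1 can be reproduced from the two summary sums alone. This is a minimal sketch; the function name and the small raw data set used for the cross-check are made up for illustration and are not part of the example.

```python
import statistics


def sample_mean_and_variance(n, sum_x, sum_x_sq):
    """Sample mean and variance from n, the sum of x_i and the sum of x_i^2."""
    mean = sum_x / n
    variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)  # divisor is n - 1, not n
    return mean, variance


# Summary values quoted in the IQ example of 3.1.
x_bar, s_sq = sample_mean_and_variance(n=10, sum_x=973, sum_x_sq=97822)
print(f"sample mean = {x_bar:.1f}, sample variance = {s_sq:.1f}")

# Cross-check of the shortcut formula on hypothetical raw scores.
scores = [101, 94, 88, 120]
shortcut = sample_mean_and_variance(
    len(scores), sum(scores), sum(x * x for x in scores)
)
direct = (statistics.mean(scores), statistics.variance(scores))
print(shortcut, direct)
```

`statistics.variance` also uses the n - 1 divisor, so the two routes agree on any raw data set.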
(We will return to the question of how these estimates would vary between samples when we consider confidence intervals.)

3.2 Estimation of the parameters in a Gamma distribution

In a certain test of visual processing speed the time (in ms) taken by individuals to respond to a stimulus is believed to follow a Gamma($\alpha$, $\beta$) distribution. You take a random sample of n individuals and measure their response times. How can you estimate the values of $\alpha$ and $\beta$ from the data you obtain?

As above, we can use the sample mean and variance as estimates of the population mean and variance, which are $\alpha\beta^{-1}$ and $\alpha\beta^{-2}$ respectively. Setting $\alpha/\beta = \bar{x}$ and $\alpha/\beta^2 = s^2$ and solving gives the estimates

$$\hat{\beta} = \frac{\bar{x}}{s^2}, \quad \text{and} \quad \hat{\alpha} = \hat{\beta}\,\bar{x}.$$

Both 3.1 and 3.2 are examples of method-of-moments estimation of parameters. Given a random sample from the population, you estimate the population parameters by selecting them to define a distribution whose mean and variance match the sample mean and variance. Often this approach gives intuitively very sensible results. Other examples of method-of-moments estimation will be discussed in the class.

3.3 Quantifying uncertainty in estimates - confidence intervals

So far we have seen how we extract estimates of population parameters from data by matching sample moments with the population moments to be estimated. These estimates take the form of a single value and do not give any indication of how precise they are. They are known as point estimates.

Example: You wish to estimate the proportion of the population, p, who can roll their tongue. In a class of 10 students you find 4 who can and estimate p to be 0.4 (using the method of moments). Your colleague carries out a larger experiment and tests 1000 people randomly chosen from the population and finds 725 who can roll their tongue. Their estimate is 0.725. Where do you think the true proportion might lie in the range (0, 1)?
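One way to get a feel for the question just posed is to simulate both experiments many times and look at how much the estimate of p moves around. The sketch below is only an illustration: the "true" value p = 0.7, the number of repetitions and the random seed are all assumptions, not values from the example.

```python
import random
import statistics


def simulate_estimates(true_p, n, n_experiments, rng):
    """Repeat an experiment of size n and return the estimate X / n each time."""
    estimates = []
    for _ in range(n_experiments):
        successes = sum(rng.random() < true_p for _ in range(n))
        estimates.append(successes / n)
    return estimates


rng = random.Random(1)   # fixed seed so the run is repeatable
TRUE_P = 0.7             # assumed population proportion, for illustration only

small = simulate_estimates(TRUE_P, n=10, n_experiments=1000, rng=rng)
large = simulate_estimates(TRUE_P, n=1000, n_experiments=1000, rng=rng)

print(f"n = 10:   estimates range from {min(small):.2f} to {max(small):.2f}, "
      f"sd {statistics.stdev(small):.3f}")
print(f"n = 1000: estimates range from {min(large):.3f} to {max(large):.3f}, "
      f"sd {statistics.stdev(large):.3f}")
```

Under these assumptions the estimates from samples of size 10 are far more scattered than those from samples of size 1000, which is why an estimate of 0.4 from 10 people and one of 0.725 from 1000 people are not in serious conflict.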
Clearly it would be helpful to a third party who might be interested in the value of p if you and your colleague were able to associate some margin of error with your respective estimates. The construction of confidence intervals is one way in which we can do this.

3.3.1 Constructing a confidence interval for $\mu$ in $N(\mu, \sigma^2)$ where $\sigma^2$ is known (a somewhat artificial situation!)

Suppose you observe a random sample $x_1, \ldots, x_n$ from a normally distributed population with unknown mean $\mu$ and known variance $\sigma^2$. We can calculate an estimate of $\mu$ as $\hat{\mu} = \bar{x}$. We now wish to associate some measure of accuracy with this estimate. The need for this arises because the sample mean will actually vary from sample to sample. We therefore need to understand the sampling distribution of the sample mean, which describes how it will vary from experiment to experiment. This requires us to consider the sample mean as a random variable, i.e.

$$\bar{X} = \frac{1}{n}\sum X_i.$$

IMPORTANT FACT: It can be shown (see tutorial) that if $X_i \sim N(\mu, \sigma^2)$, then

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

Note that the variance is inversely proportional to the sample size. Note also that this means

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

It follows that in 95% of experiments (i.e. random samples of size n), $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ will lie between -1.96 and +1.96. After some algebra (see lecture) we can show that in 95% of experiments, the value of $\mu$ will lie between

$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}.$$

These two values define the lower and upper limits of a 95% confidence interval for $\mu$.

Note that a confidence interval is a random interval that will vary from experiment to experiment and will cover the true value of $\mu$ (in the long run) 95% of the time. For any given experiment with observed sample mean $\bar{x}$, one reports the observed confidence interval as

$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right).$$

Since this has been obtained via a recipe which gives an interval covering $\mu$ 95% of the time, you claim to be 95% confident that this interval contains the value of $\mu$.
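The long-run coverage interpretation can be illustrated by simulation. The following is a sketch under stated assumptions: the population values (mean 100, standard deviation 15), the sample size, the number of repeated experiments and the seed are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random


def ci_known_sigma(sample, sigma, z=1.96):
    """95% interval: x_bar -/+ 1.96 * sigma / sqrt(n), with sigma known."""
    n = len(sample)
    x_bar = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def coverage(mu, sigma, n, n_experiments, rng):
    """Proportion of repeated experiments whose interval contains the true mu."""
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = ci_known_sigma(sample, sigma)
        hits += lower <= mu <= upper
    return hits / n_experiments


rng = random.Random(2024)  # fixed seed so the run is repeatable
rate = coverage(mu=100.0, sigma=15.0, n=25, n_experiments=5000, rng=rng)
print(f"proportion of intervals covering mu: {rate:.3f}")
```

Each simulated experiment produces a different interval; it is the recipe, not any single interval, that has the 95% long-run coverage property.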
(See class exercise on coverage properties of confidence intervals and the construction of CIs for differing degrees of confidence.)

3.3.2 Constructing a confidence interval for $\mu$ in $N(\mu, \sigma^2)$ where $\sigma^2$ is unknown

This is much more realistic than 3.3.1. Again we assume that the data are a random sample $x_1, \ldots, x_n$ from a normally distributed population with unknown mean $\mu$ and unknown variance $\sigma^2$. To get a confidence interval for $\mu$ we can't just apply the above method, since it relied on knowing $\sigma^2$. Instead we replace the unknown $\sigma^2$ with the estimate $s^2$ from the sample. From standard statistical theory, we are able to identify the sampling distribution of the resulting quantity:

$$\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1},$$

where $t_{n-1}$ denotes the t distribution on (n - 1) degrees of freedom. (See lectures for a picture of what the probability density of the t distribution looks like.)

Going through a similar argument to 3.3.1, it follows that in 95% of experiments (i.e. random samples of size n), $(\bar{X} - \mu)/(s/\sqrt{n})$ will lie between $-t_{n-1}(2.5)$ and $+t_{n-1}(2.5)$, where $t_{n-1}(2.5)$ is the upper 2.5% point of the $t_{n-1}$ distribution. This leads to a 95% confidence interval of the form

$$\left(\bar{x} - t_{n-1}(2.5)\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1}(2.5)\frac{s}{\sqrt{n}}\right).$$

The value of $t_{n-1}(2.5)$ depends on the value of n. As the sample size increases it tends to 1.96 from above, reflecting the fact that the $t_{n-1}$ distribution looks very much like the N(0, 1) distribution when n is large.

An example: We now illustrate how the method of 3.3.2 can be applied in a real situation. Suppose a random sample of 8 students undertake a test of cognitive skill following 24 hours' abstinence from caffeine. One week later they repeat the test one hour after taking a certain dose of caffeine. The scores achieved by the students in the two instances are recorded in the following table.

Student   No caffeine   Caffeine   Difference
1         34            37          3
2         55            51         -4
3         23            28          5
4         34            33          1
5         41            49          8
6         42            41         -1
7         31            32          1
8         40            46          6

Assume that the results for the 8 students are independent of each other. Further assuming that the differences in performance due to caffeine use are normally distributed, find a 95% confidence interval for the population mean difference in score, $\mu_D$, when caffeine is used.

Let the differences in score be denoted by $d_1, \ldots, d_8$. For these data $\sum d_i = 19$ and $\sum d_i^2 = 153$. From these we calculate the sample mean and variance to be $\bar{d} = 2.375$ and $s^2 = 15.41$. Taking the square root of the sample variance, we obtain the sample standard deviation of the differences as s = 3.93. Now we need to obtain the upper 2.5% point of the t distribution on 7 (= n - 1) degrees of freedom. From the table this is 2.365. We then obtain our 95% confidence interval as

$$\left(2.375 - 2.365 \times \frac{3.93}{\sqrt{8}},\ 2.375 + 2.365 \times \frac{3.93}{\sqrt{8}}\right), \quad \text{or} \quad (-0.91,\ 5.66).$$

Criticise the design of this experiment. Why does it give little real evidence about the effect of caffeine on performance?

This method can be applied to give a confidence interval for $\mu$ even when the population distribution is not normal, so long as the sample size is sufficiently large. For sample sizes of 30 or so and above, the confidence interval can usually be treated as approximately valid whatever the exact form of the population distribution.

3.3.3 Constructing a confidence interval for $\sigma^2$ in the normal distribution

On occasions we will be interested in obtaining a confidence interval for the population variance, $\sigma^2$, from a random sample of size n from a $N(\mu, \sigma^2)$ distribution where $\mu$ and $\sigma^2$ are both unknown. This may be useful, for example, when we wish to consider whether the variability in one sample is different from another, or when the variability itself is an important characteristic. Confidence intervals can be constructed using the $\chi^2$ distribution as follows. The method is based on the fact that the sampling distribution of the sample variance $S^2$ is known from standard theory. Specifically we know that:

$$\frac{(n-1)S^2}{\sigma^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}.$$

(This is a standard result.)
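The standard result just quoted can be checked by simulation. This is a rough sketch rather than a proof: the population values, the number of simulated samples and the seed are assumptions made for illustration. It uses the fact from 2.2.1 that $\chi^2_{n-1}$ is Gamma((n-1)/2, 1/2), so its mean is n - 1 and its variance is 2(n - 1), together with the tabulated 2.5% and 97.5% points for 7 degrees of freedom (1.69 and 16.01) that are quoted later in this section.

```python
import random
import statistics


def scaled_sample_variance(n, mu, sigma, rng):
    """Draw one normal sample of size n and return (n - 1) * S^2 / sigma^2."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return (n - 1) * statistics.variance(sample) / sigma ** 2


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 8                   # same sample size as the caffeine example
values = [scaled_sample_variance(n, mu=0.0, sigma=2.0, rng=rng)
          for _ in range(20_000)]

sim_mean = statistics.mean(values)
sim_var = statistics.variance(values)
inside = sum(1.69 <= v <= 16.01 for v in values) / len(values)

print(f"simulated mean {sim_mean:.2f} (theory: n - 1 = {n - 1})")
print(f"simulated variance {sim_var:.2f} (theory: 2(n - 1) = {2 * (n - 1)})")
print(f"proportion between the tabulated points: {inside:.3f}")
```

The proportion falling between the two tabulated points should be close to 0.95, which is the property used to build the interval for the variance.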
Now this implies that in 95% of experiments $(n-1)S^2/\sigma^2$ will lie between the 97.5% and the 2.5% points of the $\chi^2_{n-1}$ distribution (see diagram in lectures). After some algebra we can show that in 95% of experiments $\sigma^2$ will lie in the interval

$$\left(\frac{(n-1)S^2}{\chi^2_{n-1}(2.5)},\ \frac{(n-1)S^2}{\chi^2_{n-1}(97.5)}\right).$$

When we carry out our experiment we can substitute the observed value of $s^2$ into this formula to get the observed confidence interval. For example, in the case where $\sigma^2$ is the population variance of the difference in score when caffeine is taken (3.3.2), we have n = 8, $s^2 = 15.41$, $\chi^2_7(97.5) = 1.69$ and $\chi^2_7(2.5) = 16.01$ (from tables). Substituting these values into the above formula gives the confidence interval (6.74, 63.8) for $\sigma^2$ and, on taking square roots, a 95% confidence interval for $\sigma$ of (2.60, 7.99). It is clear that with such a small sample we get poor accuracy in our estimate of $\sigma$, and this is reflected in the width of the confidence interval.

3.3.4 Confidence intervals for p in the binomial distribution

A common problem in the social sciences is that of estimating a binomial proportion p. That is, we take a random sample of size n from a population and count the number, X, in the sample who have a given property. From this we wish to estimate the proportion, p, of the entire population who share the property. So long as the total population size is large compared to n, then X ~ Bin(n, p). The most natural estimate to use for p is

$$\hat{p} = \frac{X}{n}.$$

We can get a confidence interval for p in the case where n is large by using the standard result that (approximately)

$$\frac{p - \hat{p}}{\sqrt{\hat{p}(1 - \hat{p})/n}} \sim N(0, 1).$$

This gives a 95% CI for p of

$$\left(\hat{p} - 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right),$$

where 2 is used as a convenient approximation to 1.96. This is the CI which is typically calculated for opinion polls when the proportion of voters who, e.g., intend to vote Labour is estimated.

For the tongue-rolling example at the start of 3.3, the observation that 725 out of 1000 could roll their tongue would naturally lead to an estimate of $\hat{p} = 0.725$ and a confidence interval of $0.725 \pm 0.028$. That is, we expect our estimate of the percentage of the population who can roll their tongue to be accurate to within around 3 percentage points.

4. Testing hypotheses and calculating p-values (1-sided and 2-sided tests)

While it is generally a good thing to calculate a confidence interval for a parameter, since this gives an indication of the range of possible values that it can take, many scientists and statisticians opt to calculate p-values to quantify the strength of evidence that the data carry about a null hypothesis. In the first section of the course we encountered a simple case of a null hypothesis, that a coin was fair, P(H) = 0.5. In psychology we are usually concerned with null hypotheses, termed $H_0$, that state, e.g.:

* that a given treatment has no effect in an experiment (e.g. the caffeine experiment of the last section)
* that there is no difference in the distribution of some measurement between two populations (e.g. maths scores between Edinburgh and Glasgow students).

Many statisticians do not like hypothesis testing on the grounds that we shouldn't expect any null hypothesis to be exactly true. If we can't reject it, this may reflect the fact that we didn't collect enough data to reject the hypothesis, rather than the fact that the hypothesis is true. Nevertheless, hypothesis testing remains an important part of statistical methodology which is much used in psychology.

As described in the coin-tossing example of Section 1, hypothesis testing involves several steps, including identifying a so-called test statistic (in that case the number of heads out of 20 tosses) which will be used to measure how far a given experimental outcome deviates from what would be expected if the null hypothesis were true.
The distribution of that test statistic under $H_0$ must be known, so that we can calculate the frequency with which values more extreme than the observed one will occur when $H_0$ is true.

Example. Consider the caffeine example of 3.3.2. Suppose that we wish to test the null hypothesis $H_0$: $\mu_D = \mu_0$. Then a suitable test statistic to measure how extreme our experimental results are compared to $H_0$ is the t-statistic,

$$\frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

(see lecture for discussion). For the case $H_0$: $\mu_D = 0$, which corresponds to no effect of caffeine on performance, and the data shown in 3.3.2, we would calculate this value to be $2.375/(3.93/\sqrt{8}) = 1.71$. Now under $H_0$ we know that $\bar{X}/(s/\sqrt{n})$ is distributed as $t_7$ (the t distribution on 7 degrees of freedom).

Whether we carry out a 1-sided or 2-sided test really depends on the alternative hypothesis that we are considering. If we wish to compare $H_0$ with the alternative hypothesis $H_1$: $\mu_D > 0$ (i.e. caffeine improves performance), then we would calculate the p-value

$$\Pr(t_7 \ge 1.71) \approx 0.065.$$

If we were comparing against the general alternative that $\mu_D$ might be greater or less than 0, then we would need to calculate our p-value as

$$\Pr(t_7 \ge 1.71) + \Pr(t_7 \le -1.71) = 2 \times 0.065 \approx 0.13.$$

In general you must always state which test statistic and which test you are going to use before you collect the data.
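The calculations for the caffeine example can be reproduced with the Python standard library. This is a sketch, not the course's own procedure: it takes the Difference column of the table in 3.3.2 exactly as tabulated, hard-codes the tabulated t value 2.365 for the confidence interval, and obtains tail probabilities by numerically integrating the t density with a simple midpoint rule (the helper names `t_density` and `t_upper_tail` are hypothetical).

```python
import math

# Difference column from the table in 3.3.2, as tabulated.
differences = [3, -4, 5, 1, 8, -1, 1, 6]

n = len(differences)
d_bar = sum(differences) / n
s_sq = sum((d - d_bar) ** 2 for d in differences) / (n - 1)
std_error = math.sqrt(s_sq / n)

# t-statistic for H0: mu_D = 0.
t_stat = (d_bar - 0.0) / std_error


def t_density(x, df):
    """Probability density of the t distribution on df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, n_strips=20_000):
    """Pr(T >= t) for t >= 0: one half minus the midpoint-rule area on (0, t)."""
    width = t / n_strips
    area = sum(t_density((i + 0.5) * width, df) for i in range(n_strips)) * width
    return 0.5 - area


p_one_sided = t_upper_tail(t_stat, df=n - 1)   # alternative: mu_D > 0
p_two_sided = 2 * p_one_sided                  # alternative: mu_D != 0

# 95% confidence interval using the tabulated upper 2.5% point of t_7.
T_CRIT = 2.365
ci = (d_bar - T_CRIT * std_error, d_bar + T_CRIT * std_error)

print(f"d_bar = {d_bar:.3f}, s^2 = {s_sq:.2f}, t = {t_stat:.2f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
print(f"95% CI for mu_D: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The two-sided p-value exceeds 0.05, which is consistent with the 95% confidence interval for the mean difference containing 0.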