Statistics for the Social Science

2. Random variables in psychology
One way to think of a random variable, X, is as the unknown outcome of a repeatable
experiment that you are about to perform, where the outcome can be recorded as some
numerical measurement. Once the experiment is performed, the unknown outcome becomes
known and is observed to take some particular value X = x. A random variable X is entirely
described by its probability distribution, from which we can derive the frequency with which
any particular value x will occur. Because many experiments in psychology involve
sampling individuals from some large population, we will often refer to the distribution of X
as the population distribution.
2.1 Discrete random variables.
We begin with the case where the random variable X is discrete. This means that the
outcomes from the experiment are constrained to be one of a set of discrete points
S = {x1, x2, x3, .... } (known as the range of X)
The set S could be finite or infinite, although in most practical situations, only finite sets will
arise.
2.1.1. The probability mass function. A discrete random variable is completely defined by
two things: the range (defined by S above) and the probability mass function (p.m.f.). The
probability mass function assigns a probability to each outcome and can be written as
fX(xi) = Pr( X = xi), i = 1, 2, 3, ....
In practical terms, the p.m.f. tells us:
• the frequency with which each particular outcome would occur over a long sequence of
identical experiments;
• the appearance of the histogram that would be formed from the outcomes of a long
sequence of identical experiments.
An event E in an experiment is any subset of the outcome set S. The p.m.f. gives us the
probability of the event, Pr(E), simply by summing up fX(xi) over those outcomes xi that lie
in the subset E.
We can also characterise random variables using the cumulative distribution function (the
c.d.f.), FX(x). This gives the probability Pr(X ≤ x) for any number x. It is therefore equal
to the sum of the fX(xi) for those xi that are less than or equal to x. (We will tend not to use
the c.d.f. too much in this course.)
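The two summation rules above (event probabilities and the c.d.f.) can be sketched in a few lines of Python. This is only an illustration: the fair six-sided die and the function names prob_event and cdf are assumptions made for the example, not part of the course material.

```python
from fractions import Fraction

# Hypothetical example: X is the score on one roll of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def prob_event(pmf, event):
    """Pr(E): sum the p.m.f. over the outcomes that lie in the event E."""
    return sum(p for x, p in pmf.items() if x in event)

def cdf(pmf, x):
    """F_X(x) = Pr(X <= x): sum the p.m.f. over outcomes not exceeding x."""
    return sum(p for xi, p in pmf.items() if xi <= x)

p_even = prob_event(pmf, {2, 4, 6})   # event "the score is even"
print(p_even)                         # 1/2
print(cdf(pmf, 4))                    # 2/3
```

Exact fractions are used so that the probabilities add up to 1 without rounding error.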
2.1.2 Multivariate random variables and independence
Suppose now that you do an experiment in which two measurements (X, Y) are to be taken.
Then we can define the bivariate probability mass function as
f(X, Y)(xi, yj) = Pr(X = xi, Y = yj) (= "the frequency with which X = xi and Y = yj")
This specifies the frequency with which the pair of outcomes (xi, yj) will occur together over a
long sequence of experiments.
We say that X and Y are independent if f(X, Y)(xi, yj) = fX(xi)fY(yj) for every pair (xi, yj).
In practice X and Y being independent means that the frequency with which yj occurs over the
subsequence of experiments in which xi occurs is the same as in the entire sequence of
experiments. These ideas generalise in a natural way to the situation where there are n
random variables.
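The independence condition can be checked mechanically for any small joint p.m.f. The sketch below is an illustration only: the two joint distributions and the helper names marginals and is_independent are hypothetical.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Recover f_X and f_Y by summing the joint p.m.f. over the other variable."""
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p
        fy[y] = fy.get(y, 0) + p
    return fx, fy

def is_independent(joint):
    """True if f_(X,Y)(x, y) = f_X(x) f_Y(y) for every pair (x, y)."""
    fx, fy = marginals(joint)
    return all(joint.get((x, y), 0) == fx[x] * fy[y] for x in fx for y in fy)

# Hypothetical example 1: two separate fair coin tosses, coded 0/1.
two_coins = {(x, y): Fraction(1, 4) for x, y in product([0, 1], repeat=2)}

# Hypothetical example 2: Y is always equal to X, so the pair is dependent.
copy_of_x = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(two_coins), is_independent(copy_of_x))  # True False
```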
2.1.3 Expectation - mean and variance
For any random variable, X, let g(X) be a function of X. Then the expectation of g(X) is
defined to be
E(g(X)) = fX(x1)g(x1) + fX(x2)g(x2) + ... + fX(xn)g(xn).
In practice E(g(X)) tells you the average value of g(X) over a long sequence of experiments.
The main expectations that we shall consider are the mean and variance of a random
variable.
Let X denote a discrete random variable. Then the mean of X is
μX = E(X) = fX(x1)x1 + fX(x2)x2 + ... + fX(xn)xn,
which is E(g(X)) with g(x) = x. The value of μX:
• tells you the average value of X you should expect to get in a long sequence of
random draws from X;
• indicates a 'middling' value for the distribution.
The variance is a measure of how variable the values of X will be over repeated sampling.
Formally it is defined as
σ² = Var(X) = E((X - μX)²) = fX(x1)(x1 - μX)² + fX(x2)(x2 - μX)² + ... + fX(xn)(xn - μX)²,
which is E(g(X)) with g(x) = (x - μX)².
The variance gives a natural measure of how 'spread out' samples from the distribution will
be around a central value. The standard deviation is the square root of the variance and is
denoted by σ. People often prefer to report the standard deviation rather than the variance
because it is measured in the same units as the measurements x.
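As a small worked illustration of these definitions, the mean, variance and standard deviation can be computed directly from a p.m.f. The distribution used here (the number of heads in two fair coin tosses) and the function name expectation are assumptions made for the example.

```python
import math

def expectation(pmf, g=lambda x: x):
    """E(g(X)): weight g(x) by the probability of each outcome and add up."""
    return sum(p * g(x) for x, p in pmf.items())

# Hypothetical p.m.f.: number of heads in two tosses of a fair coin.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

mean = expectation(pmf)                            # g(x) = x
var = expectation(pmf, lambda x: (x - mean) ** 2)  # g(x) = (x - mean)^2
sd = math.sqrt(var)

print(mean, var, round(sd, 4))
```

Here the mean is 1 and the variance is 0.5, so the standard deviation is about 0.71 heads.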
2.1.4. Some examples of discrete random variables
The Binomial(n, p) random variable describes the number of successes, X, out of a set of n
independent trials where the probability of success on any trial is p. It is therefore a natural
one to use in experiments that involve measuring the performance of individuals on a
repeated test with a dichotomous response. If X ~ Bin(n, p) then E(X) = np and var(X) =
np(1-p).
The Poisson random variable, X ~ Poisson(λ), is often used to count the number of events of
a certain type that occur in a fixed period (e.g. industrial accidents in a factory each month,
hypoglycaemic episodes experienced by someone with diabetes in a year). If X ~
Poisson(λ) then E(X) = λ and Var(X) = λ.
Note that the outcomes of very few complex, real-world systems and experiments conform to
these standard distributions. However, they may provide useful approximations.
In the above examples, n, p and λ are examples of parameters. They are quantities which
must be specified before the distribution is fully defined. Much of parametric statistics is
concerned with how we can estimate these quantities from finite samples from the
distribution.
2.2 Continuous random variables
In some experiments we may measure quantities that naturally can be considered to vary on a
continuous scale (e.g. weight, height, reaction time, …). It will be useful to model these
quantities using continuous random variables. A continuous random variable, X, is
characterised by a probability density function (p.d.f.), fX(x). Roughly speaking the p.d.f.
gives the probability of an outcome falling in a small interval of width δ centred on a
particular value x, as
Pr(x - δ/2 < X < x + δ/2) ≈ fX(x)δ.
More generally Pr(a < X < b) is given by the area underneath the p.d.f. between a and b (see
diagrams in lectures).
As in the case of a discrete random variable, the p.d.f. indicates the shape of the histogram
that would be formed from a large sample of random draws from a random variable.
Note that, in reality, the outcome of any practical experiment could be modelled as a discrete
random variable with a finite range, so that conceptually continuous random variables are not
absolutely necessary in practice. (Why?) Nevertheless, it is useful to model the outcome of
some experiments using continuous random variables.
2.2.1 Some common continuous random variables.
The normal random variable (denoted N(μ, σ²)) is the most common continuous random
variable that we shall work with. If X follows a normal distribution then its range is (-∞, +∞)
and its probability density function is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
(See lectures for a sketch.)
The distribution is specified by two parameters: μ (which turns out to be the mean) and σ²
which is the variance. Many data sets you will encounter will look reasonably consistent
with the normal distribution in that the values of measured quantities will look reasonably
symmetrically distributed around a central value with the majority of observations close to
the mean.
The standard normal distribution has mean 0 and variance 1. If X ~ N(μ, σ²) then (X - μ)/σ
~ N(0, 1), so any normal random variable can be easily transformed to a standard normal.
Even when observations (e.g. scores from a cognitive test) are constrained to take integer
values it may be acceptable to assume that they come from a normal distribution when
analysing them statistically.
(See Howell, Chapter 3 for a summary of the normal random variable and its important
properties.)
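The following sketch illustrates two of the points above using Python's standard-library NormalDist class: a probability for a general normal variable equals the corresponding probability for the standardised variable, and the probability of a small interval is roughly the density times the interval width. The N(100, 15²) score distribution and the interval endpoints are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)   # hypothetical test-score distribution
Z = NormalDist()                   # standard normal, N(0, 1)

# Pr(a < X < b) computed directly and via standardisation (x - mu) / sigma.
a, b = 85, 130
p_direct = X.cdf(b) - X.cdf(a)
p_standardised = Z.cdf((b - X.mean) / X.stdev) - Z.cdf((a - X.mean) / X.stdev)

# Small-interval approximation: Pr(x - d/2 < X < x + d/2) is about f_X(x) * d.
x, d = 110, 0.5
p_exact = X.cdf(x + d / 2) - X.cdf(x - d / 2)
p_approx = X.pdf(x) * d

print(round(p_direct, 4), round(p_standardised, 4))
print(round(p_exact, 6), round(p_approx, 6))
```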
The Gamma(α, β) distribution. This is a flexible distribution which takes values in the range
(0, ∞). Its probability density function is given on the formula sheet and it has mean αβ⁻¹ and
variance αβ⁻². The parameter α is known as the shape parameter, since it determines the
shape of the density function, and β is known as the scale parameter (strictly, in this
parameterisation it acts as a rate, or inverse scale).
An important special case is the χ²n distribution, which is also Gamma(n/2, 1/2).
The χ² distribution is very important and arises naturally whenever we add up the squares of
independent standard normal random variables. In fact, if Z1, …, Zn are independent
N(0, 1), then
$$Z_1^2 + Z_2^2 + \dots + Z_n^2 \sim \chi^2_n.$$
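A quick simulation can make this result more concrete. Since the chi-squared distribution on n degrees of freedom is Gamma(n/2, 1/2), its mean should be n and its variance 2n. The sketch below draws sums of squared standard normals and compares the simulated mean and variance with those values; the choice n = 5, the number of repetitions and the random seed are all arbitrary assumptions.

```python
import random
import statistics

random.seed(1)   # arbitrary seed, fixed only so the run is repeatable

def chi_square_draw(n):
    """One draw of Z_1^2 + ... + Z_n^2 with the Z_i independent N(0, 1)."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(n))

n, reps = 5, 20000
draws = [chi_square_draw(n) for _ in range(reps)]

sim_mean = statistics.fmean(draws)     # should be close to n
sim_var = statistics.variance(draws)   # should be close to 2n
print(round(sim_mean, 2), round(sim_var, 2))
```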
2.3. Correlation and covariance
Let X and Y be two random variables. (For example, X might denote the age of a randomly
chosen individual from the population and Y could be their score on a psychological test. )
Now let a and b be two scalars. Then we have that
E(aX + bY) = aE(X) + bE(Y) = aμX + bμY.
Also we have that E(X + a) = E(X) + a.
It is not generally the case that E(XY) = E(X)E(Y). It is, however, true if X and Y are
independent. This leads us to the definition of covariance
Cov(X, Y) = E((X - μX)(Y - μY)) = E(XY) - μXμY.
If X and Y are independent, then Cov(X, Y) = 0. If Cov(X, Y) > 0 then it suggests that higher
values of X tend to occur in combination with higher values of Y (e.g. higher scores
associated with being older). Cov(X, Y) < 0 suggests that higher values of X tend to occur
with lower values of Y (e.g. scores tend to decrease with age).
An important result is the following:
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y).
One problem with using covariance as a measure of the association between X and Y is that
it depends on the scale on which these are measured (e.g. X = 'Age in months' will give a
different covariance with score, Y, than X = 'Age in years'). Using the correlation avoids this
problem. This is defined to be
$$\rho_{XY} = \frac{E\big((X - \mu_X)(Y - \mu_Y)\big)}{\sigma_X \sigma_Y}.$$
Correlations are constrained to take values between -1 and +1 with these extreme values
corresponding to a perfect linear relationship between X and Y. Note that it is possible for X
and Y to have zero correlation but, nevertheless, to be very strongly associated with each
other.
(See discussion in lecture.)
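Both remarks (scale dependence of the covariance, and zero correlation despite strong association) can be illustrated with small joint p.m.f.s. Everything in the sketch below is hypothetical: the age/score distribution, the example Y = X², and the function name cov_and_corr.

```python
import math

def cov_and_corr(joint):
    """Covariance and correlation from a joint p.m.f. {(x, y): probability}."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in joint.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
    return cov, cov / math.sqrt(var_x * var_y)

# Hypothetical joint p.m.f. of (age in years, test score).
years = {(10, 20): 0.3, (10, 30): 0.1, (12, 30): 0.2, (14, 30): 0.1, (14, 40): 0.3}
# The same distribution with age re-expressed in months.
months = {(12 * x, y): p for (x, y), p in years.items()}

cov_years, rho_years = cov_and_corr(years)
cov_months, rho_months = cov_and_corr(months)
print(cov_years, cov_months)                       # covariance changes by a factor of 12
print(round(rho_years, 4), round(rho_months, 4))   # correlation is unchanged

# Y = X^2 is completely determined by X, yet the covariance is zero.
parabola = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}
cov_par, rho_par = cov_and_corr(parabola)
print(abs(cov_par) < 1e-12)
```

The last example has zero correlation because the relationship between X and Y, although exact, is not linear.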
3. Estimating parameters and constructing confidence intervals.
In Section 2 we’ve introduced the idea of random variables as theoretical models for the
outcome of experiments that we may do. The probability function essentially gives us the
shape of the histogram that we would observe given a sufficiently large number of samples.
If we had a sufficiently large number of samples we could therefore identify the model
distribution from which they were drawn (known as the population distribution) and identify
the value of parameters with a high degree of accuracy. In such situations you don’t need
statistics! Statistics is all about how we draw conclusions about a population distribution when
the size of our sample is limited and how we quantify the degree of certainty that we can
have in these conclusions. We begin with a simple estimation problem.
3.1 Estimation of the mean and variance in the normal distribution
Suppose that we believe that the score on an IQ test of males aged 18 in Scotland follows a
Normal distribution with unknown mean μ and unknown variance σ². You take a random
sample of size n from the set of 18 year olds and measure their IQs. We denote these values
by x1, x2, …, xn. How can you estimate μ and σ² from these data?
Answer: Natural estimators to use for μ and σ² are the sample mean and sample variance
respectively. The sample mean is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
and the sample variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\right).$$
These are both quantities that can be calculated from the data. Suppose that our sample is of
size 10 and that Σxi = 973 and Σxi² = 97822. Then we calculate the sample mean and
sample variance to be x̄ = 97.3 and s² = 349.9 respectively, and our estimates for the
population mean and population variance are μ̂ = 97.3 and σ̂² = 349.9.

Note these values are estimates - we can’t claim that they are precisely equal to the
population mean and variance because they are calculated from the data. If we carried out
the experiment again we would obtain different estimates. (We will return to the question of
how these estimates would vary between samples when we consider confidence intervals.)
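The arithmetic in this example can be reproduced from the two totals alone, using the form of the sample variance that involves only the sum and the sum of squares. The function name below is a hypothetical choice for this sketch.

```python
def mean_and_variance_from_sums(n, sum_x, sum_x2):
    """Sample mean and sample variance (divisor n - 1) from the totals."""
    mean = sum_x / n
    variance = (sum_x2 - sum_x**2 / n) / (n - 1)
    return mean, variance

# Totals quoted in the IQ example: n = 10, sum of x = 973, sum of x^2 = 97822.
x_bar, s2 = mean_and_variance_from_sums(10, 973, 97822)
print(round(x_bar, 1), round(s2, 1))   # 97.3 349.9
```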
3.2 Estimation of the parameters in a Gamma distribution.
In a certain test of visual processing speed the time (in ms) taken by individuals to respond to
a stimulus is believed to follow a Gamma(α, β) distribution. You take a random sample of n
individuals and measure their response times. How can you estimate the values of α and β
from the data you obtain? As above, we can use the sample mean and variance as
estimates of the population mean and variance, which here are αβ⁻¹ and αβ⁻² respectively.
Setting x̄ = α̂/β̂ and s² = α̂/β̂² and solving gives estimates for α and β as
$$\hat{\beta} = \frac{\bar{x}}{s^2}, \quad \text{and} \quad \hat{\alpha} = \hat{\beta}\,\bar{x}.$$
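A minimal sketch of these two estimators follows, applied to simulated response times. The true parameter values, the sample size and the seed are hypothetical. Note that Python's random.gammavariate takes a shape and a scale argument, so the scale passed in is 1/β to match the parameterisation used here (mean αβ⁻¹).

```python
import random
import statistics

def gamma_method_of_moments(sample):
    """Match the sample mean and variance to alpha/beta and alpha/beta^2."""
    x_bar = statistics.fmean(sample)
    s2 = statistics.variance(sample)   # sample variance, divisor n - 1
    beta_hat = x_bar / s2
    alpha_hat = beta_hat * x_bar
    return alpha_hat, beta_hat

random.seed(2)                          # arbitrary seed for repeatability
true_alpha, true_beta = 4.0, 0.02       # hypothetical values (mean 200 ms)
# gammavariate(shape, scale): the scale is 1 / beta in the notes' notation.
sample = [random.gammavariate(true_alpha, 1 / true_beta) for _ in range(5000)]

alpha_hat, beta_hat = gamma_method_of_moments(sample)
print(round(alpha_hat, 2), round(beta_hat, 4))
```

With a sample this large the estimates should land close to the values used to generate the data, though they will not match them exactly.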
Both 3.1 and 3.2 are examples of method-of-moments estimation of parameters. Given a
random sample from the population you estimate the population parameters by selecting
them to define a distribution whose mean and variance match the sample mean and variance.
Often this approach gives intuitively sensible results.
Other examples of method-of-moments estimation will be discussed in the class.
3.3 Quantifying uncertainty in estimates – confidence intervals.
So far we have seen how we extract estimates of population parameters from data by
matching sample moments with the population moments to be estimated. These estimates
take the form of a single value and do not give any indication of how precise they are. They
are known as point estimates.
Example: You wish to estimate the proportion of the population, p, who can roll their
tongue. In a class of 10 students you find 4 who can and estimate p to be 0.4 (using method
of moments). Your colleague carries out a larger experiment and tests 1000 people randomly
chosen from the population and finds 725 who can roll their tongue. Their estimate is 0.725.
Where do you think the true proportion might lie in the range (0, 1)? Clearly it would be helpful
to a third party who might be interested in the value of p if you and your colleague were able
to associate some margin of error with your respective estimates. The construction of
confidence intervals is one way in which we can do this.
3.3.1. Constructing a confidence interval for μ in N(μ, σ²) where σ² is known (a
somewhat artificial situation!).
Suppose you observe a random sample x1, …, xn from a normally distributed population with
unknown mean μ and known variance σ². We can calculate an estimate of μ as μ̂ = x̄. We
now wish to associate some measure of accuracy with this estimate. The need for this arises
because the sample mean will actually vary from sample to sample. We therefore need to
understand the sampling distribution of the sample mean, which describes how it will vary
from experiment to experiment. This requires us to consider the sample mean as a random
variable, i.e.
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
IMPORTANT FACT: It can be shown (see tutorial) that if the Xi are independent N(μ, σ²), then
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
Note that the variance is inversely proportional to the sample size. This means that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
It follows that in 95% of experiments (i.e. random samples of size n), (X̄ - μ)/(σ/√n) will lie
between -1.96 and +1.96.
After some algebra (see lecture) we can show that in 95% of experiments, the value of μ will
lie between X̄ - 1.96σ/√n and X̄ + 1.96σ/√n. These two values define the lower and upper
limits of a 95% confidence interval for μ.
Note that a confidence interval is a random interval that will vary from experiment to
experiment and will cover the true value of μ (in the long run) 95% of the time. For any
given experiment with observed sample mean x̄, one reports the observed confidence interval
as (x̄ - 1.96σ/√n, x̄ + 1.96σ/√n). Since this has been obtained via a recipe which gives an
interval covering μ 95% of the time, you claim to be 95% confident that this interval
contains the value of μ.
(See class exercise on coverage properties of confidence intervals, construction of CIs for
differing degrees of confidence.)
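The long-run coverage property can be explored by simulation. The sketch below repeatedly draws normal samples with a known standard deviation, builds the interval from the recipe above, and records how often it contains the true mean. The population values, sample size, number of repetitions and seed are hypothetical choices, and the function name z_interval is an assumption of this example.

```python
import math
import random
import statistics

def z_interval(sample, sigma, z=1.96):
    """Interval x_bar +/- z * sigma / sqrt(n) for a known standard deviation."""
    x_bar = statistics.fmean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width

random.seed(3)                                  # arbitrary seed
mu, sigma, n, reps = 100.0, 15.0, 25, 4000      # hypothetical settings

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    lower, upper = z_interval(sample, sigma)
    covered += lower <= mu <= upper             # True counts as 1

coverage = covered / reps
print(round(coverage, 3))                       # expected to be near 0.95
```

Replacing 1.96 by another percentage point of N(0, 1) in the z argument gives intervals with a different degree of confidence.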
3.3.2. Constructing a confidence interval for μ in N(μ, σ²) where σ² is unknown. This is
much more realistic than 3.3.1. Again we assume that the data are a random sample x1, …, xn
from a normally distributed population with unknown mean μ and unknown variance σ². To
get a confidence interval for μ we can't just apply the above method since it relied on
knowing σ². Instead we replace the unknown σ² with the estimate s² from the sample. From
standard statistical theory, we are able to identify the sampling distribution of the resulting
statistic:
$$\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1},$$
where tn-1 denotes the t distribution on (n-1) degrees of freedom. (See lectures for a picture
of what the probability density of the t distribution looks like.)
Going through a similar argument to 3.3.1, it follows that in 95% of experiments (i.e. random
samples of size n), (X̄ - μ)/(s/√n) will lie between -tn-1(2.5) and +tn-1(2.5), where tn-1(2.5)
denotes the upper 2.5% point of the tn-1 distribution. This leads to a 95%
confidence interval of the form
$$\left(\bar{x} - t_{n-1}(2.5)\frac{s}{\sqrt{n}},\; \bar{x} + t_{n-1}(2.5)\frac{s}{\sqrt{n}}\right).$$
The value of tn-1(2.5) depends on the value of n. As the sample size increases it tends to
1.96 from above, reflecting the fact that the tn-1 distribution looks very much like the N(0, 1)
distribution when n is large.
An example: We now illustrate how the method of 3.3.2 can be applied in a real situation.
Suppose a random sample of 8 students undertake a test of cognitive skill following 24
hours abstinence from caffeine. One week later they repeat the test one hour after taking a
certain dose of caffeine. The scores achieved by the students in the two instances are
recorded in the following table.
Student  No caffeine  Caffeine  Difference
1        34           37         3
2        55           51        -4
3        23           28         5
4        34           33         1
5        41           49         8
6        42           41        -1
7        31           32         1
8        40           46         6
Assume that the results for the 8 students are independent of each other. Further assuming
that the differences in performance due to caffeine use are normally distributed, find a 95%
confidence interval for the population mean difference in score, μD, when caffeine is used.
Let the differences in score be denoted by d1, ..., d8. Now for these data Σdi = 19 and Σdi² =
153. From these we calculate the sample mean and variance to be
d̄ = 2.375 and s² = 15.41.
Taking the square root of the sample variance we obtain the sample standard deviation for the
difference to be s = 3.93. Now we need to obtain the 2.5%-point of the t distribution on 7 (=
n-1) degrees of freedom. From the table this is 2.365. We then obtain our 95% confidence
interval as (2.375 - 2.365 × 3.93/√8, 2.375 + 2.365 × 3.93/√8) or (-0.91, 5.66).
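The calculation can be reproduced from the Difference column of the table. In the sketch below the critical value 2.365 is simply the tabulated figure quoted in the text rather than something computed by the code, and the variable names are choices made for this example.

```python
import math
import statistics

# 'Difference' column of the caffeine table.
differences = [3, -4, 5, 1, 8, -1, 1, 6]
t_crit = 2.365   # 2.5% point of t on 7 degrees of freedom, taken from tables

n = len(differences)
d_bar = statistics.fmean(differences)
s = statistics.stdev(differences)          # sample standard deviation (n - 1)
half_width = t_crit * s / math.sqrt(n)
ci = (d_bar - half_width, d_bar + half_width)

print(round(d_bar, 3), round(s, 2))
print(tuple(round(limit, 2) for limit in ci))
```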
Criticise the design of this experiment. Why does it give little real evidence about the effect
of caffeine on performance?
This method can be applied to give a confidence interval for μ even when the population
distribution is not normal, so long as the sample size is sufficiently large. For sample sizes of
30 or so, the confidence interval can usually be considered approximately valid regardless of
the exact nature of the population distribution.
3.3.3 Constructing a confidence interval for σ² in the normal distribution
On occasions we will be interested in obtaining a confidence interval for the population
variance, σ², from a random sample of size n from a N(μ, σ²) distribution where μ and σ² are
both unknown. This may be useful, for example, when we wish to consider whether the
variability in one sample is different from another, or when the variability itself is an
important characteristic. Confidence intervals can be constructed using the χ² distribution as
follows. The method is based on the fact that the sampling distribution of the sample
variance S² is known from standard theory.
Specifically we know that:
$$\frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}.$$
(This is a standard result.) Now this implies that in 95% of experiments (n-1)S²/σ² will lie
between the 97.5% and the 2.5% points of the χ²n-1 distribution. (See diagram in lectures.)
After some algebra we can show that in 95% of experiments σ² will lie in the interval
$$\left(\frac{(n-1)S^2}{\chi^2_{n-1}(2.5)},\; \frac{(n-1)S^2}{\chi^2_{n-1}(97.5)}\right).$$
When we carry out our experiment we can substitute the observed value of s² into this
formula to get the observed confidence interval. For example, in the case where σ² is the
population variance for the difference in score when caffeine is taken (3.3.2), we have that
n = 8, s² = 15.41, χ²7(97.5) = 1.69, χ²7(2.5) = 16.01 (from tables).
Substituting these values into the above formula gives the confidence interval (6.74, 63.8)
for σ² and (on taking the square root) we obtain a 95% confidence interval for σ as
(2.60, 7.99). It is clear with such a small sample that we get poor accuracy in our estimate
of σ, and this is reflected in the width of the confidence interval.
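The substitution can be checked with a few lines of arithmetic. The two chi-squared percentage points below are the tabulated values quoted in the text, not values computed by the code, and the variable names are choices made for this sketch.

```python
import math

n, s2 = 8, 15.41
chi2_upper = 16.01   # 2.5% point of chi-squared on 7 d.f. (from tables)
chi2_lower = 1.69    # 97.5% point of chi-squared on 7 d.f. (from tables)

# Interval for the variance: divide (n - 1) s^2 by the two percentage points.
var_ci = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)
# Interval for the standard deviation: take square roots of both limits.
sd_ci = tuple(math.sqrt(limit) for limit in var_ci)

print(tuple(round(v, 2) for v in var_ci))
print(tuple(round(v, 2) for v in sd_ci))
```

Note that the larger percentage point gives the lower limit, because it appears in the denominator.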
3.3.4. Confidence intervals for p in the binomial distribution
A common problem in the social sciences is that of estimating a binomial proportion p. That
is, we take a random sample of size n from a population and count the number, X, in the
sample who have a given property. From this we wish to estimate the proportion, p, of the
entire population who share the property. So long as the total population size is large
compared to n, then X ~ Bin(n, p). The most natural estimate to use for p is
p̂ = X/n.
We can get a confidence interval for p in the case where n is large by using the standard
result that (approximately)
$$\frac{p - \hat{p}}{\sqrt{\hat{p}(1-\hat{p})/n}} \sim N(0, 1).$$
This gives an approximate 95% CI for p, which is
$$\left(\hat{p} - 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right),$$
where the multiplier 2 is a convenient rounding of 1.96. This is the CI
which is typically calculated for opinion polls when the proportion of voters who e.g. intend
to vote Labour is estimated.
For the tongue-rolling example at the start of 3.3, the observation that 725 out of 1000 could
roll their tongue would naturally lead to an estimate of p̂ = 0.725, and a confidence interval
0.725 ± 0.028. That is, we expect our estimate of the percentage of the population who can
roll their tongue to be accurate to within around 3 percentage points.
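The margin of error for the two tongue-rolling studies can be compared with a short calculation. The function name proportion_ci is hypothetical. The interval for the class of 10 is shown only for contrast, since the normal approximation behind this formula assumes n is large and should not be trusted for such a small sample.

```python
import math

def proportion_ci(successes, n, multiplier=2.0):
    """Estimate p and the half-width of the approximate 95% interval."""
    p_hat = successes / n
    half_width = multiplier * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, half_width

p_large, hw_large = proportion_ci(725, 1000)   # colleague's larger study
p_small, hw_small = proportion_ci(4, 10)       # class of 10 (n is not large)

print(p_large, round(hw_large, 3))
print(p_small, round(hw_small, 3))
```

The much wider interval for the small class reflects how little ten observations say about p.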
4. Testing Hypotheses and calculating p-values (1-sided and 2-sided tests)
While it is generally a good thing to calculate a confidence interval for a parameter since this
gives an indication of the range of possible values that it can take, many scientists and
statisticians opt to calculate p-values to quantify the strength of evidence that the data carry
about a null hypothesis. In the first section of the course we encountered a simple case of a
null hypothesis that a coin was fair, P(H) = 0.5. In psychology we are usually concerned
with null hypotheses, termed H0, that state e.g.
• that a given treatment has no effect in an experiment (e.g. the caffeine experiment of
the last section);
• that there is no difference in the distribution of some measurement between two
populations (e.g. maths scores between Edinburgh & Glasgow students).
Many statisticians do not like hypothesis testing on the grounds that we shouldn't expect any
null hypothesis to be true. If we can't reject it, it reflects the fact that we didn't collect
enough data to reject the hypothesis, rather than the fact that the hypothesis is true.
Nevertheless, hypothesis testing remains an important part of statistical methodology which
is much used in psychology.
As described in the coin tossing example of section 1, hypothesis testing involves several
steps including identifying a so-called test statistic (in that case the number of heads out of 20
tosses) which will be used to measure how far a given experimental outcome deviates from
what would be expected if the null hypothesis were true. The distribution of that test statistic
must be known so that the frequency with which more extreme values than the current one
will occur when H0 is true can be calculated.
Example. Consider the caffeine example of 3.3.2. Suppose that we wish to test the null
hypothesis H0: μD = μ0. Then a suitable test statistic to measure how extreme our
experimental results are compared to H0 is the t-statistic (see lecture for discussion),
$$\frac{\bar{X} - \mu_0}{s/\sqrt{n}}.$$
For the case H0: μD = 0, which corresponds to no effect of caffeine on performance, and the
data shown in 3.3.2, we would calculate this value to be 1.71. Now under H0 we know that
(X̄ - μ0)/(s/√n) is distributed as t7 (t distribution on 7 degrees of freedom).
Whether we carry out a 1-sided or 2-sided test really depends on the alternative hypothesis
that we are considering. If we wish to compare H0 with the alternative hypothesis H1: μD > 0
(i.e. caffeine improves performance), then we would calculate a p-value
Pr(t7 ≥ 1.71) = 0.0655.
If we were comparing against the general alternative that μD might be greater or less than 0,
then we would need to calculate our p-value as
Pr(t7 ≥ 1.71) + Pr(t7 ≤ -1.71) = 2 × 0.0655 = 0.131.
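In practice these tail probabilities come from tables or statistical software. As a self-contained sketch, the code below computes the t-statistic from the summary values in 3.3.2 and approximates the tail area by integrating the t density numerically with Simpson's rule. The function names, the number of integration steps and the use of numerical integration are all choices made for this illustration.

```python
import math

def t_pdf(x, df):
    """Probability density of the t distribution on df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Pr(T >= t) for t >= 0: 0.5 minus the area under the density on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary values from the caffeine example: mean difference, s and n.
d_bar, s, n = 2.375, 3.93, 8
t_stat = (d_bar - 0) / (s / math.sqrt(n))   # null hypothesis: mean difference 0

p_one_sided = t_upper_tail(t_stat, n - 1)
p_two_sided = 2 * p_one_sided               # t distribution is symmetric about 0
print(round(t_stat, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```

As a check on the routine, t_upper_tail(2.365, 7) should return approximately 0.025, in agreement with the tabulated 2.5% point used in 3.3.2.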
In general you must always state which test statistic and which test you're going to do before
you collect the data.